Count distinct, group by, with rollup issue

I’m working on pulling a report and my numbers aren’t adding up. I am trying to pull the number of (unique) clients in 6 different age groups that had non-voided vouchers in a date range.

Here’s what I’ve got so far:

SELECT COUNT(DISTINCT V.clientId)
FROM ClientManagement_Client_Vouchers V
INNER JOIN ClientManagement_Client_Voucher_Items VI
    ON V.id = VI.voucherId
INNER JOIN ClientManagement_Clients C
    ON V.clientId = C.id
WHERE V.dateVoided IS NULL
AND V.dateCreated >= '2011-01-01'
AND V.dateCreated <= '2011-06-30'
GROUP BY 
    DATE_FORMAT(V.dateCreated, '%Y') - DATE_FORMAT(C.dob, '%Y') - (DATE_FORMAT(V.dateCreated, '00-%m-%d') < DATE_FORMAT(C.dob, '00-%m-%d')) < 5,
    
    (    DATE_FORMAT(V.dateCreated, '%Y') - DATE_FORMAT(C.dob, '%Y') - (DATE_FORMAT(V.dateCreated, '00-%m-%d') < DATE_FORMAT(C.dob, '00-%m-%d')) >= 5 AND 
        DATE_FORMAT(V.dateCreated, '%Y') - DATE_FORMAT(C.dob, '%Y') - (DATE_FORMAT(V.dateCreated, '00-%m-%d') < DATE_FORMAT(C.dob, '00-%m-%d')) <= 9
    ),
    
    (
        DATE_FORMAT(V.dateCreated, '%Y') - DATE_FORMAT(C.dob, '%Y') - (DATE_FORMAT(V.dateCreated, '00-%m-%d') < DATE_FORMAT(C.dob, '00-%m-%d')) >= 10 AND 
        DATE_FORMAT(V.dateCreated, '%Y') - DATE_FORMAT(C.dob, '%Y') - (DATE_FORMAT(V.dateCreated, '00-%m-%d') < DATE_FORMAT(C.dob, '00-%m-%d')) <= 18
    ),
    
    (
        DATE_FORMAT(V.dateCreated, '%Y') - DATE_FORMAT(C.dob, '%Y') - (DATE_FORMAT(V.dateCreated, '00-%m-%d') < DATE_FORMAT(C.dob, '00-%m-%d')) >= 19 AND 
        DATE_FORMAT(V.dateCreated, '%Y') - DATE_FORMAT(C.dob, '%Y') - (DATE_FORMAT(V.dateCreated, '00-%m-%d') < DATE_FORMAT(C.dob, '00-%m-%d')) <= 59
    ),
    
    (
        DATE_FORMAT(V.dateCreated, '%Y') - DATE_FORMAT(C.dob, '%Y') - (DATE_FORMAT(V.dateCreated, '00-%m-%d') < DATE_FORMAT(C.dob, '00-%m-%d')) >= 60 AND 
        DATE_FORMAT(V.dateCreated, '%Y') - DATE_FORMAT(C.dob, '%Y') - (DATE_FORMAT(V.dateCreated, '00-%m-%d') < DATE_FORMAT(C.dob, '00-%m-%d')) <= 64
    ),
    
    DATE_FORMAT(V.dateCreated, '%Y') - DATE_FORMAT(C.dob, '%Y') - (DATE_FORMAT(V.dateCreated, '00-%m-%d') < DATE_FORMAT(C.dob, '00-%m-%d')) >= 65

WITH ROLLUP

This gives the following rows:

186
186
161
161
343
1547
1547
1547
1887
186
186
186
186
2072
108
108
108
108
108
2180
73
73
73
73
73
73
2252

The bottom number is correct (2252 total unique clients for the criteria) - but none of the other numbers add up: for example, 186 + 161 = 347, not the 343 that the rollup is showing.
taking the ‘with rollup’ clause out, i’m left with simply:

186
161
1547
186
108
73

which adds up to 2261, not the correct number 2252.

Where am I going wrong?

that is the gosh-darnedest weirdest GROUP BY i’ve ever seen

can you explain it in words?

Ha - Rudy here is what I expect that to translate to in English, and what I expect that to do:

Select all the unique clients that had vouchers within a date range. Make sure each of those vouchers was not voided, and make sure they each had voucher items. From this array of unique clients, count the number that were less than 5 years old, between 6 and 9, … etc. at the time the voucher was created.

The long group by statements are just some SQL I was recommended to calculate a persons age from a certain date - so in this case I’m trying to get totals of how many people were in certain age groups at the time voucher was created.

Does that help at all?

let’s do this in two steps

first step is to eliminate the duplication caused by multiple voucher items per voucher, by using an EXISTS subquery

we also set the stage for the second step by calculating the age – only once :wink:

SELECT V.clientId
     , YEAR(V.dateCreated) - YEAR(C.dob) - 
         ( DATE_FORMAT(V.dateCreated, '00-%m-%d') < 
           DATE_FORMAT(C.dob, '00-%m-%d') )  AS age
  FROM ClientManagement_Client_Vouchers AS V
INNER 
  JOIN ClientManagement_Clients AS C
    ON C.id = V.clientId
 WHERE V.dateVoided IS NULL
   AND V.dateCreated >= '2011-01-01'
   AND V.dateCreated <= '2011-06-30'
   AND EXISTS ( SELECT 1
                  FROM ClientManagement_Client_Voucher_Items VI
                 WHERE VI.voucherId = V.id )

make sure this returns just one row per client

add additional columns to the SELECT clause temporarily to assure yourself the query is working correctly

Rudy - thank you very much for the help. I won’t be able to try this on my db until Monday morning, but I’ll post the results as soon as I can when I get to work. Thanks for taking the time - I appreciate it.

Hi Rudy - I was able to test it this morning. It is returning multiple rows per client (6,368 total rows) due to the lack of the distinct clause, but maybe that’s in part 2? Also, when I add the DISTINCT clause it returns 2,382 rows - not the correct number 2,252. When I remove the calculated age column from the select I get the correct number.

Edit: I discovered that the reason for the extra rows when adding the DISTINCT clause is that 130 clients had birthdays in between being issued multiple vouchers for that time frame - so they received vouchers when they were 2 different ages and the query is including an extra row for each client’s age

i am having a hard time wrapping my head around this improbable news

wait a sec, let me rephrase that

i understand that a person can have a birthday during the specified V.dateCreated range, i was just temporarily gobsmacked by the “when they were 2 different ages” part

so now i toss it back to you – what’s the salient fact here? the existence of the client uniquely on the report (in which case you need to settle on a single age for the client, e.g. max age in the date range), or is it the exact age at the time of the voucher (in which case the client will show up on the report more than once)?

Yeah - this is fun stuff, huh? I’ve been dealing with this &#*!$ on 20+ different reports for months, and raised all these issues to my client. They don’t seem to understand or care which duplicates are thrown out, or which age is used - so long as the bottom line numbers add up. So let me pick an option and rephrase what I’m looking for:

Select all the unique clients that had vouchers within a date range. Make sure each of those vouchers was not voided, and make sure they each had voucher items. From this array of unique clients, calculate their age using the most recent voucher issued to them within the date range, and find the number that were less than 5 years old, between 5 and 9, … etc.

that’s more like it :slight_smile:

unfortunately i can’t get to this until tomorrow, but if you’ve been at this for months, then an extra day shouldn’t hurt

Off Topic:

interesting that i should have said that, as i just finished posting the following as my facebook status:[indent]i can only please one person per day, and TODAY IS NOT YOUR DAY (tomorrow ain’t looking too good either)[/indent]:wink:

Holy crap - I think I cobbled together a working version:

SELECT    COUNT(DISTINCT V.clientId) AS ageCount,
        YEAR(CV.maxDate) - YEAR(C.dob) -
            (    DATE_FORMAT(CV.maxDate, '00-%m-%d') <
                DATE_FORMAT(C.dob, '00-%m-%d')
            ) AS age
FROM ClientManagement_Client_Vouchers V
INNER JOIN ClientManagement_Clients C
    ON V.clientId = C.id
INNER JOIN (
    SELECT V2.dateCreated AS maxDate, C2.id AS clientId
    FROM ClientManagement_Client_Vouchers V2
    INNER JOIN ClientManagement_Clients C2
        ON V2.clientId = C2.id
    WHERE V2.dateVoided IS NULL
    AND V2.dateCreated >= '2011-01-01'
    AND V2.dateCreated <= '2011-06-30'
    AND EXISTS (     SELECT 1
                    FROM ClientManagement_Client_Voucher_Items VI2
                    WHERE VI2.voucherId = V2.id
    )
    GROUP BY clientId
    ORDER BY maxDate DESC
) AS CV
    ON CV.clientId = C.id
WHERE V.dateVoided IS NULL
AND V.dateCreated >= '2011-01-01'
AND V.dateCreated <= '2011-06-30'
AND EXISTS (     SELECT 1
                FROM ClientManagement_Client_Voucher_Items VI
                WHERE VI.voucherId = V.id
)

GROUP BY
    age < 5,
    age >= 5 AND age <= 9,
    age >= 10 AND age <= 18,
    age >= 19 AND age <= 59,
    age >= 60 AND age <= 64,
    age >= 65
ORDER BY age ASC

I get:

ageCount     age
--------    ----
73             1
107         9
186         14
1547         36
157         62
182         66

Which adds up!

I don’t really like the redundancy in the INNER JOIN, but I’m not sure it can be helped?

okay, let me get back to your revised definition – calculate their age using the most recent voucher issued to them within the date range

this query, which i gave yesterday, results in multiple rows per client when a client has qualifying vouchers both before and after his birthday in the specified voucher date range –


SELECT V.clientId
     , YEAR(V.dateCreated) - YEAR(C.dob) - 
         ( DATE_FORMAT(V.dateCreated, '00-%m-%d') < 
           DATE_FORMAT(C.dob, '00-%m-%d') )  AS age
  FROM ClientManagement_Client_Vouchers AS V
INNER 
  JOIN ClientManagement_Clients AS C
    ON C.id = V.clientId
 WHERE V.dateVoided IS NULL
   AND V.dateCreated >= '2011-01-01'
   AND V.dateCreated <= '2011-06-30'
   AND EXISTS ( SELECT 1
                  FROM ClientManagement_Client_Voucher_Items VI
                 WHERE VI.voucherId = V.id )

notice that all qualifying vouchers are returned, but there is an age calculation applied to each one

based on your revised definition, to use the most recent voucher, all we have to do is take the MAX voucher date, but we will postpone the age calculation for a moment –


SELECT V.clientId
     , MAX(V.dateCreated) AS most_recent
  FROM ClientManagement_Client_Vouchers AS V
INNER 
  JOIN ClientManagement_Clients AS C
    ON C.id = V.clientId
 WHERE V.dateVoided IS NULL
   AND V.dateCreated >= '2011-01-01'
   AND V.dateCreated <= '2011-06-30'
   AND EXISTS ( SELECT 1
                  FROM ClientManagement_Client_Voucher_Items VI
                 WHERE VI.voucherId = V.id )
GROUP
    BY V.clientId

since you are only interested in counting clients in age brackets, and not on any of the voucher details, we can push this query down into a subquery, and that will allow us to calculate the age based on the most recent date –


SELECT clientId
     , YEAR(most_recent) - YEAR(dob) - 
         ( DATE_FORMAT(most_recent, '00-%m-%d') < 
           DATE_FORMAT(dob, '00-%m-%d') )  AS age
  FROM ( SELECT V.clientId
              , C.dob
              , MAX(V.dateCreated) AS most_recent
           FROM ClientManagement_Client_Vouchers AS V
         INNER 
           JOIN ClientManagement_Clients AS C
             ON C.id = V.clientId
          WHERE V.dateVoided IS NULL
            AND V.dateCreated >= '2011-01-01'
            AND V.dateCreated <= '2011-06-30'
            AND EXISTS ( SELECT 1
                           FROM ClientManagement_Client_Voucher_Items VI
                          WHERE VI.voucherId = V.id )
         GROUP
             BY V.clientId
              , C.dob ) AS s1

notice that i’ve added C.dob to the SELECT in the subquery, so that it is available to the outer query

it’s also in the GROUP BY clause, which isn’t actually necessary in MySQL, but it is in standard SQL, but that’s a different topic

anyhow, now we have a query which returns the client and the age, and we can now push this down into a subquery so that we can calculate age brackets –


SELECT COUNT(CASE WHEN age < 5
                  THEN age END ) AS '< 5'
     , COUNT(CASE WHEN age BETWEEN 5 AND 9
                  THEN age END ) AS '5 - 9'
     , COUNT(CASE WHEN age BETWEEN 10 AND 18
                  THEN age END ) AS '10 - 18'
     , COUNT(CASE WHEN age BETWEEN 19 AND 59
                  THEN age END ) AS '19 - 59'
     , COUNT(CASE WHEN age BETWEEN 60 AND 64
                  THEN age END ) AS '60 - 64'
     , COUNT(CASE WHEN age > 64
                  THEN age END ) AS '> 64'
     , COUNT(*) AS total
  FROM ( SELECT clientId
              , YEAR(most_recent) - YEAR(dob) - 
                  ( DATE_FORMAT(most_recent, '00-%m-%d') < 
                    DATE_FORMAT(dob, '00-%m-%d') )  AS age
           FROM ( SELECT V.clientId
                       , C.dob
                       , MAX(V.dateCreated) AS most_recent
                    FROM ClientManagement_Client_Vouchers AS V
                  INNER 
                    JOIN ClientManagement_Clients AS C
                      ON C.id = V.clientId
                   WHERE V.dateVoided IS NULL
                     AND V.dateCreated >= '2011-01-01'
                     AND V.dateCreated <= '2011-06-30'
                     AND EXISTS ( SELECT 1
                                    FROM ClientManagement_Client_Voucher_Items VI
                                   WHERE VI.voucherId = V.id )
                  GROUP
                      BY V.clientId
                       , C.dob ) AS s1 ) AS s2

please run that and please show me the output it produces

:cool:

Rudy - that works great. Output:

< 5     5 - 9   10 - 18     19 - 59     60 - 64     > 64    total
---     -----   -------     -------     -------     ----    ------
72      108     185         1545        156         186     2252

Some of the columns have slightly different numbers than the query I put together above, but I’m inclined to believe that yours is the more accurate :wink:

Thanks so much for your help - I definitely would not have come up with that on my own.