SitePoint Sponsor

User Tag List

Results 1 to 13 of 13
  1. #1
    SitePoint Guru aamonkey's Avatar
    Join Date
    Sep 2004
    Location
    kansas
    Posts
    953
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)

    count distinct, group by, with rollup issue

    I'm working on pulling a report and my numbers aren't adding up. I am trying to pull the number of (unique) clients in 6 different age groups that had non-voided vouchers in a date range.

    Here's what I've got so far:

    Code MySQL:
    SELECT COUNT(DISTINCT V.clientId)
    FROM ClientManagement_Client_Vouchers V
    INNER JOIN ClientManagement_Client_Voucher_Items VI
        ON V.id = VI.voucherId
    INNER JOIN ClientManagement_Clients C
        ON V.clientId = C.id
    WHERE V.dateVoided IS NULL
    AND V.dateCreated >= '2011-01-01'
    AND V.dateCreated <= '2011-06-30'
    GROUP BY 
        DATE_FORMAT(V.dateCreated, '%Y') - DATE_FORMAT(C.dob, '%Y') - (DATE_FORMAT(V.dateCreated, '00-%m-%d') < DATE_FORMAT(C.dob, '00-%m-%d')) < 5,
     
        (    DATE_FORMAT(V.dateCreated, '%Y') - DATE_FORMAT(C.dob, '%Y') - (DATE_FORMAT(V.dateCreated, '00-%m-%d') < DATE_FORMAT(C.dob, '00-%m-%d')) >= 5 AND 
            DATE_FORMAT(V.dateCreated, '%Y') - DATE_FORMAT(C.dob, '%Y') - (DATE_FORMAT(V.dateCreated, '00-%m-%d') < DATE_FORMAT(C.dob, '00-%m-%d')) <= 9
        ),
     
        (
            DATE_FORMAT(V.dateCreated, '%Y') - DATE_FORMAT(C.dob, '%Y') - (DATE_FORMAT(V.dateCreated, '00-%m-%d') < DATE_FORMAT(C.dob, '00-%m-%d')) >= 10 AND 
            DATE_FORMAT(V.dateCreated, '%Y') - DATE_FORMAT(C.dob, '%Y') - (DATE_FORMAT(V.dateCreated, '00-%m-%d') < DATE_FORMAT(C.dob, '00-%m-%d')) <= 18
        ),
     
        (
            DATE_FORMAT(V.dateCreated, '%Y') - DATE_FORMAT(C.dob, '%Y') - (DATE_FORMAT(V.dateCreated, '00-%m-%d') < DATE_FORMAT(C.dob, '00-%m-%d')) >= 19 AND 
            DATE_FORMAT(V.dateCreated, '%Y') - DATE_FORMAT(C.dob, '%Y') - (DATE_FORMAT(V.dateCreated, '00-%m-%d') < DATE_FORMAT(C.dob, '00-%m-%d')) <= 59
        ),
     
        (
            DATE_FORMAT(V.dateCreated, '%Y') - DATE_FORMAT(C.dob, '%Y') - (DATE_FORMAT(V.dateCreated, '00-%m-%d') < DATE_FORMAT(C.dob, '00-%m-%d')) >= 60 AND 
            DATE_FORMAT(V.dateCreated, '%Y') - DATE_FORMAT(C.dob, '%Y') - (DATE_FORMAT(V.dateCreated, '00-%m-%d') < DATE_FORMAT(C.dob, '00-%m-%d')) <= 64
        ),
     
        DATE_FORMAT(V.dateCreated, '%Y') - DATE_FORMAT(C.dob, '%Y') - (DATE_FORMAT(V.dateCreated, '00-%m-%d') < DATE_FORMAT(C.dob, '00-%m-%d')) >= 65
     
    WITH ROLLUP

    This gives the following rows:

    186
    186
    161
    161
    343
    1547
    1547
    1547
    1887
    186
    186
    186
    186
    2072
    108
    108
    108
    108
    108
    2180
    73
    73
    73
    73
    73
    73
    2252

    The bottom number is correct (2252 total unique clients for the criteria) - but none of the other numbers add up: for example, 186 + 161 = 347, not the 343 that the rollup is showing.
    taking the 'with rollup' clause out, i'm left with simply:

    186
    161
    1547
    186
    108
    73

    which adds up to 2261, not the correct number 2252.

    Where am I going wrong?
    aaron-fisher.com - PHP articles and more

  2. #2
    SQL Consultant gold trophysilver trophybronze trophy
    r937's Avatar
    Join Date
    Jul 2002
    Location
    Toronto, Canada
    Posts
    39,215
    Mentioned
    58 Post(s)
    Tagged
    3 Thread(s)
    that is the gosh-darnedest weirdest GROUP BY i've ever seen

    can you explain it in words?
    rudy.ca | @rudydotca
    Buy my SitePoint book: Simply SQL
    "giving out my real stuffs"

  3. #3
    SitePoint Guru aamonkey's Avatar
    Join Date
    Sep 2004
    Location
    kansas
    Posts
    953
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Ha - Rudy here is what I expect that to translate to in English, and what I expect that to do:

    Select all the unique clients that had vouchers within a date range. Make sure each of those vouchers was not voided, and make sure they each had voucher items. From this array of unique clients, count the number that were less than 5 years old, between 6 and 9, .... etc. at the time the voucher was created.

    The long group by statements are just some SQL I was recommended to calculate a persons age from a certain date - so in this case I'm trying to get totals of how many people were in certain age groups at the time voucher was created.

    Does that help at all?
    aaron-fisher.com - PHP articles and more

  4. #4
    SQL Consultant gold trophysilver trophybronze trophy
    r937's Avatar
    Join Date
    Jul 2002
    Location
    Toronto, Canada
    Posts
    39,215
    Mentioned
    58 Post(s)
    Tagged
    3 Thread(s)
    let's do this in two steps

    first step is to eliminate the duplication caused by multiple voucher items per voucher, by using an EXISTS subquery

    we also set the stage for the second step by calculating the age -- only once
    Code:
    SELECT V.clientId
         , YEAR(V.dateCreated) - YEAR(C.dob) - 
             ( DATE_FORMAT(V.dateCreated, '00-%m-%d') < 
               DATE_FORMAT(C.dob, '00-%m-%d') )  AS age
      FROM ClientManagement_Client_Vouchers AS V
    INNER 
      JOIN ClientManagement_Clients AS C
        ON C.id = V.clientId
     WHERE V.dateVoided IS NULL
       AND V.dateCreated >= '2011-01-01'
       AND V.dateCreated <= '2011-06-30'
       AND EXISTS ( SELECT 1
                      FROM ClientManagement_Client_Voucher_Items VI
                     WHERE VI.voucherId = V.id )
    make sure this returns just one row per client

    add additional columns to the SELECT clause temporarily to assure yourself the query is working correctly
    rudy.ca | @rudydotca
    Buy my SitePoint book: Simply SQL
    "giving out my real stuffs"

  5. #5
    SitePoint Guru aamonkey's Avatar
    Join Date
    Sep 2004
    Location
    kansas
    Posts
    953
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Rudy - thank you very much for the help. I won't be able to try this on my db until Monday morning, but I'll post the results as soon as I can when I get to work. Thanks for taking the time - I appreciate it.
    aaron-fisher.com - PHP articles and more

  6. #6
    SitePoint Guru aamonkey's Avatar
    Join Date
    Sep 2004
    Location
    kansas
    Posts
    953
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by r937 View Post
    let's do this in two steps

    first step is to eliminate the duplication caused by multiple voucher items per voucher, by using an EXISTS subquery

    we also set the stage for the second step by calculating the age -- only once
    Code:
    SELECT V.clientId
         , YEAR(V.dateCreated) - YEAR(C.dob) - 
             ( DATE_FORMAT(V.dateCreated, '00-%m-%d') < 
               DATE_FORMAT(C.dob, '00-%m-%d') )  AS age
      FROM ClientManagement_Client_Vouchers AS V
    INNER 
      JOIN ClientManagement_Clients AS C
        ON C.id = V.clientId
     WHERE V.dateVoided IS NULL
       AND V.dateCreated >= '2011-01-01'
       AND V.dateCreated <= '2011-06-30'
       AND EXISTS ( SELECT 1
                      FROM ClientManagement_Client_Voucher_Items VI
                     WHERE VI.voucherId = V.id )
    make sure this returns just one row per client

    add additional columns to the SELECT clause temporarily to assure yourself the query is working correctly
    Hi Rudy - I was able to test it this morning. It is returning multiple rows per client (6,368 total rows) due to the lack of the distinct clause, but maybe that's in part 2? Also, when I add the DISTINCT clause it returns 2,382 rows - not the correct number 2,252. When I remove the calculated age column from the select I get the correct number.

    Edit: I discovered that the reason for the extra rows when adding the DISTINCT clause is that 130 clients had birthdays in between being issued multiple vouchers for that time frame - so they received vouchers when they were 2 different ages and the query is including an extra row for each client's age
    aaron-fisher.com - PHP articles and more

  7. #7
    SQL Consultant gold trophysilver trophybronze trophy
    r937's Avatar
    Join Date
    Jul 2002
    Location
    Toronto, Canada
    Posts
    39,215
    Mentioned
    58 Post(s)
    Tagged
    3 Thread(s)
    Quote Originally Posted by aamonkey View Post
    so they received vouchers when they were 2 different ages and the query is including an extra row for each client's age
    i am having a hard time wrapping my head around this improbable news
    rudy.ca | @rudydotca
    Buy my SitePoint book: Simply SQL
    "giving out my real stuffs"

  8. #8
    SQL Consultant gold trophysilver trophybronze trophy
    r937's Avatar
    Join Date
    Jul 2002
    Location
    Toronto, Canada
    Posts
    39,215
    Mentioned
    58 Post(s)
    Tagged
    3 Thread(s)
    wait a sec, let me rephrase that

    i understand that a person can have a birthday during the specified V.dateCreated range, i was just temporarily gobsmacked by the "when they were 2 different ages" part

    so now i toss it back to you -- what's the salient fact here? the existence of the client uniquely on the report (in which case you need to settle on a single age for the client, e.g. max age in the date range), or is it the exact age at the time of the voucher (in which case the client will show up on the report more than once)?
    rudy.ca | @rudydotca
    Buy my SitePoint book: Simply SQL
    "giving out my real stuffs"

  9. #9
    SitePoint Guru aamonkey's Avatar
    Join Date
    Sep 2004
    Location
    kansas
    Posts
    953
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by r937 View Post
    wait a sec, let me rephrase that

    i understand that a person can have a birthday during the specified V.dateCreated range, i was just temporarily gobsmacked by the "when they were 2 different ages" part

    so now i toss it back to you -- what's the salient fact here? the existence of the client uniquely on the report (in which case you need to settle on a single age for the client, e.g. max age in the date range), or is it the exact age at the time of the voucher (in which case the client will show up on the report more than once)?
    Yeah - this is fun stuff, huh? I've been dealing with this &#*!$ on 20+ different reports for months, and raised all these issues to my client. They don't seem to understand or care which duplicates are thrown out, or which age is used - so long as the bottom line numbers add up. So let me pick an option and rephrase what I'm looking for:

    Select all the unique clients that had vouchers within a date range. Make sure each of those vouchers was not voided, and make sure they each had voucher items. From this array of unique clients, calculate their age using the most recent voucher issued to them within the date range, and find the number that were less than 5 years old, between 5 and 9, .... etc.
    aaron-fisher.com - PHP articles and more

  10. #10
    SQL Consultant gold trophysilver trophybronze trophy
    r937's Avatar
    Join Date
    Jul 2002
    Location
    Toronto, Canada
    Posts
    39,215
    Mentioned
    58 Post(s)
    Tagged
    3 Thread(s)
    that's more like it

    unfortunately i can't get to this until tomorrow, but if you've been at this for months, then an extra day shouldn't hurt

    Off Topic:

    interesting that i should have said that, as i just finished posting the following as my facebook status:
    i can only please one person per day, and TODAY IS NOT YOUR DAY (tomorrow ain't looking too good either)
    rudy.ca | @rudydotca
    Buy my SitePoint book: Simply SQL
    "giving out my real stuffs"

  11. #11
    SitePoint Guru aamonkey's Avatar
    Join Date
    Sep 2004
    Location
    kansas
    Posts
    953
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Holy crap - I think I cobbled together a working version:

    Code MySQL:
    SELECT    COUNT(DISTINCT V.clientId) AS ageCount,
            YEAR(CV.maxDate) - YEAR(C.dob) -
                (    DATE_FORMAT(CV.maxDate, '00-%m-%d') <
                    DATE_FORMAT(C.dob, '00-%m-%d')
                ) AS age
    FROM ClientManagement_Client_Vouchers V
    INNER JOIN ClientManagement_Clients C
        ON V.clientId = C.id
    INNER JOIN (
        SELECT V2.dateCreated AS maxDate, C2.id AS clientId
        FROM ClientManagement_Client_Vouchers V2
        INNER JOIN ClientManagement_Clients C2
            ON V2.clientId = C2.id
        WHERE V2.dateVoided IS NULL
        AND V2.dateCreated >= '2011-01-01'
        AND V2.dateCreated <= '2011-06-30'
        AND EXISTS (     SELECT 1
                        FROM ClientManagement_Client_Voucher_Items VI2
                        WHERE VI2.voucherId = V2.id
        )
        GROUP BY clientId
        ORDER BY maxDate DESC
    ) AS CV
        ON CV.clientId = C.id
    WHERE V.dateVoided IS NULL
    AND V.dateCreated >= '2011-01-01'
    AND V.dateCreated <= '2011-06-30'
    AND EXISTS (     SELECT 1
                    FROM ClientManagement_Client_Voucher_Items VI
                    WHERE VI.voucherId = V.id
    )
     
    GROUP BY
        age < 5,
        age >= 5 AND age <= 9,
        age >= 10 AND age <= 18,
        age >= 19 AND age <= 59,
        age >= 60 AND age <= 64,
        age >= 65
    ORDER BY age ASC

    I get:
    Code:
    ageCount     age
    --------    ----
    73             1
    107         9
    186         14
    1547         36
    157         62
    182         66
    Which adds up!

    I don't really like the redundancy in the INNER JOIN, but I'm not sure it can be helped?
    aaron-fisher.com - PHP articles and more

  12. #12
    SQL Consultant gold trophysilver trophybronze trophy
    r937's Avatar
    Join Date
    Jul 2002
    Location
    Toronto, Canada
    Posts
    39,215
    Mentioned
    58 Post(s)
    Tagged
    3 Thread(s)
    okay, let me get back to your revised definition -- calculate their age using the most recent voucher issued to them within the date range

    this query, which i gave yesterday, results in multiple rows per client when a client has qualifying vouchers both before and after his birthday in the specified voucher date range --
    Code:
    SELECT V.clientId
         , YEAR(V.dateCreated) - YEAR(C.dob) - 
             ( DATE_FORMAT(V.dateCreated, '00-%m-%d') < 
               DATE_FORMAT(C.dob, '00-%m-%d') )  AS age
      FROM ClientManagement_Client_Vouchers AS V
    INNER 
      JOIN ClientManagement_Clients AS C
        ON C.id = V.clientId
     WHERE V.dateVoided IS NULL
       AND V.dateCreated >= '2011-01-01'
       AND V.dateCreated <= '2011-06-30'
       AND EXISTS ( SELECT 1
                      FROM ClientManagement_Client_Voucher_Items VI
                     WHERE VI.voucherId = V.id )
    notice that all qualifying vouchers are returned, but there is an age calculation applied to each one

    based on your revised definition, to use the most recent voucher, all we have to do is take the MAX voucher date, but we will postpone the age calculation for a moment --
    Code:
    SELECT V.clientId
         , MAX(V.dateCreated) AS most_recent
      FROM ClientManagement_Client_Vouchers AS V
    INNER 
      JOIN ClientManagement_Clients AS C
        ON C.id = V.clientId
     WHERE V.dateVoided IS NULL
       AND V.dateCreated >= '2011-01-01'
       AND V.dateCreated <= '2011-06-30'
       AND EXISTS ( SELECT 1
                      FROM ClientManagement_Client_Voucher_Items VI
                     WHERE VI.voucherId = V.id )
    GROUP
        BY V.clientId
    since you are only interested in counting clients in age brackets, and not on any of the voucher details, we can push this query down into a subquery, and that will allow us to calculate the age based on the most recent date --
    Code:
    SELECT clientId
         , YEAR(most_recent) - YEAR(dob) - 
             ( DATE_FORMAT(most_recent, '00-%m-%d') < 
               DATE_FORMAT(dob, '00-%m-%d') )  AS age
      FROM ( SELECT V.clientId
                  , C.dob
                  , MAX(V.dateCreated) AS most_recent
               FROM ClientManagement_Client_Vouchers AS V
             INNER 
               JOIN ClientManagement_Clients AS C
                 ON C.id = V.clientId
              WHERE V.dateVoided IS NULL
                AND V.dateCreated >= '2011-01-01'
                AND V.dateCreated <= '2011-06-30'
                AND EXISTS ( SELECT 1
                               FROM ClientManagement_Client_Voucher_Items VI
                              WHERE VI.voucherId = V.id )
             GROUP
                 BY V.clientId
                  , C.dob ) AS s1
    notice that i've added C.dob to the SELECT in the subquery, so that it is available to the outer query

    it's also in the GROUP BY clause, which isn't actually necessary in MySQL, but it is in standard SQL, but that's a different topic

    anyhow, now we have a query which returns the client and the age, and we can now push this down into a subquery so that we can calculate age brackets --
    Code:
    SELECT COUNT(CASE WHEN age < 5
                      THEN age END ) AS '< 5'
         , COUNT(CASE WHEN age BETWEEN 5 AND 9
                      THEN age END ) AS '5 - 9'
         , COUNT(CASE WHEN age BETWEEN 10 AND 18
                      THEN age END ) AS '10 - 18'
         , COUNT(CASE WHEN age BETWEEN 19 AND 59
                      THEN age END ) AS '19 - 59'
         , COUNT(CASE WHEN age BETWEEN 60 AND 64
                      THEN age END ) AS '60 - 64'
         , COUNT(CASE WHEN age > 64
                      THEN age END ) AS '> 64'
         , COUNT(*) AS total
      FROM ( SELECT clientId
                  , YEAR(most_recent) - YEAR(dob) - 
                      ( DATE_FORMAT(most_recent, '00-%m-%d') < 
                        DATE_FORMAT(dob, '00-%m-%d') )  AS age
               FROM ( SELECT V.clientId
                           , C.dob
                           , MAX(V.dateCreated) AS most_recent
                        FROM ClientManagement_Client_Vouchers AS V
                      INNER 
                        JOIN ClientManagement_Clients AS C
                          ON C.id = V.clientId
                       WHERE V.dateVoided IS NULL
                         AND V.dateCreated >= '2011-01-01'
                         AND V.dateCreated <= '2011-06-30'
                         AND EXISTS ( SELECT 1
                                        FROM ClientManagement_Client_Voucher_Items VI
                                       WHERE VI.voucherId = V.id )
                      GROUP
                          BY V.clientId
                           , C.dob ) AS s1 ) AS s2
    please run that and please show me the output it produces

    rudy.ca | @rudydotca
    Buy my SitePoint book: Simply SQL
    "giving out my real stuffs"

  13. #13
    SitePoint Guru aamonkey's Avatar
    Join Date
    Sep 2004
    Location
    kansas
    Posts
    953
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Rudy - that works great. Output:

    Code:
    < 5     5 - 9   10 - 18     19 - 59     60 - 64     > 64    total
    ---     -----   -------     -------     -------     ----    ------
    72      108     185         1545        156         186     2252
    Some of the columns have slightly different numbers than the query I put together above, but I'm inclined to believe that yours is the more accurate

    Thanks so much for your help - I definitely would not have come up with that on my own.
    aaron-fisher.com - PHP articles and more


Bookmarks

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •