SitePoint Sponsor

User Tag List

Results 1 to 9 of 9
  1. #1
    SitePoint Enthusiast
    Join Date
    Mar 2002
    Location
    Italy
    Posts
    98
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)

    Efficient table design and query for very particular case

    Hi,
    I have long been a fan of SPFs and have posted numerous questions all receiving very helpful answers. So I will try something more challenging this time. Now, before anybody thinks I am throwing a theoretical problem, this is not the case.

    Let's imagine we have a very large population (approx 1,000,000) of which we know the first name and the age. All individual population units can be part of no group or up to 10 different groups each of which can be between 2 and 20 members large. We know to which groups each unit of the population belongs to, if any. To further complicate the situation the population is growing daily.

    I need to query this dataset to find groups in which at least three memebrs with a specific name have an age of x, y and z years plus or minus w. So for example I need to find groups in which there is a John who is 20 (+/- 5 years), an Andrew who is 36 (+/- 5 years) and a Lisa who is 7 (+/- 5 years). The group itself can be larger, but this is irrelevant.

    The question is: what is the most efficient way of organising this data and what is the most efficient way of querying it?

    I really appreciate any help you can give me.


    Thanks,
    Adrien

  2. #2
    SitePoint Enthusiast
    Join Date
    Mar 2002
    Location
    Italy
    Posts
    98
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    The best I have managed to come up with is to have two tables.
    One which contains all the units in the population, their age and the group they belong to.
    The second table contains all possible 3 people permutations of the members forming that specific group.

    CREATE TABLE population (
    `name` varchar(30) NOT NULL,
    age smallint(6) NOT NULL,
    group_id int(11) NOT NULL
    ) ENGINE=MyISAM DEFAULT CHARSET=latin1;



    CREATE TABLE groups (
    name_1 varchar(25) NOT NULL,
    name_2 varchar(25) NOT NULL,
    name_3 varchar(25) NOT NULL,
    age_1 smallint(5) unsigned NOT NULL,
    age_2 smallint(5) unsigned NOT NULL,
    age_3 smallint(5) unsigned NOT NULL,
    group_id int(10) unsigned NOT NULL,
    KEY name_1 (name_1),
    KEY name_2 (name_2),
    KEY name_3 (name_3),
    KEY age_1 (age_1),
    KEY age_2 (age_2),
    KEY age_3 (age_3)
    ) ENGINE=MyISAM DEFAULT CHARSET=latin1;


    I would then query the second table with a query like:
    SELECT group_id FROM groups WHERE name_1 = 'John' AND age_1 BETWEEN 15 AND 25 AND name_2 = 'Andrew' AND age_2 BETWEEN 31 AND 41 AND name_3 = 'Lisa' AND age_3 BETWEEN 2 AND 12

    I do however feel that is is greatly inefficient: any ideas?

  3. #3
    SitePoint Wizard bronze trophy
    Join Date
    Jul 2006
    Location
    Augusta, Georgia, United States
    Posts
    4,194
    Mentioned
    17 Post(s)
    Tagged
    4 Thread(s)
    Couldn't you just use your first table and use a single query to dynamically calculate the group?

    Code SQL:
    SELECT
          CASE 
             WHEN p.name = 'John' AND p.age BETWEEN 15 AND 25
             THEN 'a'
     
             WHEN p.name = 'Andrew' AND p.age BETWEEN 31 AND 41
             THEN 'b'
     
             ELSE 'c' END grp
      FROM
          population p
     WHERE
          (p.name = 'John' AND p.age BETWEEN 15 AND 25)
        OR
          (p.name = 'Andrew' AND p.age BETWEEN 31 AND 41)
        OR
          (p.name = 'Lisa' AND p.age BETWEEN 2 AND 12)
     GROUP
        BY
         grp
    HAVING
         COUNT(*) > 2

    To me it doesn't make sense to use an actual table for groups since age is a dynamic property of a person. That seems better left to be resolved in the query based on what age range and person name you would like to filter.
    The only code I hate more than my own is everyone else's.

  4. #4
    SitePoint Enthusiast
    Join Date
    Mar 2002
    Location
    Italy
    Posts
    98
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Hi oddz,
    Thank you very much for your quick reply.
    I have to admit that your SQL is a little too advanced for me so I'm not sure I follow it. However, following your line of thought I've packaged a query to act straight on the first table. I however unsure if my or your query would be more efficient (ie use less resources):
    Code MySQL:
    SELECT d.group_id FROM (
        SELECT b.group_id FROM (
            SELECT group_id FROM population 
            WHERE name = 'Lisa' 
            AND age BETWEEN 2 AND 12
        ) a, population b 
        WHERE  name = 'Andrew' 
        AND age BETWEEN 31 AND 41 
        AND a.group_id = b.group_id
    ) c, population d 
    WHERE name = 'John' 
    AND age BETWEEN 15 
    AND 25 AND c.group_id = d.group_id


    Thanks,
    Adrien

  5. #5
    SitePoint Wizard bronze trophy
    Join Date
    Jul 2006
    Location
    Augusta, Georgia, United States
    Posts
    4,194
    Mentioned
    17 Post(s)
    Tagged
    4 Thread(s)
    I don't think that query does what you want it to do. That is considering that the most inner query results in selecting only row with name equal to Lisa between age 2 and 12. You than attempt to filter on name Andrew when all there will be are Lisa's – make sense?

    In regards to optimization the query I showed you should be fine. If it really becomes an issue you can do some caching server side or something to store the flat result set eliminating the database trip altogether once the data has been resolved once.

    The whole philosophy with the query I showed you is to remove all the people that you don't want through the where clause. Once that filter has been applied you will be left with all the people but won't have any "groups". Creating groups is than done by associating each row with a imaginary group. That is done so that the aggregate filter can than be applied through use of group by on the calculated column used to identify a group of people.
    The only code I hate more than my own is everyone else's.

  6. #6
    SitePoint Enthusiast
    Join Date
    Mar 2002
    Location
    Italy
    Posts
    98
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Hi oddz,
    Once more thank you for your help. As i mentioned I am not familiar with CASE and WHEN. Would you be able to run me through your query or point me to a resource that explains how these instructions work please?


    Thanks,
    Adrien

  7. #7
    SitePoint Wizard bronze trophy
    Join Date
    Jul 2006
    Location
    Augusta, Georgia, United States
    Posts
    4,194
    Mentioned
    17 Post(s)
    Tagged
    4 Thread(s)
    Case is just a simple switch statement.

    Broken up into simple language the case statement provided in the query means:

    When name equals John and Johns age is 15 – 25 set this columns value equal to 'a'
    Otherwise if name equals Andrew and Andrews age is 31 – 41 set this columns value equal to 'b'
    If none of those conditions are matched assume the person is Lisa with age of 2–12 and set the columns value equal to 'c'.

    The case statement is going to applied after the where clause. So the where the clause is going to eliminate all the people who don't match any of the criteria. The case statement is than used to place each person into the appropriate group. That eliminates the need for another table and makes it possible to filter on the aggregate calculation using a having clause.
    The only code I hate more than my own is everyone else's.

  8. #8
    SitePoint Enthusiast
    Join Date
    Mar 2002
    Location
    Italy
    Posts
    98
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Hi oddz,
    Does your query take into account that there are thousands of names in the "population" table and that the trhee I mentioned are just a fraction of them?


    Adrien

  9. #9
    SitePoint Wizard bronze trophy
    Join Date
    Jul 2006
    Location
    Augusta, Georgia, United States
    Posts
    4,194
    Mentioned
    17 Post(s)
    Tagged
    4 Thread(s)
    Yeah, I mean its a pretty simple query. You shouldn't have much trouble with it.

    Here is another rearrangement that does the same thing:

    Code SQL:
    SELECT
          CASE
     
             WHEN p.name = 'John' AND p.age BETWEEN 15 AND 25
             THEN 1
     
             WHEN p.name = 'Andrew' AND p.age BETWEEN 31 AND 41
             THEN 2
     
             WHEN p.name = 'Lisa' AND p.age BETWEEN 2 AND 12
             THEN 3
     
           ELSE 0 END grp
      FROM
         population
     WHERE
         p.name IN ('John','Andrew','Lisa')
       AND
         p.age BETWEEN 2 AND 41
     GROUP
        BY
         grp
    HAVING
         grp <> 0
       AND
         COUNT(*) > 2

    That one is probably more scalable in the end.

    The largest difference in this case that the where clause doesn't eliminate all possible non matches. Instead all rows not matching any criteria are placed into a group that is later removed via the having filter. The obvious advantage of this is that you will only every need to have two conditions in the where clause. The disadvantage is that the case statement is going to cycling through useless data. That shouldn't be much of an issue though.
    The only code I hate more than my own is everyone else's.


Bookmarks

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •