SitePoint Sponsor

User Tag List

Results 1 to 18 of 18
  1. #1
    SitePoint Wizard
    Join Date
    May 2003
    Location
    Berlin, Germany
    Posts
    1,829
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)

    How to optimize this query

    Hey gents,

    I have a problem making a query fast enough. So if you guys could take some time to read through this and think of ways to make it quicker, that would be much appreciated.

    The setup
    -----------

    The user can specify several topics as their interests.
    A question is tagged with many topics.

    The goal is to display the questions to the user that are tagged with topics that are within his interests.

    To make it a little more difficult, topics are laid out as a polyhierarchic tree structure. That means one topic can have several parent topics. We need to show the user all questions tagged with the topics in his interests and all of their children.


    An example
    -------------

    Topic Tree structure:

    Computers -> Programming -> Web -> PHP -> CakePHP
    Computers -> Programming -> Web -> Frameworks -> CakePHP

    I have added "Web" to my interests. Now I need to display all questions that are tagged with "Web", "PHP", "Frameworks" or "CakePHP".


    Tables
    -------

    topics -> id, title
    parent_topics -> id, parent_id, topic_id
    interests -> id, user_id, topic_id // not parent_topic_id, as we are interested in a specific topic and do not care where it is in the tree

    questions -> id, title
    questions_topics -> id, question_id, topic_id


    My solution so far
    --------------------

    I have pre-computed and cached all topic children relationships in memory, so that is not a problem.
    Given a user I know all of his interests and all of their children in a matter of a couple milliseconds.

    Now having this (huge) list of topic_id's I need to find the questions for them.

    So what I do now is:

    select question_id from questions_topics where topic_id IN ([huge_list_of_ids]);

    This query is very slow, because the questions_topics table has 24k rows and the list of topic_ids that I feed to it can exceed several hundreds. If I reduce the number of topic_ids in the IN clause, the query runs in ~20ms as opposed to 800ms as it is now.

    The next query would be:

    select [fields] from questions where id IN ([found_question_ids]) AND deleted = 0 and published = 1 order by last_updated desc limit 0, 20;

    That is also pretty slow as the list of question_ids is big. So if you guys have a suggestion for that as well, that would be great.


    Ways to improve
    -------------------

    1. I tried indeces on questions_topics.topic_id and (questions_topics.question_id, questions_topics.topic_id). Both make the query even slower (significantly even).

    2. I could save all relationships of a question to its topics AND all of their children in the questions_topics table.

    So when a question is tagged with "CakePHP", I would have relations for it for "CakePHP", "PHP", "Frameworks", "Web", "Programming" and "Computers". I could then make the query much faster by only doing:

    select question_id from questions_topics where topic_id IN ([just_the_topics_ids_in_my_interests_without_children]);


    Since the tree is quite big, this would mean several hundred relationships saved in the DB for each question, making the table quite big and stuffed with redundant data. Plus I'd have to recalculate all of these relationships when someone changes the tree structure. I don't want to go there if not necessary ...

    3. Use something else altogether, like Redis?

    4. Your idea here


    Thank you for your time.

  2. #2
    SQL Consultant gold trophysilver trophybronze trophy
    r937's Avatar
    Join Date
    Jul 2002
    Location
    Toronto, Canada
    Posts
    39,015
    Mentioned
    53 Post(s)
    Tagged
    2 Thread(s)
    Quote Originally Posted by DarkAngelBGE View Post
    1. I tried indeces on questions_topics.topic_id and (questions_topics.question_id, questions_topics.topic_id). Both make the query even slower (significantly even).
    huh? that's counter-intuitive for sure

    please do an EXPLAIN on the query, and a SHOW CREATE TABLE on the table

    i'm assuming that despite the fact that you chose to post in the Databases forum rather than the MySQL forum, this is still a mysql problem, as i see you are using LIMIT
    r937.com | rudy.ca | Buy my SitePoint book: Simply SQL
    "giving out my real stuffs"

  3. #3
    SitePoint Wizard
    Join Date
    May 2003
    Location
    Berlin, Germany
    Posts
    1,829
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Here you go:

    Code:
    EXPLAIN SELECT `QuestionTopic`.`question_id`
    FROM `questions_topics` AS `QuestionTopic`
    WHERE `QuestionTopic`.`topic_id`
    IN (
    '4ea7b132-6948-4fc7-a7ac-c5634352af75', '4ea7b135-dc44-40df-bbeb-c5634352af75', '4ea7bd3d-b534-4f57-9b34-c5634352af75', '4ea7b3dc-6e8c-4c28-b05c-c5634352af75', '4ea7b3dc-263c-429e-97c0-c5634352af75', '4ea7bd3d-4d90-4c93-abe1-c5634352af75', '4ea7bd4a-639c-4780-80b6-c5634352af75'
    )
    GROUP BY question_id
    Code:
    id 	select_type 	table 	type 	possible_keys 	key 	key_len 	ref 	rows 	Extra
    1 	SIMPLE 	QuestionTopic 	range 	topic_id 	topic_id 	108 	NULL 	70 	Using where; Using temporary; Using filesort
    Code:
    CREATE TABLE `questions_topics` (
     `id` char(36) NOT NULL,
     `question_id` char(36) NOT NULL,
     `topic_id` char(36) NOT NULL,
     `latest_update` datetime NOT NULL,
     `created` datetime NOT NULL,
     PRIMARY KEY (`id`),
     KEY `question_id` (`question_id`,`topic_id`),
     KEY `topic_id` (`topic_id`)
    ) ENGINE=InnoDB DEFAULT CHARSET=utf8
    Sorry for not posting this in the mysql forum. : /

  4. #4
    SQL Consultant gold trophysilver trophybronze trophy
    r937's Avatar
    Join Date
    Jul 2002
    Location
    Toronto, Canada
    Posts
    39,015
    Mentioned
    53 Post(s)
    Tagged
    2 Thread(s)
    okay, the EXPLAIN looks logical

    now please run this and try your EXPLAIN again --
    Code:
    ALTER TABLE questions_topics
    ADD INDEX ( topic_id , question_id )
    a many-to-many relationship table like this should always be declared ~without~ its own auto_increment or other surrogate key, but with one of the two two-column indexes declared as the PK, and the other one, with the two columns reversed, as an index

    example...
    Code:
    CREATE TABLE questions_topics 
    ( question_id CHAR(36) NOT NULL
    , topic_id CHAR(36) NOT NULL
    , latest_update DATETIME NOT NULL
    , created DATETIME NOT NULL
    , PRIMARY KEY ( question_id , topic_id )
    , INDEX ( topic_id , question_id )
    ) ENGINE=InnoDB DEFAULT CHARSET=utf8
    both the PK and the second index are a covering index and they will allow lookups in either direction to be resolved without accessing the table at all!
    r937.com | rudy.ca | Buy my SitePoint book: Simply SQL
    "giving out my real stuffs"

  5. #5
    SitePoint Wizard
    Join Date
    May 2003
    Location
    Berlin, Germany
    Posts
    1,829
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Code:
    id 	select_type 	table 	type 	possible_keys 	key 	key_len 	ref 	rows 	Extra
    1 	SIMPLE 	QuestionTopic 	range 	topic_id,topic_id_2 	topic_id_2 	108 	NULL 	70 	Using where; Using index; Using temporary; Using f...
    topic_id_2 is that compound index.

    Query went up from 250ms to 480ms, as I have tried before.

    Regarding changed CREATE TABLE ... will try that. Although the framework I am using does not support compound fields as primary keys for the Active Record magic. Will try though.

    Thanks for your time.

  6. #6
    SQL Consultant gold trophysilver trophybronze trophy
    r937's Avatar
    Join Date
    Jul 2002
    Location
    Toronto, Canada
    Posts
    39,015
    Mentioned
    53 Post(s)
    Tagged
    2 Thread(s)
    Quote Originally Posted by DarkAngelBGE View Post
    [Query went up from 250ms to 480ms, as I have tried before.
    weird

    it's the GROUP BY that's causing this: "Using temporary; Using filesort"

    you aren't returning a bazillion rows here, so you could remove the dupes in php, right?

    still, less than a second is pretty good response...
    r937.com | rudy.ca | Buy my SitePoint book: Simply SQL
    "giving out my real stuffs"

  7. #7
    SitePoint Wizard
    Join Date
    May 2003
    Location
    Berlin, Germany
    Posts
    1,829
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Changing the table layout brings the query down to 380ms, as opposed to my initial 250ms. Any more ideas?

  8. #8
    SitePoint Wizard
    Join Date
    May 2003
    Location
    Berlin, Germany
    Posts
    1,829
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Removing the group by brings the query down from 250ms to 170ms, so that def. helps. Good suggestion.

    Do you suggest I should just live with it for now?

    How about that second query that fetches the actual questions with their id. That is at 180ms. Do you think this can further optimized?

    The query is

    Code:
    SELECT `Question`.`title`, `Question`.`slug`, `Question`.`body`, `Question`.`question_follow_count`, `Question`.`urgent`, `Question`.`answer_count`, `User`.`name`, `User`.`title`, `User`.`id`, `Question`.`id` FROM `questions` AS `Question` LEFT JOIN `users` AS `User` ON (`Question`.`user_id` = `User`.`id`) LEFT JOIN `polls` AS `Poll` ON (`Poll`.`foreign_id` = `Question`.`id`) WHERE `Question`.`id` IN ('4eb1076b-5790-4716-8298-dd4d4352af75', '4eb1076c-d500-4f9e-8be4-dd4d4352af75') AND `Question`.`deleted` <> 1 AND `Question`.`published` = 1 ORDER BY `Question`.`latest_update` desc LIMIT 10;
    Adding an index over id, deleted, published, latest_update brings the response time from 180ms to 430ms.

  9. #9
    SQL Consultant gold trophysilver trophybronze trophy
    r937's Avatar
    Join Date
    Jul 2002
    Location
    Toronto, Canada
    Posts
    39,015
    Mentioned
    53 Post(s)
    Tagged
    2 Thread(s)
    Quote Originally Posted by DarkAngelBGE View Post
    The query is ...
    ach du lieber, that is some ugly sql you have there

    Code:
    SELECT Question.title
         , Question.slug
         , Question.body
         , Question.question_follow_count
         , Question.urgent
         , Question.answer_count
         , User.name
         , User.title
         , User.id
         , Question.id 
      FROM questions AS Question 
    LEFT 
      JOIN users AS User 
        ON (Question.user_id = User.id) 
    LEFT 
      JOIN polls AS Poll 
        ON (Poll.foreign_id = Question.id) 
     WHERE Question.id IN 
           ( '4eb1076b-5790-4716-8298-dd4d4352af75'
           , '4eb1076c-d500-4f9e-8be4-dd4d4352af75' ) 
       AND Question.deleted <> 1 
       AND Question.published = 1 
    ORDER 
        BY Question.latest_update desc LIMIT 10;
    you said you added an index on ( id, deleted, published, latest_update ) and that showed that you are definitely on top of indexing strategies, because those are all the columns referenced in the WHERE clause

    i really cannot explain why your response time is increasing

    as for the query itself, why are you joining to the polls table? you aren't using any of its columns for anything...
    r937.com | rudy.ca | Buy my SitePoint book: Simply SQL
    "giving out my real stuffs"

  10. #10
    SitePoint Wizard
    Join Date
    May 2003
    Location
    Berlin, Germany
    Posts
    1,829
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    This is how CakePHP maps a hasOne relation. It does the join and then has a second query that fetches the poll data

    Code:
    SELECT `Poll`.`id`, `Poll`.`foreign_id`, `Poll`.`question`, `Poll`.`created`, `Poll`.`updated` FROM `polls` AS `Poll` WHERE `Poll`.`foreign_id` = '4eb113d1-6a68-447f-98f5-dd4d4352af75'
    Not elegant I know, but oh well.

    Anyway, thanks for your help. I think this is sufficient for now. I can always add some caching strategies to cache the questions for specific parent topic combinations.

  11. #11
    SitePoint Wizard
    Join Date
    May 2003
    Location
    Berlin, Germany
    Posts
    1,829
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    As for the uglyness of the sql, that is CakePHP's output. I didn't bother to format it beautifully. Sorry about that.

  12. #12
    SQL Consultant gold trophysilver trophybronze trophy
    r937's Avatar
    Join Date
    Jul 2002
    Location
    Toronto, Canada
    Posts
    39,015
    Mentioned
    53 Post(s)
    Tagged
    2 Thread(s)
    Quote Originally Posted by DarkAngelBGE View Post
    This is how CakePHP maps a hasOne relation. It does the join and then has a second query that fetches the poll data
    but the join is totally useless !!

    sigh

    oh those silly framework engineers

    r937.com | rudy.ca | Buy my SitePoint book: Simply SQL
    "giving out my real stuffs"

  13. #13
    SitePoint Wizard
    Join Date
    May 2003
    Location
    Berlin, Germany
    Posts
    1,829
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)

  14. #14
    SitePoint Wizard
    Join Date
    May 2003
    Location
    Berlin, Germany
    Posts
    1,829
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    I am having trouble finding a proper index for the following query:

    Code:
    EXPLAIN SELECT 
    `Feed`.`id`, `Feed`.`type`, `Feed`.`answer_id`,
    `Question`.`title`, `Question`.`slug` 
    FROM `feeds` AS `Feed` 
    LEFT JOIN `questions` AS `Question` ON (`Feed`.`question_id` = `Question`.`id`) 
    WHERE `Feed`.`type` = 'new_answer' AND 
    `Feed`.`topic_id` IN ('4ea7b2c0-e840-4aec-8a2d-c5634352af75', '4ea7b2c8-9d84-4771-9400-c5634352af75') AND 
    `Question`.`answer_count` > 0 
    GROUP BY `Feed`.`question_id` 
    ORDER BY `Feed`.`created` desc LIMIT 10;
    I tried an index on type,topic_id,question_id,created and one on type,topic_id,created.
    But EXPLAIN won't tell me it's using the index. What am I missing? : /

    Thank you for your time.

  15. #15
    SQL Consultant gold trophysilver trophybronze trophy
    r937's Avatar
    Join Date
    Jul 2002
    Location
    Toronto, Canada
    Posts
    39,015
    Mentioned
    53 Post(s)
    Tagged
    2 Thread(s)
    that query is seriously b0rked

    start by removing the GROUP BY clause
    r937.com | rudy.ca | Buy my SitePoint book: Simply SQL
    "giving out my real stuffs"

  16. #16
    SitePoint Wizard
    Join Date
    May 2003
    Location
    Berlin, Germany
    Posts
    1,829
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Tried that. Also does not use an INDEX.

  17. #17
    SQL Consultant gold trophysilver trophybronze trophy
    r937's Avatar
    Join Date
    Jul 2002
    Location
    Toronto, Canada
    Posts
    39,015
    Mentioned
    53 Post(s)
    Tagged
    2 Thread(s)
    can't help you, man, sorry

    can't figure your query out, it is still logically b0rked
    r937.com | rudy.ca | Buy my SitePoint book: Simply SQL
    "giving out my real stuffs"

  18. #18
    SitePoint Wizard
    Join Date
    May 2003
    Location
    Berlin, Germany
    Posts
    1,829
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Hehe well okay, I have rewritten the logic and have a completely different query now. So all good.

    Thx anyway.


Bookmarks

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •