  1. #1
    plaatspunt, SitePoint Enthusiast

    Performance: one join query vs (1+n) queries

    Hi,
    On the net, many people advocate using joins in SQL. I was wondering whether this always makes sense. Consider this 1:N example, where people can hire multiple books. We want to show all people, with the books they hire.

    Code:
    SELECT person.*, book.title, book.id
      FROM person
    INNER JOIN book
        ON book.lend_to_person_id = person.person_id
    If a person hires on average 5 books at a time, we will get on average 5 rows containing the same person data for every person.
    This forces us to loop like this:
    PHP Code:
    $previousPersonId = null;
    foreach ($results as $result) {
        // we have a new person here
        if ($result['person_id'] != $previousPersonId) {
            echo "books hired by {$result['name']}: \n";
        }
        echo "\t {$result['title']} \n";

        $previousPersonId = $result['person_id'];
    }

    /* output
    books hired by John:
        some book
        other book
        foo book
    books hired by Pierre:
        La nuit
        Le petit prince
    */
    Q. Wouldn't it be better to select all persons first, and then select the hired books per person afterwards?
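    For reference, a minimal sketch of the (1+n) approach being asked about - one query for the persons, then one query per person - assuming a PDO connection $pdo and the table/column names used above:
    PHP Code:
    // query 1: fetch all persons
    $persons = $pdo->query("SELECT person_id, name FROM person")->fetchAll();

    // prepared once, executed once per person: this is the "+n" part
    $stmt = $pdo->prepare("SELECT title FROM book WHERE lend_to_person_id = ?");

    foreach ($persons as $person) {
        $stmt->execute([$person['person_id']]);
        echo "books hired by {$person['name']}: \n";
        foreach ($stmt->fetchAll() as $book) {
            echo "\t {$book['title']} \n";
        }
    }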

  2. #2
    r937, SQL Consultant
    A: no
    rudy.ca | @rudydotca
    Buy my SitePoint book: Simply SQL
    "giving out my real stuffs"

  3. #3
    plaatspunt, SitePoint Enthusiast
    That's a clear answer.
    I believe you, don't worry. However, I don't understand why loading joined data that is 5 times bigger than the unjoined data is superior to the latter. Can you explain that?

    And what if I make two queries:
    1. select all persons, then
    2. select their books using "WHERE lend_to_person_id IN (1, 2, 4, ...)"?
    Still worse?
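    For illustration, a minimal sketch of this batched two-query variant (again assuming a PDO connection $pdo and the column names from post #1):
    PHP Code:
    // query 1: fetch all persons
    $persons = $pdo->query("SELECT person_id, name FROM person")->fetchAll();
    $ids = array_column($persons, 'person_id');

    // query 2: fetch the books of all those persons at once
    // (assumes at least one person, so the IN (...) list is never empty)
    $placeholders = implode(',', array_fill(0, count($ids), '?'));
    $stmt = $pdo->prepare(
        "SELECT lend_to_person_id, title
           FROM book
          WHERE lend_to_person_id IN ($placeholders)"
    );
    $stmt->execute($ids);
    $books = $stmt->fetchAll();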

  4. #4
    r937, SQL Consultant
    yes, still worse ...

    ... in my opinion

    would you care to run a test?

    develop two pieces of code, one for a single query, and the other for the multiple queries

    make sure that you complete the exercise by ensuring that the display of data on the web page is included in your coding effort

    in other words, this is not about query efficiency alone, but also about development time

    now, add up the CPU time and compare

    then, add up the development hours and compare

    multiply the CPU time by $0.0000003 per second

    multiply the development time by $100.00 per hour

    what did you end up with?
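    (A hypothetical illustration of that arithmetic: if one version saves 10,000 CPU-seconds a month, that is worth 10,000 × $0.0000003 = $0.003, while a single extra hour of development on the other version costs $100.00.)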

  5. #5
    SitePoint Wizard, Kent
    Output before processing:
    Code:
    personID
    1
    3
    4
    6
    9

    bookID, personID
     1, 9
     3, 1
     4, 6
     8, 4
    10, 1
    12, 3
    15, 9
    19, 1
    20, 6
    21, 1
    30, 4
    40, 9
    50, 4
    51, 3
    etc, etc

    you've got a lot of looping here to relate each person (and their other data) to their books (and the books' other data)
    If only there were an existing method of doing that - wait a minute, there is! It's called SQL and relational databases
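    To make that concrete, a minimal sketch of the matching the application code would have to do itself (the variable names $personRows and $bookRows are assumptions):
    PHP Code:
    // index the books by personID so each person's books can be looked up
    $booksByPerson = [];
    foreach ($bookRows as $row) {
        $booksByPerson[$row['personID']][] = $row['bookID'];
    }

    // then walk the persons and pull their books out of the index
    foreach ($personRows as $person) {
        $books = $booksByPerson[$person['personID']] ?? [];
        echo "person {$person['personID']} has books: " . implode(', ', $books) . "\n";
    }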

  6. #6
    richrf (Non-Member)
    Hi,

    There is a lot going on here, so it is difficult to dissect your question and understand specifically where you may be correct and where you may be incorrect.

    A few things about SQL and relational in general:

    1) Results are returned as sets, so you will get multiple rows returned by the RDBMS with one query.

    2) The optimizer attempts to minimize the number of disk I/Os required to return the result set. How well it does this depends on the indexes that are available, its ability to choose the correct indexes and to use multiple indexes correctly, the distribution and clustering of the data, the size of the rows relative to the size of a block, the type of SQL query, the expectation that data will be found in the cache, and so on. All optimizers are different. Some will parse a query better than others, depending upon how well the optimizer is written and what kind of statistics are kept on the data in the SQL catalog.

    3) Because optimizers do not always pick the correct optimization plan, there may in fact be times when you have to work around them in order to achieve acceptable performance (forcing an index is one example; see the sketch below). This is very bad, since it degrades "physical database independence" in your program (optimizers change, and over time may do better than what you force-fit into programming code), but sometimes it is necessary.
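    For example, a minimal MySQL sketch of such a workaround (the index name idx_lend_to is an assumption):
    Code:
    SELECT title
      FROM book FORCE INDEX (idx_lend_to)
     WHERE lend_to_person_id = 4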

    So, to answer your question: in the vast majority of situations it is better to use relational operations, retrieve data as tuple (row) sets, and let the optimizer do the job of minimizing I/Os. In practice, however, this may not be possible in every situation, so be aware that there may be times when you have to "fine tune" the process using programming code. But do this only as a last resort, because in the long term any such optimization code embedded in programs may come back to haunt you.

    In your specific example, it is difficult to say whether or not you will run into problems with an optimizer. It depends upon whether the optimizer performs excessive join work because of an incorrect interpretation of the data in your database. There are certainly times when a program will run faster by simply running a full table scan, buffering data in a temporary table, and then running a subsequent retrieval or smaller join. There are exceptions to every rule.

    Hope this helps,

    Rich

  7. #7
    r937, SQL Consultant
    plaatspunt, here's an article that might be useful...

    Minimize Bandwidth in One-to-Many Joins

  8. #8
    richrf (Non-Member)
    Quote Originally Posted by r937 View Post
    plaatspunt, here's an article that might be useful...

    Minimize Bandwidth in One-to-Many Joins
    Hi,

    It is an interesting article, but I believe that if we are using database servers (e.g. MySQL), then it is less of a bandwidth issue and more of an I/O issue. If the RDBMS is able to cache the article, then there will be no additional I/Os required. However, if the article is indeed very large, then it may not be possible to cache it. Also, many RDBMSs will separate the storage of the large text from the fixed-length record, so the join may be resolved without resorting to multiple retrievals of the large text. Unfortunately, a lot of this is RDBMS dependent, which does somewhat degrade the notion of physical data independence between the program and the RDBMS.

    Thanks for the link.

    Rich

  9. #9
    plaatspunt, SitePoint Enthusiast
    Hi all, thanks for your comments!
    Quote Originally Posted by r937 View Post
    in other words, this is not about query efficiency alone, but also about development time
    I was trying to make a benchmark; I will see if I succeed. I am concerned with query efficiency alone, because I am striving for an ORM solution: parsing result sets is hidden inside that layer.

    @richrf: this is clear. I am just curious whether you know if the duplicate entries actually refer to the same memory?

    That is a really nice trick! Many times the joined tables don't share the same column names and datatypes, so what if we combine their column sets using aliases and UNION ALL?
    Code:
    person.name person.age  book.title      book.author     book.isbn
    john        23          NULL            NULL            NULL
    NULL        NULL        foo book        foo auth        12344321
    NULL        NULL        bar book        bar auth        12344326
    NULL        NULL        some book       some auth       08976456
    I also wondered how you would associate comments with posts in your example if you retrieve more than one post. Your query is flexible enough to handle such an expression, but wouldn't you need more information for the display?

  10. #10
    plaatspunt, SitePoint Enthusiast
    Quote Originally Posted by richrf View Post
    It is an interesting article, but I believe that if we are using database servers (e.g. MySQL), then it is less of a bandwidth issue and more of an I/O issue. [...]
    Really interesting! Do you know how MySQL [InnoDB] performs with respect to this?

  11. #11
    r937, SQL Consultant
    rich, maybe you did not understand what the article was talking about

    the bandwidth in question is the amount of data that has to be sent up the pipe from the database server to the web server

    back in post #1, you will see that this is about the query result set that comes into the php code from the mysql query which happens to be a join

    the query result set "repeats" the data from the "one" table on every row of the "many" data

    the article shows how to "collapse" that data stream so that the "one" data in the one-to-many relationship is present only once, ahead of and "interleaved" with rows of "many" data

    thus minimizing the bandwidth between the database and the application layer

    whether the database employs caching is somewhat beside the point


  12. #12
    r937, SQL Consultant
    Quote Originally Posted by plaatspunt View Post
    Many times the joined tables don't share the same column names and datatypes, so what if we combine their column sets using aliases and UNION ALL?
    Code:
    person.name person.age  book.title      book.author     book.isbn
    john        23          NULL            NULL            NULL
    NULL        NULL        foo book        foo auth        12344321
    NULL        NULL        bar book        bar auth        12344326
    NULL        NULL        some book       some auth       08976456
    here's how i would structure that data --
    Code:
    personid  isbn      title       author    
        4     NULL      john        23        
        4     12344321  foo book    foo auth  
        4     12344326  bar book    bar auth  
        4     08976456  some book   some auth
    since this is about lending books, the row with isbn NULL cannot be a book, so it's the person

    and of course the result set uses ORDER BY personid, isbn, so that NULL row comes first

    neat, eh?

    name overlaps with title, and age cast as string overlaps with author
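    A minimal sketch of a query that could produce that result set, assuming the person and book tables from post #1 plus the age, isbn, and author columns from post #9:
    Code:
    SELECT person_id          AS personid
         , CAST(NULL AS CHAR) AS isbn
         , name               AS title
         , CAST(age AS CHAR)  AS author
      FROM person
    UNION ALL
    SELECT p.person_id
         , b.isbn
         , b.title
         , b.author
      FROM person p
    INNER JOIN book b
        ON b.lend_to_person_id = p.person_id
    ORDER BY personid, isbn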

  13. #13
    richrf (Non-Member)
    Quote Originally Posted by plaatspunt View Post
    Really interesting! Do you know how MySql [InnoDb] perform with respect to this?
    Sorry, I am rusty with my RDBMSs and haven't yet had a chance to look at the different implementations at this level. There are some aspects of MySQL that are quite nice, but on the whole it is a rather crude engine compared to those from vendors that have been around a lot longer. But it is popular (probably because of the price), and it does do a good job on many types of applications.

    Rich

  14. #14
    richrf (Non-Member)
    Quote Originally Posted by r937 View Post
    rich, maybe you did not understand what the article was talking about

    the bandwidth in question is the amount of data that has to be sent up the pipe from the database server to the web server
    Yes, there was a bit of misunderstanding, since bandwidth to me usually means communication bandwidth. When it comes to databases, the amount of time to send information over the channel is usually quite small compared to the actual disk seek time, so I normally talk about saving I/O. Of course, this is heavily dependent on what type of storage device is being used, and what type of caching is available. Lots of variables. :-)

    Quote Originally Posted by r937 View Post
    back in post #1, you will see that this is about the query result set that comes into the php code from the mysql query which happens to be a join

    the query result set "repeats" the data from the "one" table on every row of the "many" data

    the article shows how to "collapse" that data stream so that the "one" data in the one-to-many relationship is present only once, ahead of and "interleaved" with rows of "many" data

    thus minimizing the bandwidth between the database and the application layer
    Yes, this is a good point. In this case, one can say that Article and Comment are essentially the same entity type (same column set), and therefore the result set can be represented by a union (merging) instead of a join (combining). Yes, I very much agree, though I would explain the concept differently. Thanks for explaining it to me more fully.

    Rich

  15. #15
    plaatspunt, SitePoint Enthusiast
    Quote Originally Posted by r937 View Post
    here's how i would structure that data -- [...] name overlaps with title, and age cast as string overlaps with author
    Yes, I might well use this in an automated way. I doubt, though, whether it is faster to reuse columns like this instead of just adding the extra columns to the result set. In the end, the latter costs only NULLs, and that shouldn't add much.

