  1. #26
    SitePoint Guru bronze trophy TomB's Avatar
    Join Date
    Oct 2005
    Location
    Milton Keynes, UK
    Posts
    988
    Mentioned
    9 Post(s)
    Tagged
    2 Thread(s)
    Quote Originally Posted by Lemon Juice View Post
    Considering there are many more cases apart from pulling the first item from the data source, you need to have a very intelligent translator of object access properties and methods to SQL so that it remains efficient. Still, even the most intelligent translator cannot optimize everything if you don't tell it up front what data you will need to request later, because often it's much faster to get more data in bigger chunks than to request many smaller ones. Therefore, from a performance point of view, what cpradio suggests makes more sense - fetch all data beforehand in one go.
    No ORM can optimise everything. You don't want to tell it what data to fetch. Part of the job of the ORM is to ensure consistency. From an OOP perspective objects should always be complete. Admittedly, in the name of performance it's usually better to specify certain columns (such as blob ones) which are lazy loaded, but as a rule, a record is a data set and needs to have integrity. By fetching part of the data, you lose this.

    Whether you're using joins or separate queries, the DB is going to be the slowest part of the system.

    Have you benchmarked that? Most of the time additional queries are faster than joins, especially when sorting is involved or you take prepared statements into account. The number of queries has little effect, if you're running 10 queries that do a PK lookup and all run in a matter of milliseconds it's better than running a query which does 10 joins, a sort and takes 2 seconds to run.

    If you are indeed "fetching all the data beforehand" that is potentially far worse performance! You're fetching data you may never need or want.

    Quote Originally Posted by Lemon Juice View Post
    However, I do think it sacrifices some portion of performance, which can be important for any large system. For anything small it's sweet to have so much data source abstraction and independence, but when a site gets large and datasets expand it becomes important to tweak or rewrite individual SQL queries, and at that stage this abstraction becomes a hindrance. And I don't think being able to swap the relational database for XML files or whatever else you may think of is important (unless you have a specific requirement for this in a project). In a small system you can play with it but with a large database this would be almost insane
    Again, you're making assumptions about performance problems that don't exist because additional queries are often faster!

    Funnily enough, it's those larger systems where being able to join from multiple sources has the largest benefit and practical use because they often do need to connect to external data sources, multiple databases, etc. In fact, it's only in small self-contained systems where you'd want to use a DB-specific data mapper, because you can be fairly sure no other data sources will be needed. The larger the system, the more features there are and it becomes increasingly more likely that external/unknown data sources are needed.

    Consider:

    PHP Code:
    //Show user's latest tweet
    echo $this->mapper->user->findById(123)->twitter[0]->content;
    Which would use a mapper that connected to the user's twitter feed and found the tweets. That is, of course, the tip of the iceberg.
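    To make that less abstract, such a mapper could sit behind the same interface as a DB-backed one. This is only a sketch - the Tweet class, the feed URL handling and the method names are made up for illustration, not my actual code:

    PHP Code:
    //Sketch of a mapper backed by a remote feed rather than SQL.
    //Tweet is a plain entity class, assumed for illustration.
    class TwitterMapper {
        private $feedUrl;

        public function __construct($feedUrl) {
            $this->feedUrl = $feedUrl;
        }

        //Same public interface a DB-backed mapper would expose
        public function findByUser($screenName) {
            $json = file_get_contents($this->feedUrl . '?screen_name=' . urlencode($screenName));

            $tweets = array();
            foreach (json_decode($json, true) as $row) {
                $tweet = new Tweet();
                $tweet->content = $row['text'];
                $tweets[] = $tweet;
            }
            return $tweets;
        }
    }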


    Quote Originally Posted by Lemon Juice View Post
    This is an ongoing question of how far we are willing to go with implementing good and flexible OOP at the expense of practical usefulness (performance). I don't think in PHP I would go as far as you, but certainly that would be a good exercise in learning OOP techniques. Everyone chooses their own balance.
    Again though, this is a strawman argument. There is no performance issue. Using inner joins can be slower!

    Quote Originally Posted by Lemon Juice View Post
    - no need to define metadata, no XML or other model/relationship configuration files, no other maintenance chores (in my case it's just a question of running one script that will automatically reverse-engineer all my database structure into proper objects)
    You do need some metadata somewhere. The fact that you have expressed it inside queries doesn't mean that it isn't there. Those joins in your queries ARE the metadata. They are defining the relationships between your entities. Whether they're stored in an SQL query, an XML file or wherever is irrelevant, any ORM which handles relationships will need some knowledge of it.
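    To make the point concrete: the ON clause and a mapper's relationship configuration carry the same information. The array format and the manufacturer_id column below are invented purely for illustration - every ORM has its own way of expressing this:

    PHP Code:
    //The ON clause of a join...
    $sql = 'SELECT p.*, m.* FROM products p
            LEFT JOIN manufacturer m ON m.id = p.manufacturer_id';

    //...encodes the same relationship metadata a mapper needs.
    $relations = array(
        'product' => array(
            'manufacturer' => array(
                'type'       => '1:1',
                'localKey'   => 'manufacturer_id',
                'foreignKey' => 'id',
            ),
        ),
    );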


    You are too hung up on a performance issue (which doesn't even exist!) while ignoring that there is always a cost-to-benefit ratio. The cost to performance is nil or almost nil whereas the benefit is huge. Consider testing. Being able to quickly and easily substitute any mapper to use an XML file instead of the database immediately makes testing far easier by removing the database dependency.

    A practical example: On one site I worked on, a client had a stock control system used at their physical store which already contained almost all the information we'd ever need about the products. This was a desktop application running on their network that could be connected to as a web service to retrieve the data. By simply altering a couple of lines in the mapper definition, all the data could be used in real-time and work with our existing shopping cart system with no changes, and with no need to store a copy of the data in the database and risk syncing issues. The data could then be saved back into the stock control system transparently. All using the existing code that we have used on several entirely database-driven sites.

  2. #27
    SitePoint Guru bronze trophy
    Join Date
    Dec 2003
    Location
    Poland
    Posts
    926
    Mentioned
    7 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by TomB View Post
    No ORM can optimise everything. You don't want to tell it what data to fetch. Part of the job of the ORM is to ensure consistency. From an OOP perspective objects should always be complete. Admittedly, in the name of performance it's usually better to specify certain columns (such as blob ones) which are lazy loaded, but as a rule, a record is a data set and needs to have integrity. By fetching part of the data, you lose this.
    Yes, I agree, objects should always be complete. That's why I said that loading data in advance is a compromise if we want to achieve better performance - and whether better performance is actually achieved that way is, as far as I can see, debatable right now.

    Quote Originally Posted by TomB View Post
    Have you benchmarked that? Most of the time additional queries are faster than joins, especially when sorting is involved or you take prepared statements into account. The number of queries has little effect, if you're running 10 queries that do a PK lookup and all run in a matter of milliseconds it's better than running a query which does 10 joins, a sort and takes 2 seconds to run.

    If you are indeed "fetching all the data beforehand" that is potentially far worse performance! You're fetching data you may never need or want.

    Again, you're making assumptions about performance problems that don't exist because additional queries are often faster!
    I can't agree with that statement entirely. It all depends on what kind of query is run and we can't generalise it like that. In my experience it happens *sometimes* that separate queries are faster than joins but I've had very few cases like that. I can see that most of my joins perform very fast. Just for the sake of curiosity I've just made a benchmark of a common join, a simple case where I make a listing of products and need to fetch manufacturer name from a separate table:
    Code:
    SELECT p.*, m.*
        FROM products p
        LEFT JOIN manufacturer m ON m.id = p.id
        WHERE p.hidden = 0
        ORDER BY p.price
        LIMIT 1000;
    Table of about 6000 records, no indexes except on the ids. I did this without any ORM - I just loaded the entire result set into an array and timed it. Then I did the same without a join, loading each manufacturer's data from the database in a loop. The whole operation was about 2.5 times faster for the join. The interesting thing was that when I did the same benchmark loading the data into objects through my ORM, the join performed about 2.5 times slower - but that was due to the way I fetch data: when I fetch data into a single object (without joins) I load the whole row with a single statement, whereas for joins there is a loop that traverses all fetched fields so that the appropriate ones are populated into the related objects. Taking this into consideration, the join will be slower in my case because I can't optimise fetching the data (I can't find a way to load the related row without looping over each column and doing some logic to determine where the data should go).
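    In case it helps, the benchmark was shaped roughly like this - a simplified sketch, with $pdo being an existing PDO connection and the repeat runs trimmed out:

    PHP Code:
    //Variant 1: single LEFT JOIN, whole result set into an array
    $start = microtime(true);
    $rows = $pdo->query(
        'SELECT p.*, m.* FROM products p
         LEFT JOIN manufacturer m ON m.id = p.id
         WHERE p.hidden = 0 ORDER BY p.price LIMIT 1000'
    )->fetchAll(PDO::FETCH_ASSOC);
    echo 'join: ' . (microtime(true) - $start) . "s\n";

    //Variant 2: one query for the products, then one per product (N+1)
    $start = microtime(true);
    $products = $pdo->query(
        'SELECT p.* FROM products p
         WHERE p.hidden = 0 ORDER BY p.price LIMIT 1000'
    )->fetchAll(PDO::FETCH_ASSOC);
    $stmt = $pdo->prepare('SELECT m.* FROM manufacturer m WHERE m.id = ?');
    foreach ($products as &$product) {
        $stmt->execute(array($product['id']));
        $product['manufacturer'] = $stmt->fetch(PDO::FETCH_ASSOC);
    }
    echo 'separate: ' . (microtime(true) - $start) . "s\n";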

    So it's not so much a question of joins vs separate queries but rather the overall overhead of the mapper/ORM. Loading everything properly into objects can be 2 or 3 times slower compared to a more direct method - this may not be much for most sites, but for busy ones it may mean needing a dedicated server sooner. If I allow myself some inconsistency in my objects then I can, for example, load the manufacturer name with a JOIN into the Product object and the speed will be better than any other solution. Sure, inconsistency will have its price.


    Quote Originally Posted by TomB View Post
    Funnily enough, it's those larger systems where being able to join from multiple sources has the largest benefit and practical use because they often do need to connect to external data sources, multiple databases, etc. In fact, it's only in small self-contained systems where you'd want to use a DB-specific data mapper, because you can be fairly sure no other data sources will be needed. The larger the system, the more features there are and it becomes increasingly more likely that external/unknown data sources are needed.

    Consider:

    PHP Code:
    //Show user's latest tweet
    echo $this->mapper->user->findById(123)->twitter[0]->content;
    Which would use a mapper that connected to the user's twitter feed and found the tweets. That is, of course, the tip of the iceberg.
    You may have a point here with the larger systems, but I don't think the twitter example is the best one, because even in a DB-specific system I can easily do the above if I write my own method for connecting to twitter. But then yes, I need to do it all properly and not try to fetch twitter content with SQL - I need to stick to the objects. The problem would arise if I first had the twitter data in the DB and used SQL, and then switched to some remote feed.

    Quote Originally Posted by TomB View Post
    You do need some metadata somewhere. The fact that you have expressed it inside queries doesn't mean that it isn't there. Those joins in your queries ARE the metadata. They are defining the relationships between your entities. Whether they're stored in an SQL query, an XML file or wherever is irrelevant, any ORM which handles relationships will need some knowledge of it.
    Yes, I fully agree that I need metadata somewhere. I meant that I don't have to define it anywhere because it is generated automatically. The point is I don't need to spend time on it.

    Quote Originally Posted by TomB View Post
    You are too hung up on a performance issue (which doesn't even exist!) while ignoring that there is always a cost-to-benefit ratio. The cost to performance is nil or almost nil whereas the benefit is huge. Consider testing. Being able to quickly and easily substitute any mapper to use an XML file instead of the database immediately makes testing far easier by removing the database dependency.

    A practical example: On one site I worked on, a client had a stock control system used at their physical store which already contained almost all the information we'd ever need about the products. This was a desktop application running on their network that could be connected to as a web service to retrieve the data. By simply altering a couple of lines in the mapper definition, all the data could be used in real-time and work with our existing shopping cart system with no changes, and with no need to store a copy of the data in the database and risk syncing issues. The data could then be saved back into the stock control system transparently. All using the existing code that we have used on several entirely database-driven sites.
    Yep, I can see benefit in that, point taken!

    BTW, do you implement some form of caching? When you do a chain like $this->mapper->user->findById(123)->twitter[0]->content a few times then does your mapper fetch the data again from the DB?

  3. #28
    Hosting Team Leader silver trophybronze trophy
    cpradio's Avatar
    Join Date
    Jun 2002
    Location
    Ohio
    Posts
    4,827
    Mentioned
    142 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by TheRedDevil View Post
    On a side note, connections in MySQL are very cheap (compared to SQL Server), so this is not a worry here.
    Quote Originally Posted by TomB View Post
    Again, you're making assumptions about performance problems that don't exist because additional queries are often faster!

    Funnily enough, it's those larger systems where being able to join from multiple sources has the largest benefit and practical use because they often do need to connect to external data sources, multiple databases, etc. In fact, it's only in small self-contained systems where you'd want to use a DB-specific data mapper, because you can be fairly sure no other data sources will be needed. The larger the system, the more features there are and it becomes increasingly more likely that external/unknown data sources are needed.
    Just wanted to point this out, as this is where you can gain significant performance improvements. If you ever deal with a SQL Server or MySQL Server or whatever database that is installed on a separate server or controlled/monitored by third parties, you can get significant performance savings by only needing to go to that server once. It is important to consider the fact that you will have network latency in your connections and fetching of results. Doing that repeatedly is going to add up in cost.

    Let me put that into a real life example. I worked on a project where the average page load time was 5-7 seconds (not terrible for the amount of data they were loading, but still slow). Ultimately, they were performing 600+ queries/fetches for the data needed for the page. Changing those 600 queries (they were in a loop) down to 2 queries dropped the entire page load time to under 1 second. Their SQL Server was located on a separate server, so they did have network latency to consider, but they never realized that as their product grew, the loop was going to run more and more queries.
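    In rough sketch form (table and column names invented for illustration), the change was essentially from a per-row lookup inside the loop to one set-based query:

    PHP Code:
    //Before: one round-trip to the database server per id in the loop
    $stmt = $pdo->prepare('SELECT * FROM order_lines WHERE order_id = ?');
    $lines = array();
    foreach ($orderIds as $id) {
        $stmt->execute(array($id));
        $lines[$id] = $stmt->fetchAll(PDO::FETCH_ASSOC);
    }

    //After: a single round-trip for the whole set, grouped in PHP
    //(assumes $orderIds is a plain zero-indexed array)
    $placeholders = implode(',', array_fill(0, count($orderIds), '?'));
    $stmt = $pdo->prepare("SELECT * FROM order_lines WHERE order_id IN ($placeholders)");
    $stmt->execute($orderIds);
    $lines = array();
    foreach ($stmt->fetchAll(PDO::FETCH_ASSOC) as $row) {
        $lines[$row['order_id']][] = $row;
    }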

    The problem I see is that some ORMs could have this effect if not closely watched, especially with remote sources. I'm not saying ORMs are bad, just do your due diligence and test them in a variety of situations before going all in on one.
    Be sure to congratulate xMog on earning April's Member of the Month
    Go ahead and blame me, I still won't lose any sleep over it
    My Blog | My Technical Notes

  4. #29
    SitePoint Guru bronze trophy TomB's Avatar
    Join Date
    Oct 2005
    Location
    Milton Keynes, UK
    Posts
    988
    Mentioned
    9 Post(s)
    Tagged
    2 Thread(s)
    Quote Originally Posted by Lemon Juice View Post
    I can't agree with that statement entirely. It all depends on what kind of query is run and we can't generalise it like that. In my experience it happens *sometimes* that separate queries are faster than joins but I've had very few cases like that. I can see that most of my joins perform very fast. Just for the sake of curiosity I've just made a benchmark of a common join, a simple case where I make a listing of products and need to fetch manufacturer name from a separate table:
    Code:
    SELECT p.*, m.*
        FROM products p
        LEFT JOIN manufacturer m ON m.id = p.id
        WHERE p.hidden = 0
        ORDER BY p.price
        LIMIT 1000;
    Table of about 6000 records, no indexes except on the ids. I did this without any ORM - I just loaded the entire result set into an array and timed it. Then I did the same without a join, loading each manufacturer's data from the database in a loop. The whole operation was about 2.5 times faster for the join. The interesting thing was that when I did the same benchmark loading the data into objects through my ORM, the join performed about 2.5 times slower - but that was due to the way I fetch data: when I fetch data into a single object (without joins) I load the whole row with a single statement, whereas for joins there is a loop that traverses all fetched fields so that the appropriate ones are populated into the related objects. Taking this into consideration, the join will be slower in my case because I can't optimise fetching the data (I can't find a way to load the related row without looping over each column and doing some logic to determine where the data should go).
    How many queries did you run? You should have run 1 query to fetch the products and one for the manufacturer of each product, the result being N+1 queries. Imagine you want to sort the initial product list. Add ORDER BY product.name to that, or ORDER BY RAND(), and your single query will be immensely slower because it's sorting the entire data set. Move the sort to shuffle($products) in PHP and it's suddenly much faster. This is scaled by the number of tables you are joining.
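    Roughly what I mean, assuming a plain PDO connection - the same rows come back either way, only where the randomising happens changes:

    PHP Code:
    //Sorting on the server: RAND() is evaluated per row and the whole
    //result set has to be sorted before anything is returned
    $rows = $pdo->query(
        'SELECT p.* FROM products p WHERE p.hidden = 0 ORDER BY RAND()'
    )->fetchAll(PDO::FETCH_ASSOC);

    //Randomising the order in PHP instead: no server-side sort at all
    $rows = $pdo->query(
        'SELECT p.* FROM products p WHERE p.hidden = 0'
    )->fetchAll(PDO::FETCH_ASSOC);
    shuffle($rows);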

    Quote Originally Posted by Lemon Juice View Post
    So it's not so much a question of joins vs separate queries but rather the overall overhead of the mapper/ORM. Loading everything properly into objects can be 2 or 3 times slower compared to a more direct method - this may not be much for most sites, but for busy ones it may mean needing a dedicated server sooner. If I allow myself some inconsistency in my objects then I can, for example, load the manufacturer name with a JOIN into the Product object and the speed will be better than any other solution. Sure, inconsistency will have its price.
    But it depends what your query is doing. Once you start adding more joins, sorts and grouping (which you'll need if you're using joins in this way) the single query method very quickly loses any performance benefit it had. Simple is always faster, and once you start using prepared statements to fetch the manufacturers the difference will become even less in your very simplistic example.

    That's fair enough, but we recently had an interesting topic on premature optimisation and that's exactly what it sounds like to me. The problem is you're sacrificing flexibility in the name of performance. Fixing performance problems can easily be done after the fact with a host of different methods. Fixing flexibility problems generally involves re-writing a lot of code because the code isn't flexible enough to handle anything outside what it was initially designed to do. Once the client moves the goalposts and says

    Quote Originally Posted by Lemon Juice View Post
    Yes, I fully agree that I need metadata somewhere. I meant that I don't have to define it anywhere because it is generated automatically. The point is I don't need to spend time on it.
    How are your joins generated automatically? Somewhere you need to store data about the primary key/foreign key relationships. The query itself can be generated, but which fields to join on needs to be stored somewhere. You'll need to spend time defining those somewhere if your ORM can support joins. It will also need to know the kind of join: 1:1, 1:n. This is all metadata and needs to be configured to enable your queries to be generated.

    Quote Originally Posted by Lemon Juice View Post
    BTW, do you implement some form of caching? When you do a chain like $this->mapper->user->findById(123)->twitter[0]->content a few times then does your mapper fetch the data again from the DB?
    Yes. Any time an individual record is fetched, it's stored in the mapper. Individual mappers can turn this off but it's something I rarely need to use. Instead I have it so that certain fields (e.g. longblob) can be lazy loaded.

    I did play around with caching results from queries that returned multiple results, but I found that in the real world, 99% of the time these are only used once, whereas a record fetched by ID, in addition to being a lot less data to store in memory, is often something that ends up being used multiple times. If I want a product's manufacturer name, it's likely I'll also want that product's manufacturer logo. So in that example, that record would be cached in memory for the duration of the script.
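    In outline it's just an identity map inside each mapper. This is a simplified sketch rather than my exact code, with a plain User entity class assumed:

    PHP Code:
    class UserMapper {
        private $pdo;
        private $cache = array();

        public function __construct(PDO $pdo) {
            $this->pdo = $pdo;
        }

        public function findById($id) {
            //Records fetched by primary key are kept for the duration
            //of the script and reused on subsequent lookups
            if (!isset($this->cache[$id])) {
                $stmt = $this->pdo->prepare('SELECT * FROM user WHERE id = ?');
                $stmt->execute(array($id));
                $this->cache[$id] = $stmt->fetchObject('User');
            }
            return $this->cache[$id];
        }
    }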


    Quote Originally Posted by cpradio View Post
    Just wanted to point this out, as this is where you can gain significant performance improvements. If you ever deal with a SQL Server or MySQL Server or whatever database that is installed on a separate server or controlled/monitored by third parties, you can get significant performance savings by only needing to go to that server once. It is important to consider the fact that you will have network latency in your connections and fetching of results. Doing that repeatedly is going to add up in cost.

    Let me put that into a real life example. I worked on a project where the average page load time was 5-7 seconds (not terrible for the amount of data they were loading, but still slow). Ultimately, they were performing 600+ queries/fetches for the data needed for the page. Changing those 600 queries (they were in a loop) down to 2 queries dropped the entire page load time to under 1 second. Their SQL Server was located on a separate server, so they did have network latency to consider, but they never realized that as their product grew, the loop was going to run more and more queries.

    The problem I see is that some ORMs could have this effect if not closely watched, especially with remote sources. I'm not saying ORMs are bad, just do your due diligence and test them in a variety of situations before going all in on one.
    Of course 600+ queries is too much. At the same time, and equally anecdotally, I've fixed a very slow page by breaking it up into separate queries. It was a reporting query dealing with a hell of a lot of data, joining around 15 tables, sorting and grouping, and it was taking around 45 seconds to run. In both our examples, the problem was the original developer neglecting to think about the consequences.

    However, slow single queries are substantially worse than a page that is slow because it's executing too many queries. That reporting query that took 45+ seconds to run (actually that's generous, some of them were more than double that!) joined a table that stored questionnaire answers. Questionnaires were being filled in constantly (probably every 5-10 seconds a write would happen to one of the answers tables). When that happened, because the SELECT was still running when the UPDATE/INSERT was issued, the answers table got locked and suddenly nobody across the entire site could even view a questionnaire or any kind of results page. All the result of someone viewing the results of a questionnaire from the past! Joyous! Once the reporting pages were changed to use separate queries, they still took a while to run (significantly less though - roughly half the time), but whether they were running or not, it never broke functionality across other parts of the site.

  5. #30
    From space with love silver trophy
    SpacePhoenix's Avatar
    Join Date
    May 2007
    Location
    Poole, UK
    Posts
    4,910
    Mentioned
    96 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by TomB View Post
    Imagine you want to sort the initial product list. Add ORDER BY product.name to that, or ORDER BY RAND(), and your single query will be immensely slower because it's sorting the entire data set. Move the sort to shuffle($products) in PHP and it's suddenly much faster. This is scaled by the number of tables you are joining.

    That reporting query that took 45+ seconds to run joined a table that stored questionnaire answers. Questionnaires were being filled in constantly (probably every 5-10 seconds a write would happen to one of the answers tables). When that happened, because the SELECT was still running when the UPDATE/INSERT was issued, the answers table got locked and suddenly nobody across the entire site could even view a questionnaire or any kind of results page.

    What DB engine was in use at the time? Assuming it was a MySQL server, my gut feeling is that it was a MyISAM table. The table(s) could be migrated over to InnoDB, but you'd have to keep in mind that InnoDB has only supported full-text search from MySQL 5.6.4 onwards (http://blogs.innodb.com/wp/2011/12/i...n-mysql-5-6-4/).

    In this thread various methods of getting the results from a large result set in a random order were compared, and the most efficient turned out to be the one where all the sorting was offloaded to the language being used to process the dataset (PHP).

    Code MySQL:
    SELECT p.*, m.*
        FROM products p
         LEFT JOIN manufacturer m
         ON m.id=p.id
              WHERE p.hidden=0
         ORDER BY p.price
         LIMIT 1000;

    You're selecting all the fields from both tables - how many of those fields do you actually need? You should list only the required ones in the SELECT clause, otherwise you're transferring data from fields you don't need, which is a waste of bandwidth, especially if MySQL and PHP aren't on the same server box.
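    For example (the column names are invented for illustration), a listing page probably only needs something like:

    PHP Code:
    //Only the columns the listing actually displays, instead of p.*, m.*
    $sql = 'SELECT p.id, p.name, p.price, m.name AS manufacturer_name
              FROM products p
              LEFT JOIN manufacturer m ON m.id = p.id
             WHERE p.hidden = 0
             ORDER BY p.price
             LIMIT 1000';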

    It would be interesting to hear @r937's take on this thread from a database POV
    Community Team Advisor
    Forum Guidelines: Posting FAQ Signatures FAQ Self Promotion FAQ
    Help the Mods: What's Fluff? Report Fluff/Spam to a Moderator

  6. #31
    SitePoint Guru bronze trophy
    Join Date
    Dec 2003
    Location
    Poland
    Posts
    926
    Mentioned
    7 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by TomB View Post
    How many queries did you run? You should have run 1 query to fetch the products and one for the manufacturer of each product. The result being N+1 queries.
    Yes, in my benchmark I did N+1 queries, which was 1001 queries. The 1001 simple queries were 2.5x slower than a single LEFT JOIN fetching the same data from two tables. This was on my local installation, so there were no network latencies.

    Quote Originally Posted by TomB View Post
    Imagine you want to sort the initial product list. Add ORDER BY product.name to that or ORDER BY RAND() and your single query will be immensely slower because it's sorting the entire data set. Move the sort to array_shuffle($products) and it's suddenly much faster.
    Yes, ordering by product name was slightly slower, but that's understandable because it's a text field and not a number (price). In this case the 1001 separate queries were 2x slower than the single LEFT JOIN. Still, the JOIN was much better.

    Quote Originally Posted by TomB View Post
    This is scaled by the number of tables you are joining.
    Okay, just out of curiosity I added another LEFT JOIN that fetched product's availability - a row from a small lookup table joined by a numeric availability_id:

    Code:
    SELECT p.*, m.*, a.*
        FROM products p
        LEFT JOIN manufacturer m ON m.id = p.id
        LEFT JOIN availability a ON a.availability_id = p.availability_id
        WHERE p.hidden = 0
        ORDER BY p.name
        LIMIT 1000;
    The time for the single JOIN query became only marginally slower (almost the same as with 1 JOIN), but fetching the same data in separate queries took much longer - understandable, since now there are 2001 queries to run. This time the single query performed 3.5x faster than the separate queries.

    Okay, your scaling argument does hold true for INNER JOINs, though. But still the single query was 2x faster. However, with a little bit of tweaking I was able to run the INNER JOINs as fast as the LEFT JOINs - by adding an index to product.name and adding FORCE INDEX(prod_name) to the query.
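    For reference, the tweak was along these lines - a sketch with the query shape reconstructed from the LEFT JOIN version above, so treat it as illustrative:

    PHP Code:
    //One-off: index the sort column (prod_name is the index name used above)
    $pdo->exec('ALTER TABLE products ADD INDEX prod_name (name)');

    //Then hint the optimiser to use it for the sorted, joined query
    $sql = 'SELECT p.*, m.*, a.*
              FROM products p FORCE INDEX (prod_name)
              JOIN manufacturer m ON m.id = p.id
              JOIN availability a ON a.availability_id = p.availability_id
             WHERE p.hidden = 0
             ORDER BY p.name
             LIMIT 1000';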

    I don't see that joins are as slow as you're making them out to be. Sure, there will be cases where a join will be inferior, but it's not that often, and if that happens there's still room for improvement by optimising, adding indexes, etc. And when a large query takes too much time it's often enough to split 1 or 2 of the joins off into separate queries to get rid of the performance problems. I'm not trying to be argumentative to prove joins are the way to go - I was curious myself how they perform, that's why I did the above benchmarks. Another thing - recent versions of MySQL have had many performance improvements, and the same joins you found slow many years ago might run much faster now.

    Quote Originally Posted by TomB View Post
    That's fair enough, but we recently had an interesting topic on premature optimisation and that's exactly what it sounds like to me. The problem is you're sacrificing flexibility in the name of performance. Fixing performance problems can easily be done after the fact with a host of different methods. Fixing flexibility problems generally involves re-writing a lot of code because the code isn't flexible enough to handle anything outside what it was initially designed to do.
    It's very subjective where premature optimisation starts. Personally I prefer not to optimise too much, but still to keep performance in mind when developing a web site. In other words, use the convenience of objects whenever possible but still try to implement solutions that will perform fairly fast. Depending on the project, the flexibility of being able to replace the whole data source may not be needed at all.

    If you develop the whole system using a perfectly object-oriented approach with an ORM or a mapper for data access, then how are you going to optimise it if you stumble across a problem like cpradio described - the network latency is too high and you need to reduce the number of queries to 1 or 2? Your mapper can't do it because it hasn't enough knowledge for that, so you need to go out of your way and introduce ugly hacks, giving up the consistency and flexibility that your objects provide - because now you need to run a single big query instead of hundreds of small ones. If the system is prepared for this from the ground up then it's a very easy task.

    It looks like you have one kind of flexibility at the expense of another one - the flexibility to optimise database access. There's nothing wrong with it but it only proves that there's no ideal system that can be 100% flexible in every aspect.

    Quote Originally Posted by TomB View Post
    Yes. Any time an individual record is fetched, it's stored in the mapper. Individual mappers can turn this off but it's something I rarely need to use. Instead I have it so that certain fields (e.g. longblob) can be lazy loaded.

    I did play around with caching results from queries that returned multiple results, but I found that in the real world, 99% of the time these are only used once, whereas a record fetched by ID, in addition to being a lot less data to store in memory, is often something that ends up being used multiple times. If I want a product's manufacturer name, it's likely I'll also want that product's manufacturer logo. So in that example, that record would be cached in memory for the duration of the script.
    Ok, thanks for the description!

  7. #32
    SitePoint Guru bronze trophy
    Join Date
    Dec 2003
    Location
    Poland
    Posts
    926
    Mentioned
    7 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by SpacePhoenix View Post
    What DB engine was in use at the time?
    The engine was InnoDB on MySQL 5.5.19.

    Quote Originally Posted by SpacePhoenix View Post
    In this thread various methods of getting the results from a large result set in a random order and the most efficient turned out to be the one where all the sorting was offloaded to the language being used to process the dataset (PHP).
    Yes, ORDER BY RAND() is known to be slow in MySQL. In this case it was not that bad - almost the same as ordering by product.name - but that's probably because the product table is not that big.

    Quote Originally Posted by SpacePhoenix View Post
    Code MySQL:
    SELECT p.*, m.*
        FROM products p
         LEFT JOIN manufacturer m
         ON m.id=p.id
              WHERE p.hidden=0
         ORDER BY p.price
         LIMIT 1000;

    You're selecting all the fields from both tables - how many of those fields do you actually need? You should list only the required ones in the SELECT clause, otherwise you're transferring data from fields you don't need, which is a waste of bandwidth, especially if MySQL and PHP aren't on the same server box.

    It would be interesting to hear @r937's take on this thread from a database POV
    Haha, it's funny that you mention that! I am selecting all fields from both of the tables because selecting all fields is necessary when you are using an ORM - you need to fill the objects with all fields to keep them complete, that's the whole point. This is at odds with the optimisation practices you mention above. From my observation, fetching all fields is only marginally less performant unless you fetch some big texts or blobs unnecessarily. But I had to benchmark selecting all fields for the results to be relevant to the topic we are discussing.

    r937 is welcome to comment - it would be interesting to hear what he thinks about running tens or hundreds of separate queries instead of joins, not worrying about what queries your system runs under the hood, and the like. Mapping DB relations to objects is not easy and always has some drawbacks - some people claim it's not even possible, or doesn't make sense, because the two paradigms are so different!

  8. #33
    SitePoint Guru bronze trophy TomB's Avatar
    Join Date
    Oct 2005
    Location
    Milton Keynes, UK
    Posts
    988
    Mentioned
    9 Post(s)
    Tagged
    2 Thread(s)
    Quote Originally Posted by Lemon Juice View Post
    The time for the single JOIN query became only marginally slower (almost the same as with 1 JOIN), but fetching the same data in separate queries took much longer - understandable, since now there are 2001 queries to run. This time the single query performed 3.5x faster than the separate queries.

    Okay, your scaling argument does hold true for INNER JOINs, though. But still the single query was 2x faster. However, with a little bit of tweaking I was able to run the INNER JOINs as fast as the LEFT JOINs - by adding an index to product.name and adding FORCE INDEX(prod_name) to the query.
    And this is always going to be the case. In your 1001 queries vs 1 query example, you're comparing apples and oranges because you need to account for the time PHP takes to populate the object graph. Add caching to the multiple queries so that each manufacturer is only loaded once and the number of queries will be (number of manufacturers used by the 1000 products) + 1, which in a lot of real-life scenarios will be significantly lower than n+1. You'll probably find the database does something similar internally when you run your query with the join. With the addition of prepared statements you'll find the difference in speed even less (and make sure they're actually prepared, not emulated via PDO::ATTR_EMULATE_PREPARES).
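    For what it's worth, "make sure they're prepared" boils down to something like this - the DSN, credentials and $manufacturerIds are placeholders:

    PHP Code:
    //Disable emulation so the statement is prepared once on the server
    //and only the parameters travel on each execute()
    $pdo = new PDO('mysql:host=localhost;dbname=shop', $user, $pass, array(
        PDO::ATTR_EMULATE_PREPARES => false,
        PDO::ATTR_ERRMODE          => PDO::ERRMODE_EXCEPTION,
    ));

    $stmt = $pdo->prepare('SELECT * FROM manufacturer WHERE id = ?');
    $manufacturers = array();
    foreach ($manufacturerIds as $id) {
        //One execute per distinct manufacturer, thanks to the caching above
        $stmt->execute(array($id));
        $manufacturers[$id] = $stmt->fetch(PDO::FETCH_ASSOC);
    }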


    More importantly, with separate queries data is only loaded when it's needed. So although one page may be marginally slower because it does more queries, another may be faster because it never needs to load manufacturer info for a product. It depends how efficient your ORM is at only fetching the exact data it needs. Mine will only ever fetch records it's going to use because it's all done JIT.

    Quote Originally Posted by Lemon Juice View Post
    I don't see that joins are as slow as you're making them out to be.
    I never said they were slow. I stated that there are enough cases where they are, and that where they are faster, it's not by enough of a difference to be worth sacrificing the flexibility gained by allowing data from multiple sources.


    Quote Originally Posted by Lemon Juice View Post
    the network latency is too high and you need to reduce the number of queries to 1 or 2? Your mapper can't do it because it hasn't enough knowledge for that, so you need to go out of your way and introduce ugly hacks, giving up the consistency and flexibility that your objects provide - because now you need to run a single big query instead of hundreds of small ones. If the system is prepared for this from the ground up then it's a very easy task.
    Actually, the mapper has all the information it needs and would be able to generate queries which fetched an entire object graph. My first ORM did exactly this and fetched data only when it got to an endpoint where the data was actually used. It worked well and allowed $product->manufacturer->logo type chaining, which eventually ran a query that joined products/manufacturers, but it simply wasn't flexible enough.

  9. #34
    SQL Consultant gold trophysilver trophybronze trophy
    r937's Avatar
    Join Date
    Jul 2002
    Location
    Toronto, Canada
    Posts
    39,022
    Mentioned
    53 Post(s)
    Tagged
    2 Thread(s)
    Quote Originally Posted by SpacePhoenix View Post
    It would be interesting to hear @r937 ;'s take on this thread from a database POV
    i have nothing to contribute, sorry

    i don't do php, i have no idea what classes are, and object-oriented anything is way over my head
    r937.com | rudy.ca | Buy my SitePoint book: Simply SQL
    "giving out my real stuffs"

  10. #35
    Hosting Team Leader silver trophybronze trophy
    cpradio's Avatar
    Join Date
    Jun 2002
    Location
    Ohio
    Posts
    4,827
    Mentioned
    142 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by r937 View Post
    i have nothing to contribute, sorry

    i don't do php, i have no idea what classes are, and object-oriented anything is way over my head
    @r937, I think they were more interested in your thoughts on using p.*, m.* in a query instead of specifically stating what columns you want (from a performance/paging perspective).
    Be sure to congratulate xMog on earning April's Member of the Month
    Go ahead and blame me, I still won't lose any sleep over it
    My Blog | My Technical Notes

  11. #36
    SQL Consultant gold trophysilver trophybronze trophy
    r937's Avatar
    Join Date
    Jul 2002
    Location
    Toronto, Canada
    Posts
    39,022
    Mentioned
    53 Post(s)
    Tagged
    2 Thread(s)
    Quote Originally Posted by cpradio View Post
    @r937, I think they were more interested in your thoughts on using p.*, m.* in a query
    oh, you mean the dreaded, evil "select star"?

    r937.com | rudy.ca | Buy my SitePoint book: Simply SQL
    "giving out my real stuffs"

  12. #37
    Hosting Team Leader silver trophybronze trophy
    cpradio's Avatar
    Join Date
    Jun 2002
    Location
    Ohio
    Posts
    4,827
    Mentioned
    142 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by r937 View Post
    oh, you mean the dreaded, evil "select star"?
    Yup
    Be sure to congratulate xMog on earning April's Member of the Month
    Go ahead and blame me, I still won't lose any sleep over it
    My Blog | My Technical Notes

