SitePoint Sponsor

User Tag List

Page 4 of 4 FirstFirst 1234
Results 76 to 96 of 96
  1. #76
    SitePoint Addict Mastodont's Avatar
    Join Date
    Mar 2007
    Location
    Czech Republic
    Posts
    375
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    matthijsA: EAV is explicitly mentioned in my last referenced link, that's why I posted it.
    So the question is, does that schema scale or not?
    This can be AFAIK verified only on live database. Try have a look at Magento's db schema.

  2. #77
    SitePoint Guru
    Join Date
    Oct 2006
    Location
    Queensland, Australia
    Posts
    852
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    In reply to matthijsA, that's pretty much what I said in this recent post...

    http://www.sitepoint.com/forums/show...2&postcount=59

  3. #78
    SitePoint Zealot
    Join Date
    Jun 2004
    Location
    Netherlands
    Posts
    172
    Mentioned
    2 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by Wardrop View Post
    In reply to matthijsA, that's pretty much what I said in this recent post...

    http://www.sitepoint.com/forums/show...2&postcount=59
    Maybe you said it in some way, but you keep repeating that my schema does not scale at all. And then you explain that it doesn't scale by applying my schema to your example of the computer shop in which there are endless possible combinations of products.

    However, the important point is that we are talking about different kinds of shops. I am talking about a shop with a "limited" amount of products. "Limited", can still be tens of thousands of products, like a big clothing shop or online book store. You are talking about a shop with almost endless combinations of products (order this motherboard with this memory with this case, etc).

    It's quite an important distinction to make. Allsprititeve, what kind of store are you looking at to build? Is it more a store with a limited amount of products? Or is it a store with endless combinations of products/attributes? Is it an e-commerce application which should be able to handle every situation?

  4. #79
    SitePoint Guru
    Join Date
    Oct 2006
    Location
    Queensland, Australia
    Posts
    852
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by matthijsA View Post
    but you keep repeating that my schema does not scale at all. And then you explain that it doesn't scale by applying my schema to your example of the computer shop in which there are endless possible combinations of products.
    I should clarify. By scale, I was referring more to the complexity of products which it supports. For example, my schema scales well from simple, to extremely complex products. My fault for not making that clearer.

  5. #80
    SitePoint Zealot
    Join Date
    Jun 2004
    Location
    Netherlands
    Posts
    172
    Mentioned
    2 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by Wardrop View Post
    I should clarify. By scale, I was referring more to the complexity of products which it supports. For example, my schema scales well from simple, to extremely complex products. My fault for not making that clearer.
    Aha, now I understand. indeed there are more meanings of scale..

    Communicating on a forum can be so difficult sometimes
    Not your fault.

  6. #81
    SitePoint Guru
    Join Date
    Oct 2006
    Location
    Queensland, Australia
    Posts
    852
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    I always get myself into trouble on forums, don't worry. Sometimes I try and explain things which I just shouldn't be trying to explain on a forum

  7. #82
    Spirit Coder allspiritseve's Avatar
    Join Date
    Dec 2002
    Location
    Ann Arbor, MI (USA)
    Posts
    648
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by Wardrop View Post
    Mind elaborating on that. I believe I asked this earlier in the thread but don't believe I received a response.
    Code:
    SELECT products.* 
    FROM products 
    WHERE color = 'red' AND size = 'L'
    vs.
    Code:
    SELECT products.* 
    FROM products 
    INNER JOIN product_attributes as a 
    ON (a.product_id = products.id)
    INNER JOIN product_attributes as b
    ON (b.product_id = products.id)
    WHERE a.attribute = 'color' AND a.value = 'red'
    AND b.attribute = 'size' AND b.value = 'L'
    I believe (and could be wrong on this) that you would need one join for every attribute you need to specify as criteria in a search. Joins are expensive, and if you have to do a join for every attribute in a search, what should be twenty sections in a WHERE statement is now twenty joins. Basically, you're trying to do the database's job for it. Databases already have attributes and values: columns and row values. I'm sure somebody could give you a more technical explanation than that for why EAV is slow, but that's how I've come to understand it.

  8. #83
    SitePoint Guru
    Join Date
    Oct 2006
    Location
    Queensland, Australia
    Posts
    852
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Code:
    SELECT
    	title,
    	(products.base_price + attributes.price_diff) as price
    FROM
    	products
    LEFT JOIN attributes
    	ON attributes.product_id = products.id
    WHERE
    	attributes.attribute = 'colour' AND
    	attributes.value = 'green'
    That would be required for a schema like mine, so not really a problem considering there'd be an index on products.id, attributes.id and attributes.product_id. But I do see what you mean when using a more EAV-like schema such as matthijsa's. Actually can anyone determine for a fact that my schema is even EAV? What's the critical factor that determines if a schema is EAV or not?

  9. #84
    Spirit Coder allspiritseve's Avatar
    Join Date
    Dec 2002
    Location
    Ann Arbor, MI (USA)
    Posts
    648
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by Wardrop View Post
    Code:
    SELECT
        title,
        (products.base_price + attributes.price_diff) as price
    FROM
        products
    LEFT JOIN attributes
        ON attributes.product_id = products.id
    WHERE
        attributes.attribute = 'colour' AND
        attributes.value = 'green'
    That would be required for a schema like mine, so not really a problem considering there'd be an index on products.id, attributes.id and attributes.product_id. But I do see what you mean when using a more EAV-like schema such as matthijsa's. Actually can anyone determine for a fact that my schema is even EAV? What's the critical factor that determines if a schema is EAV or not?
    What happens when you need to select by two attributes? Then your query looks exactly like the second one I showed. Even with proper indexes that's a join per attribute that you wouldn't need otherwise.

    Your schema is as close to EAV as Matthijs' schema is. In fact, both schemas are so similar it's suprising there's even a debate over which is better/more applicable. EAV stands for Entity-Attribute-Value, and would have a column for each as standard. You are just splitting Entity out into another table-- products.

  10. #85
    Spirit Coder allspiritseve's Avatar
    Join Date
    Dec 2002
    Location
    Ann Arbor, MI (USA)
    Posts
    648
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Lets say, just for kicks, I decide to go with r937's schema and do this:

    Code:
    products
    ---------
    sku
    title
    description
    in_stock
    price_retail
    price_wholesale
    weight
    size
    color
    date
    I guess I would just like to discuss the issues that come to mind with this schema, and see how to solve them.

    For users & clients, a product will often encompass multiple rows in this table (I will refer to this as a 'UserProduct'). Thus, in order to present that unified UserProduct to them, we need a way to group those rows. Currently, the most unique field would be 'title'. What happens if a client renames a UserProduct to the same name as another? Suddenly both UserProducts would appear and be modified as one, even though they're different.

    Another issue, if editing or adding a single UserProduct, the name and description (which are generally the same across all products in a UserProduct) the name and description will be duplicated for every product that makes up that UserProduct. Maybe that's not a big deal?

    Another issue is products that have fixed options for certain attributes. (This may be something that has to be handled in the application, rather than the database-- I'd like to hear opinions on that). For example, the site I'm currently working on has wine bottles with 3 different volumes: 375ml, 750ml, and 1.5L. The application calculates the product weight based on these volumes. Ideally I could have some way to ensure a wine bottle product had volumes selected from those values.

    That also brings up another issue: A product seems to have a hierarchy of meaning: For example, a 375ml Blanc de Blanc is a specific product of a UserProduct 'Blanc de Blanc', which is from a specific wine maker, which is part of a general type 'wine bottle'. I need ways to distinguish between all of these somehow, and also apply general actions to the more course grained categories. For example, applying a weight to a certain volume wine bottle, or applying a certain title to a UserProduct.

    That's all for now, but if you guys can think of any more issues with that schema, feel free to bring them up. I'm not trying to find fault with it, but I'd like to have a good understanding of the tradeoffs involved, and get some ideas on how to handle the things I've mentioned above.

  11. #86
    SitePoint Guru
    Join Date
    Oct 2006
    Location
    Queensland, Australia
    Posts
    852
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by allspiritseve View Post
    What happens when you need to select by two attributes? Then your query looks exactly like the second one I showed.
    Yes, I see what you mean. I overlooked the point you were making.

    There's an alternative though (which would probably apply to matthijsa's schema also), which does essentially the same thing but gives you an extra bit of flexibility. Here it is...

    Code:
    SELECT
    	products.id,
    	products.title,
    	products.base_price
    FROM
    	products
    JOIN
    	(
    		SELECT DISTINCT
    			product_id,
    			attribute,
    			value
    		FROM attributes
    		WHERE
    			(attribute = 'colour' AND value = 'green')
    			OR (attribute = 'size' AND value = 'small')
    			OR (attribute = 'material' AND value = 'cotton')
    			OR (attribute = 'sleeves' AND value = 'short')
    	) as attributes
    	ON attributes.product_id = products.id
    GROUP BY
    	products.id
    HAVING
    	COUNT(*) >= 4
    I don't honestly know what would be faster, this or just multiple joins. A consultant at work who mainly specialises in SQL queries, believes the join option would be faster.

    Anyway, the extra bit of a flexibility I was talking about is the result of this portion of the query...

    Code:
    HAVING
    	COUNT(*) >= 4
    Setting this to 3 in the example above, would allow you to find all products that have at least 3 of the 4 attribute/value pairs. So if this method is slower, then at least it introduces a little more functionality (for use when it's needed).

  12. #87
    SitePoint Guru
    Join Date
    Oct 2006
    Location
    Queensland, Australia
    Posts
    852
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    I couldn't help myself in the end. I had to setup a test database to run these two queries against. The results I got were surprising. Note that these queries are running against my schema which you can find on the 2nd page of this thread.

    Test Data

    I generated the test data from a php script. It uses random strings and tries to mimic the patterns of how products, attributes and values are generally stored. Each product has on average 3 or 4 attributes, which have on average have 3 or 4 values each (it varies between 1 to 6).

    Below are the first two queries I compared, starting with mine, followed by the more obvious solution, which I refer to as the 'Typcial Query' (both are searching on 4 attributes, which return exactly 1 result)...

    Code:
    SELECT
    	products.id,
    	products.title,
    	products.base_price
    FROM
    	products
    JOIN
    	(
    		SELECT DISTINCT
    			product_id,
    			attribute,
    			value
    		FROM attributes
    		WHERE
    			(attribute = 'colour' AND value = 'green')
    			OR (attribute = 'size' AND value = 'small')
    			OR (attribute = 'material' AND value = 'cotton')
    			OR (attribute = 'sleeves' AND value = 'short')
    	) as attributes
    	ON attributes.product_id = products.id
    GROUP BY
    	products.id
    HAVING
    	COUNT(*) >= 4
    Code:
    SELECT
    	products.id,
    	products.title,
    	products.base_price
    FROM
    	products 
    JOIN
    	attributes as a 
    	ON (a.product_id = products.id)
    JOIN
    	attributes as b
    	ON (b.product_id = products.id)
    JOIN
    	attributes as c
    	ON (c.product_id = products.id)
    JOIN
    	attributes as d
    	ON (d.product_id = products.id)
    WHERE
    	(a.attribute = 'colour' AND a.value = 'green')
    	AND (b.attribute = 'size' AND b.value = 'small')
    	AND (c.attribute = 'material' AND c.value = 'cotton')
    	AND (d.attribute = 'sleeves' AND d.value = 'short')
    The results for these two queries were as follows.

    With 1000 products, and 12,000 attribute/value entries, the queries took the following number of seconds (on average)...

    My Query = 0.0120
    Typical Query = 0.0322


    Then with 16,000 products, and 200,000 attribute/value entries...

    My Query = 0.4150
    Typical Query = 1.0100


    I then lowered the number of attributes that were being searched on, from 4, down to 2, otherwise keeping the queries exactly as they were. Here are those results (run on the large 16,000/200,000 database)...

    My Query = 0.4150
    Typical Query = 1.0000


    Then I decided to have both queries search for only 1 attribute. First, here are those queries...

    Code:
    SELECT
    	products.id,
    	products.title,
    	products.base_price
    FROM
    	products
    JOIN
    	(
    		SELECT DISTINCT
    			product_id,
    			attribute,
    			value
    		FROM attributes
    		WHERE
    			attribute = 'colour' AND value = 'green'
    	) as attributes
    	ON attributes.product_id = products.id
    GROUP BY
    	products.id
    HAVING
    	COUNT(*) >= 1
    Code:
    SELECT
    	products.id,
    	products.title,
    	products.base_price
    FROM
    	products 
    JOIN
    	attributes
    	ON (attributes.product_id = products.id)
    WHERE
    	attributes.attribute = 'colour' AND attributes.value = 'green'
    The results were once again surprising (queries were run the large 16,000/200,000 database)...

    My Query = 0.8100
    Typical Query = 1.0000


    As you can see, my query doubled in the amount of time it took, compared to when I was searching on 2 or more attributes. Curious, I tracked down the cause. It turned out to be the DISTINCT keyword in the sub-query. I'm unsure of why this is though. It's unneeded however when searching only on 1 attribute, so can be removed in such circumstances anyway. Removing the DISTINCT keyword brought the query time down to about 0.4000.

    In conclusion, my query seems to be more efficient (at least when working with large databases like these). It also seems the speed of both queries depends more on the number of rows that they're searching through, than the number of attributes they're searching for. Keep in mind though that this isn't real world data we're testing with, so these results should be questioned.
    Last edited by Wardrop; Jun 25, 2009 at 04:00.

  13. #88
    SitePoint Zealot
    Join Date
    Jun 2004
    Location
    Netherlands
    Posts
    172
    Mentioned
    2 Post(s)
    Tagged
    0 Thread(s)
    Those are interesting findings. It would be very interesting to expand these tests to different situations. Someone with too much free time should create a big matrix for different situations and their test results

    I'm surprised the queries run so reasonably well. Sure, 1 second is slow, but it is a big database (admittedly, "big" is relative).

  14. #89
    SitePoint Guru
    Join Date
    Oct 2006
    Location
    Queensland, Australia
    Posts
    852
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Keep in mind also, that searching on attributes isn't something you'd need to do very often. You'd probably only implement the feature as part of a search function in the back or/and front-end. As you said also matthijsA, this is a big database. Not too many online stores would be stocking so many products with so many attributes (it's the sheer number of attributes which has the biggest impact on speed). In a real-world scenario, you'll probably find that there'd usually be less rows in the attributes table than in the products table.

    To prove my point, I'll run the same two queries as I ran in the my previous post (I'll repeat them below anyway - "My query" is the first query), but this time with only 5000 rows in the attributes table (still just over 16,000 products in the product table)...

    Code:
    SELECT
    	products.id,
    	products.title,
    	products.base_price
    FROM
    	products
    JOIN
    	(
    		SELECT DISTINCT
    			product_id,
    			attribute,
    			value
    		FROM attributes
    		WHERE
    			(attribute = 'colour' AND value = 'green')
    			OR (attribute = 'size' AND value = 'small')
    			OR (attribute = 'material' AND value = 'cotton')
    			OR (attribute = 'sleeves' AND value = 'short')
    	) as attributes
    	ON attributes.product_id = products.id
    GROUP BY
    	products.id
    HAVING
    	COUNT(*) >= 4
    Code:
    SELECT
    	products.id,
    	products.title,
    	products.base_price
    FROM
    	products 
    JOIN
    	attributes as a 
    	ON (a.product_id = products.id)
    JOIN
    	attributes as b
    	ON (b.product_id = products.id)
    JOIN
    	attributes as c
    	ON (c.product_id = products.id)
    JOIN
    	attributes as d
    	ON (d.product_id = products.id)
    WHERE
    	(a.attribute = 'colour' AND a.value = 'green')
    	AND (b.attribute = 'size' AND b.value = 'small')
    	AND (c.attribute = 'material' AND c.value = 'cotton')
    	AND (d.attribute = 'sleeves' AND d.value = 'short')
    My Query = 0.0050
    Typical Query = 0.0039


    As we can size, the numbers have turned around to some extend. My query now takes a little longer. But as we've seen, my query scales better when you begin dealing with large numbers of attributes.

    By the way, for anyone interested, I've been running all these queries on my home workstation, which is configured as follows...

    OS: Windows 7 x64 RC1 (Build 7100)
    CPU: Q9450 (Quad-Core @ 2.66Ghz)
    RAM: 8GB
    Hard-Drive with SQL Data: (H drive) 3 year-old, 300GB western digital, standard 7200rpm SATA drive (only has MySQL and Apache on it).
    MySQL Version: 5.1.33

  15. #90
    Spirit Coder allspiritseve's Avatar
    Join Date
    Dec 2002
    Location
    Ann Arbor, MI (USA)
    Posts
    648
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Can you run a standard query with one products table that has columns for color, size, material, and sleeves? Also add indexes to those columns if you did on the attributes table. I'm curious to see how that stacks up.

  16. #91
    SitePoint Guru
    Join Date
    Oct 2006
    Location
    Queensland, Australia
    Posts
    852
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    I'll run the tests when I get home from work in about 8-9 hours from now.

  17. #92
    SitePoint Wizard Ren's Avatar
    Join Date
    Aug 2003
    Location
    UK
    Posts
    1,060
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Rather than using a database for searching, have you looked at the facet searching abilities of something like Solr.

    http://wiki.apache.org/solr/SolrFacetingOverview

  18. #93
    SitePoint Guru
    Join Date
    Oct 2006
    Location
    Queensland, Australia
    Posts
    852
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Ok, I've inserted 12,000 products using the single table schema as suggested by allspiritsteve (out of his curiosity). None of the attributes columns have indexes, as per the tests run on the other schema.

    Note, while the number of attributes remains at 4, the number of possible values per attribute is between 1 and 3. Remember, such a schema requires the storage of all possible combinations of values, not just the possible values. As a result, the script I ran generated 391,000 rows, which totals 12,000 separate products (or, separate product titles).

    Moving on the query...

    Code:
    SELECT
    	title,
    	price
    FROM
    	products
    WHERE 
    	colour = 'green'
    	AND size = 'small'
    	AND material = 'cotton'
    	AND sleeves = 'short'
    This query took 1.1900 seconds to run on the 391,000 row table (without indexes on the attribute columns). Putting indexes on all attribute columns brought the query time down to only 0.0005 seconds, so you definitely want indexes on the attribute columns if you plan on doing such searching.

    Remember though, none of the other tests had indexes on any of the attribute or value columns. So, I put an index on both the attribute and value columns of that test table. Below are the results for 16,000 products and 200,000 attribute/value's...

    My Query = 0.0008 seconds
    Typical Query = 0.0006 seconds


    Here are queries (starting with "My Query")...

    Code:
    SELECT
    	products.id,
    	products.title,
    	products.base_price
    FROM
    	products
    JOIN
    	(
    		SELECT DISTINCT
    			product_id,
    			attribute,
    			value
    		FROM attributes
    		WHERE
    			(attribute = 'colour' AND value = 'green')
    			OR (attribute = 'size' AND value = 'small')
    			OR (attribute = 'material' AND value = 'cotton')
    			OR (attribute = 'sleeves' AND value = 'short')
    	) as attributes
    	ON attributes.product_id = products.id
    GROUP BY
    	products.id
    HAVING
    	COUNT(*) >= 4
    Code:
    SELECT
    	products.id,
    	products.title,
    	products.base_price
    FROM
    	products 
    JOIN
    	attributes as a 
    	ON (a.product_id = products.id)
    JOIN
    	attributes as b
    	ON (b.product_id = products.id)
    JOIN
    	attributes as c
    	ON (c.product_id = products.id)
    JOIN
    	attributes as d
    	ON (d.product_id = products.id)
    WHERE
    	(a.attribute = 'colour' AND a.value = 'green')
    	AND (b.attribute = 'size' AND b.value = 'small')
    	AND (c.attribute = 'material' AND c.value = 'cotton')
    	AND (d.attribute = 'sleeves' AND d.value = 'short')
    Well there you go. Just shows how important it is to have index's for columns being searched on. It also shows that the "Typical Query" is generally the better option, however, "My Query" can still be useful if you need to find all products with a certain amount of the defined attribute/values. For example (as explained in an earlier post), if I changed...

    Code:
    HAVING
    	COUNT(*) >= 4
    ...to..,
    Code:
    HAVING
    	COUNT(*) >= 3
    ...in my query above, then it would find all products with 3 or more of the defined attribute/value pairs, which could be handy if you're trying to implement a more flexible, or forgiving method of searching.

    As a note, all tests I've run so far have been on InnoDB tables.

  19. #94
    SQL Consultant gold trophysilver trophybronze trophy
    r937's Avatar
    Join Date
    Jul 2002
    Location
    Toronto, Canada
    Posts
    39,350
    Mentioned
    63 Post(s)
    Tagged
    3 Thread(s)
    Quote Originally Posted by Wardrop View Post
    As a note, all tests I've run so far have been on InnoDB tables.
    i am ~so~ not a DBA, but i've heard that timing tests also require that you either flush the database buffers in between tests (i don't know how to do that) or simply restart the server to accomplish the same thing
    rudy.ca | @rudydotca
    Buy my SitePoint book: Simply SQL
    "giving out my real stuffs"

  20. #95
    SitePoint Guru
    Join Date
    Oct 2006
    Location
    Queensland, Australia
    Posts
    852
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    I didn't do any DB flushing. I did alternate the tests though to get the averages (ie. run query1, run query2, run query1, etc).

  21. #96
    SitePoint Addict Mastodont's Avatar
    Join Date
    Mar 2007
    Location
    Czech Republic
    Posts
    375
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    either flush the database buffers or simply restart the server
    Or disable query cache?


Tags for this Thread

Bookmarks

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •