matthijsA: EAV is explicitly mentioned in my last referenced link, that's why I posted it.
So the question is, does that schema scale or not?
This can be AFAIK verified only on live database. Try have a look at Magento's db schema.
Printable View
matthijsA: EAV is explicitly mentioned in my last referenced link, that's why I posted it.
So the question is, does that schema scale or not?
This can be AFAIK verified only on live database. Try have a look at Magento's db schema.
In reply to matthijsA, that's pretty much what I said in this recent post...
http://www.sitepoint.com/forums/show...2&postcount=59
Maybe you said it in some way, but you keep repeating that my schema does not scale at all. And then you explain that it doesn't scale by applying my schema to your example of the computer shop in which there are endless possible combinations of products.
However, the important point is that we are talking about different kinds of shops. I am talking about a shop with a "limited" amount of products. "Limited", can still be tens of thousands of products, like a big clothing shop or online book store. You are talking about a shop with almost endless combinations of products (order this motherboard with this memory with this case, etc).
It's quite an important distinction to make. Allsprititeve, what kind of store are you looking at to build? Is it more a store with a limited amount of products? Or is it a store with endless combinations of products/attributes? Is it an e-commerce application which should be able to handle every situation?
I always get myself into trouble on forums, don't worry. Sometimes I try and explain things which I just shouldn't be trying to explain on a forum :)
vs.Code:SELECT products.*
FROM products
WHERE color = 'red' AND size = 'L'
I believe (and could be wrong on this) that you would need one join for every attribute you need to specify as criteria in a search. Joins are expensive, and if you have to do a join for every attribute in a search, what should be twenty sections in a WHERE statement is now twenty joins. Basically, you're trying to do the database's job for it. Databases already have attributes and values: columns and row values. I'm sure somebody could give you a more technical explanation than that for why EAV is slow, but that's how I've come to understand it.Code:SELECT products.*
FROM products
INNER JOIN product_attributes as a
ON (a.product_id = products.id)
INNER JOIN product_attributes as b
ON (b.product_id = products.id)
WHERE a.attribute = 'color' AND a.value = 'red'
AND b.attribute = 'size' AND b.value = 'L'
That would be required for a schema like mine, so not really a problem considering there'd be an index on products.id, attributes.id and attributes.product_id. But I do see what you mean when using a more EAV-like schema such as matthijsa's. Actually can anyone determine for a fact that my schema is even EAV? What's the critical factor that determines if a schema is EAV or not?Code:SELECT
title,
(products.base_price + attributes.price_diff) as price
FROM
products
LEFT JOIN attributes
ON attributes.product_id = products.id
WHERE
attributes.attribute = 'colour' AND
attributes.value = 'green'
What happens when you need to select by two attributes? Then your query looks exactly like the second one I showed. Even with proper indexes that's a join per attribute that you wouldn't need otherwise.
Your schema is as close to EAV as Matthijs' schema is. In fact, both schemas are so similar it's suprising there's even a debate over which is better/more applicable. EAV stands for Entity-Attribute-Value, and would have a column for each as standard. You are just splitting Entity out into another table-- products.
Lets say, just for kicks, I decide to go with r937's schema and do this:
I guess I would just like to discuss the issues that come to mind with this schema, and see how to solve them.Code:products
---------
sku
title
description
in_stock
price_retail
price_wholesale
weight
size
color
date
For users & clients, a product will often encompass multiple rows in this table (I will refer to this as a 'UserProduct'). Thus, in order to present that unified UserProduct to them, we need a way to group those rows. Currently, the most unique field would be 'title'. What happens if a client renames a UserProduct to the same name as another? Suddenly both UserProducts would appear and be modified as one, even though they're different.
Another issue, if editing or adding a single UserProduct, the name and description (which are generally the same across all products in a UserProduct) the name and description will be duplicated for every product that makes up that UserProduct. Maybe that's not a big deal?
Another issue is products that have fixed options for certain attributes. (This may be something that has to be handled in the application, rather than the database-- I'd like to hear opinions on that). For example, the site I'm currently working on has wine bottles with 3 different volumes: 375ml, 750ml, and 1.5L. The application calculates the product weight based on these volumes. Ideally I could have some way to ensure a wine bottle product had volumes selected from those values.
That also brings up another issue: A product seems to have a hierarchy of meaning: For example, a 375ml Blanc de Blanc is a specific product of a UserProduct 'Blanc de Blanc', which is from a specific wine maker, which is part of a general type 'wine bottle'. I need ways to distinguish between all of these somehow, and also apply general actions to the more course grained categories. For example, applying a weight to a certain volume wine bottle, or applying a certain title to a UserProduct.
That's all for now, but if you guys can think of any more issues with that schema, feel free to bring them up. I'm not trying to find fault with it, but I'd like to have a good understanding of the tradeoffs involved, and get some ideas on how to handle the things I've mentioned above.
Yes, I see what you mean. I overlooked the point you were making.
There's an alternative though (which would probably apply to matthijsa's schema also), which does essentially the same thing but gives you an extra bit of flexibility. Here it is...
I don't honestly know what would be faster, this or just multiple joins. A consultant at work who mainly specialises in SQL queries, believes the join option would be faster.Code:SELECT
products.id,
products.title,
products.base_price
FROM
products
JOIN
(
SELECT DISTINCT
product_id,
attribute,
value
FROM attributes
WHERE
(attribute = 'colour' AND value = 'green')
OR (attribute = 'size' AND value = 'small')
OR (attribute = 'material' AND value = 'cotton')
OR (attribute = 'sleeves' AND value = 'short')
) as attributes
ON attributes.product_id = products.id
GROUP BY
products.id
HAVING
COUNT(*) >= 4
Anyway, the extra bit of a flexibility I was talking about is the result of this portion of the query...
Setting this to 3 in the example above, would allow you to find all products that have at least 3 of the 4 attribute/value pairs. So if this method is slower, then at least it introduces a little more functionality (for use when it's needed).Code:HAVING
COUNT(*) >= 4
I couldn't help myself in the end. I had to setup a test database to run these two queries against. The results I got were surprising. Note that these queries are running against my schema which you can find on the 2nd page of this thread.
Test Data
I generated the test data from a php script. It uses random strings and tries to mimic the patterns of how products, attributes and values are generally stored. Each product has on average 3 or 4 attributes, which have on average have 3 or 4 values each (it varies between 1 to 6).
Below are the first two queries I compared, starting with mine, followed by the more obvious solution, which I refer to as the 'Typcial Query' (both are searching on 4 attributes, which return exactly 1 result)...
Code:SELECT
products.id,
products.title,
products.base_price
FROM
products
JOIN
(
SELECT DISTINCT
product_id,
attribute,
value
FROM attributes
WHERE
(attribute = 'colour' AND value = 'green')
OR (attribute = 'size' AND value = 'small')
OR (attribute = 'material' AND value = 'cotton')
OR (attribute = 'sleeves' AND value = 'short')
) as attributes
ON attributes.product_id = products.id
GROUP BY
products.id
HAVING
COUNT(*) >= 4
The results for these two queries were as follows.Code:SELECT
products.id,
products.title,
products.base_price
FROM
products
JOIN
attributes as a
ON (a.product_id = products.id)
JOIN
attributes as b
ON (b.product_id = products.id)
JOIN
attributes as c
ON (c.product_id = products.id)
JOIN
attributes as d
ON (d.product_id = products.id)
WHERE
(a.attribute = 'colour' AND a.value = 'green')
AND (b.attribute = 'size' AND b.value = 'small')
AND (c.attribute = 'material' AND c.value = 'cotton')
AND (d.attribute = 'sleeves' AND d.value = 'short')
With 1000 products, and 12,000 attribute/value entries, the queries took the following number of seconds (on average)...
My Query = 0.0120
Typical Query = 0.0322
Then with 16,000 products, and 200,000 attribute/value entries...
My Query = 0.4150
Typical Query = 1.0100
I then lowered the number of attributes that were being searched on, from 4, down to 2, otherwise keeping the queries exactly as they were. Here are those results (run on the large 16,000/200,000 database)...
My Query = 0.4150
Typical Query = 1.0000
Then I decided to have both queries search for only 1 attribute. First, here are those queries...
Code:SELECT
products.id,
products.title,
products.base_price
FROM
products
JOIN
(
SELECT DISTINCT
product_id,
attribute,
value
FROM attributes
WHERE
attribute = 'colour' AND value = 'green'
) as attributes
ON attributes.product_id = products.id
GROUP BY
products.id
HAVING
COUNT(*) >= 1
The results were once again surprising (queries were run the large 16,000/200,000 database)...Code:SELECT
products.id,
products.title,
products.base_price
FROM
products
JOIN
attributes
ON (attributes.product_id = products.id)
WHERE
attributes.attribute = 'colour' AND attributes.value = 'green'
My Query = 0.8100
Typical Query = 1.0000
As you can see, my query doubled in the amount of time it took, compared to when I was searching on 2 or more attributes. Curious, I tracked down the cause. It turned out to be the DISTINCT keyword in the sub-query. I'm unsure of why this is though. It's unneeded however when searching only on 1 attribute, so can be removed in such circumstances anyway. Removing the DISTINCT keyword brought the query time down to about 0.4000.
In conclusion, my query seems to be more efficient (at least when working with large databases like these). It also seems the speed of both queries depends more on the number of rows that they're searching through, than the number of attributes they're searching for. Keep in mind though that this isn't real world data we're testing with, so these results should be questioned.
Those are interesting findings. It would be very interesting to expand these tests to different situations. Someone with too much free time should create a big matrix for different situations and their test results :)
I'm surprised the queries run so reasonably well. Sure, 1 second is slow, but it is a big database (admittedly, "big" is relative).
Keep in mind also, that searching on attributes isn't something you'd need to do very often. You'd probably only implement the feature as part of a search function in the back or/and front-end. As you said also matthijsA, this is a big database. Not too many online stores would be stocking so many products with so many attributes (it's the sheer number of attributes which has the biggest impact on speed). In a real-world scenario, you'll probably find that there'd usually be less rows in the attributes table than in the products table.
To prove my point, I'll run the same two queries as I ran in the my previous post (I'll repeat them below anyway - "My query" is the first query), but this time with only 5000 rows in the attributes table (still just over 16,000 products in the product table)...
Code:SELECT
products.id,
products.title,
products.base_price
FROM
products
JOIN
(
SELECT DISTINCT
product_id,
attribute,
value
FROM attributes
WHERE
(attribute = 'colour' AND value = 'green')
OR (attribute = 'size' AND value = 'small')
OR (attribute = 'material' AND value = 'cotton')
OR (attribute = 'sleeves' AND value = 'short')
) as attributes
ON attributes.product_id = products.id
GROUP BY
products.id
HAVING
COUNT(*) >= 4
My Query = 0.0050Code:SELECT
products.id,
products.title,
products.base_price
FROM
products
JOIN
attributes as a
ON (a.product_id = products.id)
JOIN
attributes as b
ON (b.product_id = products.id)
JOIN
attributes as c
ON (c.product_id = products.id)
JOIN
attributes as d
ON (d.product_id = products.id)
WHERE
(a.attribute = 'colour' AND a.value = 'green')
AND (b.attribute = 'size' AND b.value = 'small')
AND (c.attribute = 'material' AND c.value = 'cotton')
AND (d.attribute = 'sleeves' AND d.value = 'short')
Typical Query = 0.0039
As we can size, the numbers have turned around to some extend. My query now takes a little longer. But as we've seen, my query scales better when you begin dealing with large numbers of attributes.
By the way, for anyone interested, I've been running all these queries on my home workstation, which is configured as follows...
OS: Windows 7 x64 RC1 (Build 7100)
CPU: Q9450 (Quad-Core @ 2.66Ghz)
RAM: 8GB
Hard-Drive with SQL Data: (H drive) 3 year-old, 300GB western digital, standard 7200rpm SATA drive (only has MySQL and Apache on it).
MySQL Version: 5.1.33
Can you run a standard query with one products table that has columns for color, size, material, and sleeves? Also add indexes to those columns if you did on the attributes table. I'm curious to see how that stacks up.
I'll run the tests when I get home from work in about 8-9 hours from now.
Rather than using a database for searching, have you looked at the facet searching abilities of something like Solr.
http://wiki.apache.org/solr/SolrFacetingOverview
Ok, I've inserted 12,000 products using the single table schema as suggested by allspiritsteve (out of his curiosity). None of the attributes columns have indexes, as per the tests run on the other schema.
Note, while the number of attributes remains at 4, the number of possible values per attribute is between 1 and 3. Remember, such a schema requires the storage of all possible combinations of values, not just the possible values. As a result, the script I ran generated 391,000 rows, which totals 12,000 separate products (or, separate product titles).
Moving on the query...
This query took 1.1900 seconds to run on the 391,000 row table (without indexes on the attribute columns). Putting indexes on all attribute columns brought the query time down to only 0.0005 seconds, so you definitely want indexes on the attribute columns if you plan on doing such searching.Code:SELECT
title,
price
FROM
products
WHERE
colour = 'green'
AND size = 'small'
AND material = 'cotton'
AND sleeves = 'short'
Remember though, none of the other tests had indexes on any of the attribute or value columns. So, I put an index on both the attribute and value columns of that test table. Below are the results for 16,000 products and 200,000 attribute/value's...
My Query = 0.0008 seconds
Typical Query = 0.0006 seconds
Here are queries (starting with "My Query")...
Code:SELECT
products.id,
products.title,
products.base_price
FROM
products
JOIN
(
SELECT DISTINCT
product_id,
attribute,
value
FROM attributes
WHERE
(attribute = 'colour' AND value = 'green')
OR (attribute = 'size' AND value = 'small')
OR (attribute = 'material' AND value = 'cotton')
OR (attribute = 'sleeves' AND value = 'short')
) as attributes
ON attributes.product_id = products.id
GROUP BY
products.id
HAVING
COUNT(*) >= 4
Well there you go. Just shows how important it is to have index's for columns being searched on. It also shows that the "Typical Query" is generally the better option, however, "My Query" can still be useful if you need to find all products with a certain amount of the defined attribute/values. For example (as explained in an earlier post), if I changed...Code:SELECT
products.id,
products.title,
products.base_price
FROM
products
JOIN
attributes as a
ON (a.product_id = products.id)
JOIN
attributes as b
ON (b.product_id = products.id)
JOIN
attributes as c
ON (c.product_id = products.id)
JOIN
attributes as d
ON (d.product_id = products.id)
WHERE
(a.attribute = 'colour' AND a.value = 'green')
AND (b.attribute = 'size' AND b.value = 'small')
AND (c.attribute = 'material' AND c.value = 'cotton')
AND (d.attribute = 'sleeves' AND d.value = 'short')
...to..,Code:HAVING
COUNT(*) >= 4
...in my query above, then it would find all products with 3 or more of the defined attribute/value pairs, which could be handy if you're trying to implement a more flexible, or forgiving method of searching.Code:HAVING
COUNT(*) >= 3
As a note, all tests I've run so far have been on InnoDB tables.
I didn't do any DB flushing. I did alternate the tests though to get the averages (ie. run query1, run query2, run query1, etc).
either flush the database buffers or simply restart the server
Or disable query cache?