Is it valid to split up a table for performance?

I am laying out the tables for my e-commerce site and am on the Customer table right now.

This table will not only have things which physically describe the Customer (e.g. FirstName, LastName, Age, etc.) but also a lot of “Customer Preferences” (e.g. ReceiveMailingsYN, FamilyOrientedOnly, ReceiveRecommendationsYN, ReceiveEzineYN, ShareInfoWithPartnersYN, CommunicationPreference, etc.)

Would it be okay to split my proposed “Customer” table up into two tables like this:

Customer

ID
Email
Password
FirstName
LastName
Age
YearOfBirth

CustomerPreferences

ID**
ReceiveMailingsYN
FamilyOrientedOnly
ReceiveRecommendationsYN
ReceiveEzineYN
ShareInfoWithPartnersYN
CommunicationPreference

**When a new Customer is created, an AutoNumber would create the “ID” in the “Customer” table and then that “ID” would be copied into the “CustomerPreferences” table using a stored procedure.

The benefit of this approach - possibly - is that I would have a streamlined table (i.e. “Customer”) to work with in most instances, and yet I would still have access to the Customer’s Preferences when needed.

This may be a really bad idea, but it just came to mind, since I fear having just one “Customer” table could end up having 100 fields in it.

Would do you think?

TomTees

So are you suggesting that if 90% of the requests need only five columns and the other 10% need all 50 columns that it is more efficient to split it into two tables and use a join for the 10% that need to read all the columns?

Please explain where the cost difference is that makes it slower to extract five fields out of a row that contains 50 columns compared to extracting the same five fields out of a row that only contains the five columns you are extracting.

yes there is

It does if the alternative is to join multiple tables. Accessing one table is always faster than accessing two to retrieve the same information so if you are really doing lots of reads of certain information then it may well be worthwhile to make the table wider so as to avoid having to constantly join tables together.

If you are not doing huge numbers of reads then splitting the table into what normalisation indicates is appropriate will do away with data duplication and hence make updates faster at the cost of making reads slower due to having to join multiple tables together.

The fastest read accesses will be to a short wide table (regardless of what fields you actually need to read since there is no cost difference between retrieving one field or 50) while the fastest writes are the ones where you don’t have to write the same thing multiple times…

Well this is sort of why I asked this question.

Maybe the certain sets of fields are more commonly accessed and others less so, and so splitting one “god table” not only makes it easier to manage, but may also improve performance.

Does having a lot of fields in a table degrade performance?

I mean obviously it is easier to search a table with 10 records versus 10 million.

But what about a table with 10 fields versus 100 or 1,000 fields?

If a “wide” table (i.e. a large # of fields) doesn’t affect performance, then that may be a deciding factor for me.

TomTees

it’a good topic to discuss. Actually it’s good if you have more tables in the database. You can handle and manage all the details according to the table names and it easily accessable. So it’s good to have more tables.

But it is still less efficient than having everything in the one table so that the SELECT subquery is unnecessary.

As Wyatt said - if you are going to split it into two tables then you split it between active and archived rows, you don’t split the columns.

For performance, almost certainly not.

For flexibility, perhaps (but watch out, things can get messy)

If you have a lot of different preferences that vary from customer to customer, preferences that most queries aren’t using, you might save some disk space by using two tables. (which in turn, might save some disk I/O and thus improve performance) of if the preference names are generated “on the fly” multiple tables are good.

But if you’re doing a lot of “show me customers with X preference set” then you’ll probably want them in a single table.

Of course, if a preference can have multiple values (one to many), you should split them.

Keep in mind, queries get much more complicated when you flip tables on their sides, which can degrade performance.

there has to be a join

That was what I was referring to. JOINs are more expensive than a SELECT query.

EDIT:

By doing a SELECT sub query instead of a JOIN you save the overhead of having to join all the rows and then filtering out what’s needed. So you are not scanning 2x the amount of rows.

if you’ve always got a 1:1 relationship between tables, why are there two tables?

Different customers may have different privileges. It all depends on the application at hand to really be able to answer that question.

This isn’t necessarily true.

Because a Customer may / will have a Preference all you need is a sub query in order to find out the information that is needed. And because there is one-to-one relation, using a primary key will keep the query optimized.

And how are the relations found between rows in the table? By row scanning the indexes for matches. :smiley:

I’m going to roughly paraphrase something Rudy stated in another thread - if you’ve always got a 1:1 relationship between tables, why are there two tables?

do a search for the phrase “premature optimization”

:cool:

I agree with Wyatt here. Splitting the table will only slow things down as now when you need the additional info, there has to be a join, which entails a scan of an entirely new index. So if you’ve got 10,000 users, you’re going to scan two 10,000 row indexes instead of one.

Using a relational database is not like the old days of hierarchical databases or flat text files. If you query for specific fields, then only those fields will be used/returned. It won’t return the whole record and only then filter out what you need.

A separate table will degrade performance, if anything, at that scale.

Most of the time, people “splitting” tables for performance are doing some sort of partitioning where they push old/stale/underutilized rows to a different table stored somewhere else.

And perhaps you two should take your personal issues offline so that the thread can continue on topic…

Well first you said “yes there is” and now you say “no i’m not” with regard to the exact same thing that is being discussed.

So which is it?

Is there a cost difference between having one or fifty rows in one table when you compare that to the alternative of having the columns split into multiple tables or isn’t there?

I still say that the difference between one and 50 columns in one table is nothing compared to the overhead of splitting the data into multiple tables.

Perhaps you ought to try reading everything that is said prior to jumping in, taking something out of context and contradicting that without regard to the fact that it is true in context.

You mean that I give an explanation and you deliberately misinterpret what I said in order to argue semantics rather than adding anything of significance to the original discussion.

Perhaps you ought to learn to read what is said a bit more carefully first before you make posts that are simply arguing semantics (or as in this case pulling something out of context in order to say it is wrong when in actual fact you agree that in the context to which it applies that it is actually right).

we’ve done this before, stephen (i gave an example which ultimately led you to try to change the meaning of natural key), and i’m not interested in doing it again

maybe we should let tom ask further questions on this topic

but which way is the cost difference? In one statement you imply that having the 50 columns in the one table costs more and then you implyt that splitting it into separate tables costs more.

Can you provide one example of where splitting 50 columns into separate tables will actually reduce the processing time taken to read the data and where having fewer columns in each table will therefore cost less than placing them all in the same table (note I said “read” since obviously normalising the data into separate tables increases the efficiency for “write” operations).

Under what circumstances does having more tables with fewer columns cost less than having one 50 column table when you are reading the data in order to justify your “yes there is” comment.