Multiple Values in One Column or Many Columns?

Pamela1991 · August 25, 2012, 4:15pm

Hi all,

If I want to store and retrieve the following data in a database, what is the optimal way to do so–I am asking specifically for the “Fruits_you_like” column?

Name
Phone_Number
Email_Address
Fruits_you_like

Fruits_you_like will consist of 30 different fruits. Would it be best to have one column for each fruit and use boolean 1/0 like this:

Name, Phone_Number, Email_Address, Apples, Oranges, Pears, Cherries, Strawberries

Or would it be better to only have four columns (Name, Phone_Number, Email_Address, Fruits_you_like)? If it is best to use only four columns, what should the Fruits_you_like column be? An array? Multi-dimensional array?

Thanks!

Pam

DarthGuido · August 25, 2012, 8:29pm

Neither. This is a one to many relationship (one person can have many favourite fruits), so you’ll need another table, let’s call it fruits_you_like.

The database will become:

users: ID (primary key), Name, Phone_Number, Email_Address

fruits_you_like: userID (foreign key), fruit

That way, you can always add new fruits to the list. And you can even do searches like “all users that like apples” easily.

Pamela1991 · August 25, 2012, 9:11pm

Thanks Guido! That makes sense. Now I understand why WordPress tags are like that in the WP database.

I just have a couple of questions, though. If you have over a million users with each user having at least 5 favorite fruits, then that additional table (fruits_you_like) is going to have well over 5 million rows. If this is the case, will I have to do some kind of additional database optimization due to the high number of records (if 5 million records isn’t considered that high, how many records would be considered high enough that we would have to start doing some other kinds of optimizations, and what optimizations would we have to do?)?

My other question is, let’s say that in the users table there is another column–“country”. Would it be best to simply put the whole country’s name here (e.g. Australia, Benin, Chile, Denmark, etc.) or a number for each country, tying that to a new table:

users: ID (primary key), Name, Phone_Number, Email_Address, country_id
fruits_you_like: userID (foreign key), fruit
countries: ID (primary key), country_name, country_id

If it is better to make a new “countries” table, should there be any ties within the actual database, or will the only ties be made in the PHP code?

Grazie!

Pamela1991 · August 26, 2012, 2:37am

This word (partition) slipped my mind when I wrote the above post, but would partitioning come into play in my example of over 5 million records in a table?

Edit: OK, it seems that for a table of only 5 million records, I won’t need to partition (like this 100 million record table, or [URL=“http://www.sitepoint.com/forums/showthread.php?705726-Wordpress-Database-too-big&p=4787558&viewfull=1#post4787558”]5 billion rows), but I am trying to make sure that I plan and design my database correctly so that I don’t run into any kind of problems in the future that I could have easily avoided with a solid game plan.

Any advice?

Edit again: multi-user database (possible lead)

Dr_John · August 26, 2012, 5:27pm

countries: ID (primary key), country_name, country_id

How many countries are there that have the same name???
None that I know of.
So country_name, or just country to reduce typing, is totally unique.
So why force the need for a join to get at the data??? AND why have an ID and a country_id??? Both of which would also be unique I’d guess.

Really you are just creating a look-up table, something which is used to populate a drop-down list in a data entry form to get consistent spelling. (I’m assuming that your users will be entering data using a form on your site.) So your table called country then has only one attribute/field in it, called country.
There is no need what so ever to invent two other unique attributes to describe / identify something that is already unique by itself.

And of course the same applies if you use a look-up table of fruits you know about to populate a data entry form for the user to use. (although you could also let users suggest a new fruit to be added to the fruit table if they can’t find their favourite one in the drop-down list, say for example a Durian, or Ugli, or Dragon Fruit, or Black Sapote, or Canistel, or Jaboticaba to name a few that aren’t an Apple (watch out for rounded corners) or an Orange.)

Pamela1991 · August 26, 2012, 9:26pm

Dr_John:

countries: ID (primary key), country_name, country_id

How many countries are there that have the same name???
None that I know of.
So country_name, or just country to reduce typing, is totally unique.
So why force the need for a join to get at the data??? AND why have an ID and a country_id??? Both of which would also be unique I’d guess.

Really you are just creating a look-up table, something which is used to populate a drop-down list in a data entry form to get consistent spelling. (I’m assuming that your users will be entering data using a form on your site.) So your table called country then has only one attribute/field in it, called country.
There is no need what so ever to invent two other unique attributes to describe / identify something that is already unique by itself.

And of course the same applies if you use a look-up table of fruits you know about to populate a data entry form for the user to use. (although you could also let users suggest a new fruit to be added to the fruit table if they can’t find their favourite one in the drop-down list, say for example a Durian, or Ugli, or Dragon Fruit, or Black Sapote, or Canistel, or Jaboticaba to name a few that aren’t an Apple (watch out for rounded corners) or an Orange.)

Silly me! I forgot there’s no point in having country_id since there’s already ID!

The reason I was thinking I should make a separate table for country and then put the relevant ID in the users table is because I thought it would be better for the database; instead of having 5 million instances of country names entirely spelled out in the users table, there would be 5 million instances 1-3 digits that represent the respective country. Less redundancy. I had thought this was less strain on the database, but it looks like I’m wrong about that, and that there is actually no need to make a separate table for the list of countries?

Well, how about in the following situation. Let’s say you are using this list–specifically the columns: zip_code, city, state. In this case, would it be best to have my database set up as follows:

users: ID (primary key), Name, Phone_Number, Email_Address, zip_code (foreign key)
fruits_you_like: userID (foreign key), fruit
zip_code: ID (primary key), zip_code, city, state

I say this because although there are many cities with the exact same name, according to the zip code list I linked above, there is a unique zip code for each city (and multiple zip codes for some cities).

So during registration, users would be required to give their name, phone number, email address, zip code, and favorite fruit. Then someone could easily search on the site for all users in Omaha that like Jackfruit because I already have Omaha in the zip_code table tied up with its respective zip code and state.

Or am I complicating things, going about it the wrong way, or otherwise missing out on something?

Thanks!

Dr_John · August 30, 2012, 8:10pm

Fruit is an attribute of the user (the one fruit they like most, on my reading of this).
so it should be in the user table.

The fruit table should be used for look-ups by populating a drop-down list, and as fruit names are unique, you should be using the fruit as the primary key in this table - ie a one column table. Then populate the drop-down for a data entry form via the fruit table. This leads to simpler queries and no need for a join.

Imagine the question what’s your favorite fruit?
Is the answer 123 or Jaboticaba?

If you use a numeric id for a fruit which has a unique name, you have to join two tables together to get at the fruit. In your suggested fruit table above, you would have to join on the userID, instead of simply reading it from the user table. Also, having a table with multiple instances or orange is no more redundant that having multiple instances of fruitID 123. But adding the id to the fruit table creates redundant, unnecessary data in the table. You’re worrying about the physical size of the table when disk space is sooo cheap it isn’t an issue.

You asked if you were missing something - you are over-normalising the user table by pulling out the fruit and then needing to use userid/fruit to identify it, and it seems under the impression that every table needs a numeric id. For example, in your zip code table, every zip code is unique, but you have added a numeric id, thus forcing a join to get at the actual zip code.

So then you have the question what’s your zipcode (in the uk it’s a post code)?
Is it 12345, an id that you have invented and then have to do a join and use to get at the code itself??? And could confuse with a real zipcode. Or is it 00501? So if the zip code was the primary key in it’s own table and the foreign key in the user table, you don’t need a join to be able to report the user’s zipcode. So queries become simpler, and easier to troubleshoot.

so I’d use this set of tables

users: ID (primary key), Name, Phone_Number, Email_Address, zip_code (foreign key), fruit (foreign key)
fruits_you_like: fruit (primary key)
zip_code: zip_code (primary key) , city, state

r937 · August 30, 2012, 9:02pm

sadly, a misapprehension that is rather common

nice post, dr john :award:

Pamela1991 · August 31, 2012, 3:21am

Thanks Dr John!

I had no idea that you didn’t need a separate ID for each table, but that makes perfect sense now. But does it matter what type the primary column is (e.g. VARCHAR, FLOAT, or even BLOB) just as long as the contents are unique (i.e. not BOOL)?

Should the zip_code column be VARCHAR(5) because of zip codes that begin with 0 (e.g. 00501) as INT is removing the initial zeros?

Sorry, in my last post I missed the ‘s’ in fruit. So during registration, users would be required to give their name, phone number, email address, zip code, and favorite fruits (between 1 to 5 of them). As such, would the tables be:

users: ID (primary key), Name, Phone_Number, Email_Address, zip_code (foreign key), fruit (foreign key)
fruits_you_like: userID (foreign key), fruit
zip_code: zip_code (primary key) , city, state

Dr_John · August 31, 2012, 9:25pm

Zip codes are not really numbers - does adding two of them together make sense? Subtracting them? These are not meaningful operations on a zip code, so varchar is correct. It’s the same with telephone numbers.
Also setting them as varchar means they leading 0s will not get dropped.

Once you say five fav. fruits, your fruits_you_like table now makes sense. But calling it just fruitliked would be simpler and less liable to typos in your queries. Also queries on the fruit are now very very much easier than having fruit1,fruit2, fruit3 etc to check.

I’d STILL have a table called fruit to feed the data entry form and to remove typos when spelling they fav fruit.
user–<fruitliked>–fruit , if you know what that means.

PS I would not use a blob as a PK… but varchar and numbers are okay. Not Float, as they are not defined accurately (conversion of decimal numbers to binaries is not always possible, same way that 1/7 doesn’t convert to decimals)

Pamela1991 · September 3, 2012, 12:38pm

Thanks Dr John! You have been very helpful.