It’s not difficult to find an online tutorial on creating a database. It’s not hard to find a tutorial on how to create queries. In fact, it’s not hard to find tutorials on syntax for SQL, the differences of data types, and even which database is best for what purposes. It is, however, difficult to find good tutorials on the theories of designing, developing, and maintaining a quality database.

Database design and management isn’t very difficult. People much wiser than we have designed some very orderly and sound rules to follow and developed these rules into what is called the Normalization Process.

Using this process, you can create brand new, fully functional, finely tuned databases or take current database tables, run them through these steps, and come out with well-oiled tables ready to fly. However you use these steps, they are the fundamentals of quality database design.

Functional Dependence

Before we jump into the Normalization Process, I should take a step back and clear a few things up. First, this is not specific to any one type of database. These are rules that should be followed when using any database system, whether it is Oracle, MySQL, PostgreSQL, SQL Server, etc.

Let us first discuss Functional Dependence, which is crucial in understanding the Normalization Process. This is merely a big term for a relatively simple idea. To illustrate it, lets take a look at a small sample table.

Sales Rep Number Pay Class Rate
001 1 .05
002 1 .05
003 1 .05
004 2 .07
005 1 .05
006 3 .09

This relatively simple table is a good example of functional dependence, it can also be used to illustrate a point.

Definition: A column is functionally dependent on another column if a value ‘A’ determines a single value for ‘B’ at any one time.

Sound confusing? Let me explain. The field ‘Rate’ is functionally dependent on the field ‘Pay Class’. In other words, Pay Class determines Rate.

To determine functional dependency, you can think of it like this: Given a value for Field A, can you determine the single value for B? If B relies on A, then A is said to functionally determine B.

Taking the same table as above, lets add to it.

Name Sales Rep Number Pay Class Rate
Ward 001 1 .05
Maxim 002 1 .05
Cane 003 1 .05
Beechum 004 2 .07
Collins 005 1 .05
Cannery 006 3 .09

Now, lets look at this table and find some more Functional Dependencies. We already know that Pay Class determines Rate. We can also say that Sales Rep Number determines Last Name. Only one Sales Rep Number for each Last Name. This fits the definition of a Functional Dependency.

But does Last Name functionally determine anything? At first glance, some people might say yes, however, this is not true. Currently, you can say that Ward will only give you one Sales Rep Number, however, what if we hired another person with the name Ward? Then you would have two values for your Sales Rep Number, and then Last Name would no longer functionally determine anything.

On Keys

Now that we know what functional dependence is, we can clarify keys. Now, if you are working in databases, you probably already know what Primary Keys are. But, can you define them?

Definition: Column A is the primary key for table T if:
Property 1. All columns in T are functionally dependent on A
Property 2. No sub collections of the columns in table T also have Property 1.

This makes perfect sense. If all your fields in a database are dependent on one and only one field, then that field is the key. Now, occasionally Property 2 is broken, and two fields are candidates for the Primary Key. These keys are then called candidate keys. From these candidate keys, one key is chosen and the others are called alternate keys.

For example, in the same table as before:

Name Sales Rep Number Pay Class Rate
Ward 001 1 .05
Maxim 002 1 .05
Cane 003 1 .05
Beechum 004 2 .07
Collins 005 1 .05
Cannery 006 3 .09

Our primary key is the Sales Rep Number, as it fits the definition of a Primary Key. Everything in the table is dependent on the Sales Rep Number, and nothing else can claim the same thing. Now, let us take this one step further, and assume that we also have the Social Security number of the employee in the table as well.

Name Sales Rep Number Pay Class Rate Soc.Sec. no.
Ward 001 1 .05 133-45-6789
Maxim 002 1 .05 122-46-6889
Cane 003 1 .05 123-45-6999
Beechum 004 2 .07 113-75-6889
Collins 005 1 .05 121-44-6789
Cannery 006 3 .09 111-45-9339

Now, we have two Candidate Keys, Sales Rep Number and Social Security Number. So, what we have to decide is which field to use, as both will be unique. In the end, it would be best to have the Sales Rep Number as the Primary Key for various reasons.

First Normal Form

Now that we have clarified these concepts, lets move into the Normalization Process. The First Normal Form is defined as a table that does not contain repeating groups.

For example:

Order Number Order Date Part Number Number Ordered
5245 3/02/99 35436 23
5246 3/02/99 32116 11
5247 4/02/99 32133 52
12311 10
5248 4/02/99 46563 1

Now, from that example, you should easily see the problem. First off, in this setup, Order Number is considered the Primary Key, but this is not entirely true. The true Primary Key is Order Number and Part Number. In every order, a part will only ever be ordered once (though the amount ordered can be greater than 1). However, multiple parts can be ordered in on Order Number. So, when we retool this table, we end up getting this:

Order Number Order Date Part Number Number Ordered
5245 3/02/99 35436 23
5246 3/02/99 32116 11
5247 4/02/99 32133 52
5247 4/02/99 12311 10
5248 4/02/99 46563 1

This is much easier to understand, and is a much better design then the previous table. And this time, the Primary Key is both the Order Number and Part Number, and everything else in the table depends on that key.

This was the First Norm Form, or 1NF. Most people can do this right, however, I have seen some tables designed without this in mind, and for a while, it looks like things are all right, but when you start filling the table up with lots of information, it can become quite cumbersome. Definitely, not a shining point in someone’s career, but a most valuable lesson indeed.

Second Normal Form

First, lets just say this. The table is automatically 2NF if its Primary Key contains only one column. That was easy, wasn’t it? But then, if your Primary Key has more than one column, read on.

Lets jump right into the table here. Primary Keys are Order Number (Order #) and Part Number (Part #).

Order # Order Date Part # Part Desc. Number Ordered Price
5245 3/02/99 35465 Gas Key 1 $10
5246 3/02/99 65466 Lawn Chair 4 $25
5246 3/02/99 65473 Picnic Table 1 $30
5247 3/02/99 44654 Lawn Darts 1 $45
5248 4/02/99 44665 Volleyball 3 $20

At first glance, everything is okay with this table. However, there are some problems that need to be resolved. Lets first map out the functional dependencies.

Order Number -> Order Date
Part Number -> Part Description
Order Number, Part Number -> Number Ordered, Price, Order Date, Part Description

As you can see, this is a big mess, and needs to be taken care of. In fact, there are other problems with this table as well.

First, a change to the Part Description in this table would also mean a change in other tables, like the Products table. This is obviously cumbersome to code, and cumbersome to run.

Second, you could very well have different descriptions for the same part. This is simply not acceptable in a business environment.

This is because Part Description is dependent on only part of the primary key, and that is Part Number. Part Description is therefore a nonkey attribute. A nonkey attribute is simply a column that is not a part of the primary key.

This relates specifically to the 2NF definition, which is a table that is in the 1NF and no nonkey attribute is dependent on only a portion of the primary key. This of course reaffirms the fact that if the table only contains one column in its primary key, it is in second normal form.

So, how do we solve this problem? Well, we already solved this problem above. Here are the three new tables for the one above, with the first part being the primary keys and the second part being the other fields in the table.

Order Number -> Order Date
Part Number -> Part Description
Order Number, Part Number -> Number Ordered, Price

Those are our new tables. Notice that I took out the Order Date and the Part Description, as there is no need to use data that is already stored in other tables. Also note that you only have one description for each part. Quality database design needs no redundant data.

Third Normal Form

So, you think your table is clear now. You think it is ready for the big leagues? No…not yet. The 3NF is ready to take on new challenges. Lets go right to our example table:

Customer # Name Address Credit Sales Rep # Sales Rep Name
2343 Bob 2343 Wee St. $1000 03 Dave
2344 Daniel 1230 Wow St. $3000 04 Pam
2345 Kate 33 Soho $5000 03 Dave
2346 Less 99 More Is $1000 07 Garrett
2347 Dan 1 Less St. $500 09 Peter
2348 Bruno 45 Candy Ln. $2000 03 Dave

Now, you probably recognize a few problems, but be assured, this is 2NF. Every field depends on the Customer Number. For example, Customer Number 2345 will only have one name, address, credit, sales rep number, and sales rep name associated with it. However, this doesn’t mean the table is ready for the prime time.

First, we need to define a determinant as any column or collection of columns that determine another column. By this definition, and primary key, or any candidate key will be a determinant. Also, that would make Sales Rep Number a determinant, but it’s not the candidate key or a primary key.

So, how does this relate to the 3NF? Well, the definition of the 3NF is a table that complies with the 2NF (and of course, the 1NF) and if the only determinants it contains are candidate keys. This does, of course, include the primary key.

Now, you have just discovered the problem with the table: the determinant of Sales Rep Number. So how do you go about getting rid of it?

First, track down all the determinants that are NOT a candidate key. Then, remove all the fields that rely on this determinant key, but keep the determinant key in the table. Then, with all the fields you removed, put them into a table, with the primary key being the determinant key you left in the main table. So, that would mean our original table of:

Customer #, Name, Address, Credit, Sales Rep #, Sales Rep Name

Would be turned into these two tables:

Customer #, Name, Address, Credit, Sales Rep #
Sales Rep #, Sales Rep Name


Remember, these two new tables need to comply with 1NF, 2NF, and 3NF as well, so double check them, though if you are up to this point, you should be doing okay.

Fourth Normal Form

Finally, we are up to the big one. 4NF is the father of the forms, as it is the be all and end all. This takes care of any problem that may occur. Lets start this time by defining a very important term, multivalued dependence (MD). MD is when field B is multidependent on A if each value of A is associated with a specific list of values for B, and this collection is independent of any values of C.

Lets take a deep breath, and illustrate this with a simple table. Lets assume that Faculty represents the guidance counselor for any number of students represented by Student Number. Also, we can assume that the faculty member will be a part of any number of committees, and we want to store this as well. So, this is what we come up with.

Faculty Student Number Committee Code
1243 2343 ADV
  2345 PER
1553 3243 ADV
4003 3408 HSG

Now, passing this to 1NF would give you this result:

Faculty Student Number Committee Code
1243 2343 ADV
1243 2345 PER
1243 2345 HSG
1553 3243 ADV
4003 3408 HSG
4003 4095 HSG
4003 4403 HSG

But, you can clearly see the problem developing there as well. This is what MD is. If Faculty number 1243 was no longer counseling Student Number 2345, and we deleted the data, then the Committee information would be deleted as well. This is not quality.

A 4NF is 3NF compliant, and there are no multivalued dependencies.

You would handle this problem like in 2NF and 3NF, by splitting the table up into smaller tables with each containing the field that multidetermined them, and in this case, that would be the Faculty field. So, the new tables would be:

Faculty, Student Number
Faculty, Committee Code

Truly, that was a lot of work, but it is well worth it. Making sure your tables are optimized is key to quality design. Taking care of the problem before it becomes a problem is vital to managing a database.

Before you sit down to design the database, gather all the information you want to include in the database. I mean everything. Go around to each department of the company (or just write it out yourself if this is just for you) and find out what everyone wants in the database. Once you have everything, bring it back, and create one huge table.

From there, break that table down to 1NF, then 2NF, and so on. Go back over each table, and make sure they all work together, and are all 4NF tables. If they aren’t, then it can be assured the tables will suffer problems in the future.

Quality is in the design. And for those people who know, this helps comply with Codd’s first 2 rules for a truly relational database system.

Happy coding.

Jason Lotito
Jason is the Webmaster for Newbie Network, where those who can, teach, and those who can't, learn. Focusing on the newbies of the Net, is expanding to help newbie developers learn code.

No Reader comments

Related books & courses
Available now on SitePoint Premium

Preview for $1