Well that was a raging success… not!

Apparently database normalization isn’t something that web developers find all that interesting. (But thanks to malikyte and xhtmlcoder for keeping the question from being a complete ghost town!)

That’s a shame, though — there are all sorts of pragmatic reasons behind good data design. To name just a few: properly designed tables often perform better than their de-normalized brethren, normalized data is much easier to aggregate successfully, and (most importantly) properly designed tables are much easier to understand.

That last one’s really the crux behind normalizing tables. Remember — computers don’t care if we write good code; when we write good code, it’s so that future developers won’t curse our names. Data normalization falls into the same future-proofing category.

Anyway, though: on to the answers. I’ll be brief, I promise.

  1. The FDA’s nutritional content database is — to my utter surprise — actually 3NF (everything has a primary key, and every piece of data appears to be atomic). I have some quibbles with a couple of the design choices, but they’re actually pretty minor. It’s pretty remarkable when you come across data this clean out of the box.
  2. Although the population demographic data is fairly well designed (and easy to munge into better forms), it doesn’t even achieve 1NF: records lack primary keys. This is usually the case with public data, and it stinks; it makes tracking changes from version to version extremely difficult.
  3. The SEC filings were a trick question. They’re in a XML dialect, so normal forms don’t apply.

    I think it’s important to notice how different formats change the way we can produce and consume data; the SEC data is a pretty good example of well done XML, but it would be pretty difficult trying to push this data into a database in any sort of structured way.

    If I wanted to build a site around this stuff, I’d likely use something like Berkeley DB for XML instead of a relational database.

  4. The gas price data, though crammed into a Excel sheet used more for presentation than data management, is actually in 3NF (if only because it’s pre-aggregated data). The data of the measurement is the primary key, and all columns are singularly dependent on the primary key (i.e. price is a function of date, and nothing more).
  5. The Juvenile Arrest Rate data is, like the gas prices, nominally 3NF data crammed into Excel.

Coming up…

Tune in tomorrow for a special super-difficult (I hope) challenge to keep us all occupied over the holidays.

As always, if youâ??ve got a question, puzzle, or challenge that you think would make a good question for this quiz, email me at jacob -at- jacobian.org.

  • malikyte

    Is there any sort of trick in understand how a spreadsheet is displayed in a normal form, and identifying the keys? I seriously thought it was completely presentational and couldn’t follow it a single bit. Perhaps one of my problems was that I’m using the Office 2007 beta and couldn’t figure out where the PivotTable option was to see if the data was laid out from certain aspects…that is, if that’s how you managed to discover the normal form.

    I’m afraid the answer will be as simple as “Google it!”, but I’m imagining there’s more to it than that.

  • http://www.realityedge.com.au mrsmiley

    I think the key to getting people involved is having questions that can stand on their own. The problem with getting involved in this one is it required you to have done your research on the first episode. If they are going to continue to build on each other, you will probably find the participation level dropping as well.

    Episode 3 was a raging success because you could read it, have a crack at it and get back on with work without too much effort. I cant vouch for anyone else reading these, but I check this while I’m at work. I don’t have time for anything more complicated than a “quiz”. As soon as it starts heading down the “exam” route I give up and get back to work.

    There really is two camps here, the designers and the engineers (developers). I read a case study once on creating a viral marketing campaign for engineers. The trick was to create something that was enough of a challenge for the engineer ego that catered for all levels of skill, yet quick enough to complete to increase the participation level. From memory they were asked to design a trebuchet (catapult) to throw a particular object the furthest.

    For majority of web devs databases are something you just have to deal with, not necessarily something they necessarily find intresting. For me personally, I find the storage of the data interesting coming from a content management background, but I couldn’t care less what the form of normalisation it is in. Mind you, I’ve never been one heavy into the theory side of things. Unless you are going to specialise in a particular field or devote 10 hrs a day to research, the world of web development only affords you enough time to learn to get the job done and hopefully learn something for next time. As you so rightly pointed out at the start of this, we are required to be experts in too many areas.

  • malikyte

    I think mrsmiley hit some good points. I too read this while at work. I felt bad that the last entry was a ghost town of replies, so I spent a few moments after I got home to at least do a quick overview of the underlying data. I am interested to learn this stuff, but unless it’s literally homework from a course I’ve paid for, I don’t think I’d necessarily want to spend the time when I could spend that time researching things I’m currently going down the path on.

    I do, however, find any and all previously unknown information to be very valuable, no matter the form it’s in.

  • http://www.xhtmlcoder.com/ xhtmlcoder

    It is good to learn Normalisation and the “Entity-relationship model” even if you don’t use them directly on databases. I suspect the majority of people haven’t studied normalisation so wouldn’t have known how to start to tackle the questions in an efficient manner.

    For a general quiz it probably went a little deep and if I hadn’t learnt normalisation 5-year ago I might have been stumped.

  • lepezdok

    As for the brevity of answers — I would actually prefer to read a somewhat more extended version for each of those examples. It’s like in a lecture if a prof asks a difficult question, and nobody answers, he should probably spend some time explaining. Otherwise, what is the point?

Special Offer
Free course!

Git into it! Bonus course Introduction to Git is yours when you take up a free 14 day SitePoint Premium trial.