Answers to Episode 4 (“What’s ‘normal’, really?”)

Jacob Kaplan-Moss

Well that was a raging success… not!

Apparently database normalization isn’t something that web developers find all that interesting. (But thanks to malikyte and xhtmlcoder for keeping the question from being a complete ghost town!)

That’s a shame, though — there are all sorts of pragmatic reasons behind good data design. To name just a few: properly designed tables often perform better than their de-normalized brethren, normalized data is much easier to aggregate successfully, and (most importantly) properly designed tables are much easier to understand.

That last one’s really the crux behind normalizing tables. Remember — computers don’t care if we write good code; when we write good code, it’s so that future developers won’t curse our names. Data normalization falls into the same future-proofing category.

Anyway, though: on to the answers. I’ll be brief, I promise.

  1. The FDA’s nutritional content database is — to my utter surprise — actually 3NF (everything has a primary key, and every piece of data appears to be atomic). I have some quibbles with a couple of the design choices, but they’re actually pretty minor. It’s pretty remarkable when you come across data this clean out of the box.
  2. Although the population demographic data is fairly well designed (and easy to munge into better forms), it doesn’t even achieve 1NF: records lack primary keys. This is usually the case with public data, and it stinks; it makes tracking changes from version to version extremely difficult.
  3. The SEC filings were a trick question. They’re in a XML dialect, so normal forms don’t apply.

    I think it’s important to notice how different formats change the way we can produce and consume data; the SEC data is a pretty good example of well done XML, but it would be pretty difficult trying to push this data into a database in any sort of structured way.

    If I wanted to build a site around this stuff, I’d likely use something like Berkeley DB for XML instead of a relational database.

  4. The gas price data, though crammed into a Excel sheet used more for presentation than data management, is actually in 3NF (if only because it’s pre-aggregated data). The data of the measurement is the primary key, and all columns are singularly dependent on the primary key (i.e. price is a function of date, and nothing more).
  5. The Juvenile Arrest Rate data is, like the gas prices, nominally 3NF data crammed into Excel.

Coming up…

Tune in tomorrow for a special super-difficult (I hope) challenge to keep us all occupied over the holidays.

As always, if youâ??ve got a question, puzzle, or challenge that you think would make a good question for this quiz, email me at jacob -at-