Episode 4: What’s “normal,” really?

By Jacob Kaplan-Moss
We teamed up with SiteGround
To bring you the latest from the web and tried-and-true hosting, recommended for designers and developers. SitePoint Readers Get Up To 65% OFF Now

Sorry about the missed week, fellow puzzlers — real life, and all that — I’ll try not to let it happen again.

Anyway, let’s get right to this week’s question. A few weeks ago, I posted a scavenger hunt for public data (answers); today we’ll return to dealing with that data.

Of course, finding a viable source of data is only the first step; once you’ve figured out what to use, you have to figure out how to use it. Since I’m a certified database geek, the first thing I do once I’ve got some sweet data in hand is start thinking about database design.

When we talk database design, we’re usually talking about formal database normalization, and specifically first, second, and third normal forms. Although I’ll be the first to admit that often formal normalization needs to take a back seat to pragmatic design or performance requirements, we’ll ignore that big caveat this week and plunge ahead.

Here, again, are the five data sources we located in the scavenger hunt:

  1. Nutritional content of food from the USDA.
  2. (Links to) population demographics of every major city in the US, courtesy of the US Census Bureau.
  3. The latest SEC filings (in RSS, no less) straight from the horse’s mouth.
  4. Historical gas prices, from the Energy Information Administration (which I had never heard of until writing this quiz).
  5. Juvenile arrest rates from the Office of Juvenile Justice and Delinquency Prevention (part of the Department of Justice).

So, which normal form are each of these sources in (and why)?

We’ll discuss the answers and a bit more about the implications of database normalization this weekend.

Bonus challenge!

For an extra challenge, pick one of the sources and define a fully normalized (i.e. 3NF) schema for it. There’s not in any way a “right” answer here, but if anyone’s brave enough to post their schemas, I’ll critique ’em when we go over the answers.

Got a question of your own?

As always, if you’ve got a question, puzzle, or challenge that you think would make a good question for this quiz, email me at jacob -at- jacobian.org.

We teamed up with SiteGround
To bring you the latest from the web and tried-and-true hosting, recommended for designers and developers. SitePoint Readers Get Up To 65% OFF Now
  • malikyte

    I’m extremely glad to see the previous comment(s) were erased. Just because one person doesn’t understand something does not mean it isn’t important to others…

    I never realized how difficult reverse engineering a schema could be to determine what the actual field names (the only identifiable description, unless I go through the related documentation, if any exist) truly mean. If nothing else, I’ve learned that using fully descriptive names for fields and tables is important.

    I’m just going to quickly skim through these. Taking a look at the Nutritional content of food‘s Access database, I’m thinking that it’s in 3rd normal form. There are relationships and even (it appears) lookup tables. Without fully studying the fields and their related information, I won’t be sure what form it’s in, but I’ll assume 3rd. I did not look at the ASCII files, but did see that each file apparently represented a table, so they too would (I assume) be in 3rd normal form.

    I’m not even sure if the population demographics meets the requirements of the first normal form. They are separated by State from the index page, but after that they are simply zipped CSV files (at first glance).

    I’ve no idea on the SEC filings…
    I’ve also no idea on the gas prices (can’t understand Excel being in any form), but the link is pointing incorrectly. :)
    Juvenile arrest rates, also a spreadsheet…

    Most of these sites I would probably screenscrape and create my own database schema, using only the information I’d need, dropping anything that seemed unnecessary for the task at hand. Unfortunately, I can’t figure out what normal form they’re in. :(

  • They should be 1NF at a minimum if they are in ‘database format’ but it appeared some were not. Although it may be possible to normalise some of the data.

    No 4. malformed URL and I assume it is supposed to go: tonto.eia.doe

    If they are using databases; most people usually create a relational database to at least 3NF if they want integrity. The largest Relational Database I ever did was a College Computer Fault network tracking database.

    Well, when I opened the monster 77 MB sr19 Access database it looked like they had attempted to separate the data but had not actually added the relationships or fully completed normalisation of several of the tables, since there were missing primary keys, etc.