Answers to Episode 1 (Scavenger Hunt)

Jacob Kaplan-Moss

Welcome back, scavengers!

If you missed it, this week’s challenge deals with finding computer-readable public data resources. Before getting to the answers, though, let’s talk a little about technique.

Finding public data

By law (in the US), much of the data produced by government agencies must be made available publicly. As you might expect, however, this is often the last thing an acronym’d agency wants to think about. Thus, even when data is made available on the web, it’s often provided only in formats that are difficult to parse, on websites that are difficult to find.

Google does a pretty good job of penetrating this maze of government websites. Most of those who commented on the original question were able to find at least a few sources using Google. For me, at least, a good deal of poking around and trying searches with different keywords was required.

Once at the right place, most people had no trouble finding data in a form at least nominally parseable. That’s a good sign; in this age of Microsoft Office, I often have to fight IT departments to get access to data in a format suitable for parsing into a database. I’m glad to see that the people responding to my question have a good grasp of what constitutes a friendly format.
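To make “friendly format” a little more concrete, here is a minimal sketch of what loading a delimited text file into a database can look like. Everything in it is an assumption made for illustration: the sample rows are invented, and the table name, column names, and the load_csv helper are hypothetical rather than anything taken from the quiz sources.

```python
import csv
import io
import sqlite3

# Invented sample data; a real file would come from an agency download.
SAMPLE_CSV = "name,population\nSpringfield,1000\nShelbyville,2000\n"


def load_csv(conn, csv_text):
    """Load name/population rows from CSV text into a 'cities' table.

    Returns the number of rows inserted. The schema is hypothetical.
    """
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    conn.execute(
        "CREATE TABLE IF NOT EXISTS cities (name TEXT, population INTEGER)"
    )
    conn.executemany(
        "INSERT INTO cities (name, population) VALUES (?, ?)",
        [(row["name"], int(row["population"])) for row in rows],
    )
    return len(rows)


conn = sqlite3.connect(":memory:")
loaded = load_csv(conn, SAMPLE_CSV)
total = conn.execute("SELECT SUM(population) FROM cities").fetchone()[0]
```

A plain CSV file needs about a dozen lines of standard-library code to get into a queryable table; a spreadsheet or word-processor document usually needs a conversion step before any of this can start.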

A few readers had some nice tips for finding government data:

  • malikyte pointed out that “advanced filters can help quite a bit when you know what form of information you’re looking for, especially if a government organization is most likely involved. With [G]oogle, for instance, you can specify in the search terms: “sec filings” or “sec filings”; limiting your search results goes a long way in removing unimportant data.” I hadn’t realized that Google’s site: operator could be used on TLDs; thanks!
  • WindUpDoll easily found demographics for Wisconsin through her girlfriend, who works for the city. I don’t in any way consider this cheating; nearly all of the cool work that we do at work is made possible by an inside connection. If you’re in the business of dealing with public data, friends on the inside are key.
  • dmbfansim mentioned the redundantly-named-yet-useful, “The U.S. Government’s Official Web Portal”. More specifically, the reference center is an invaluable resource.
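The site: tip above is easy to script if you find yourself trying many keyword variations. The following is a rough sketch, not a record of the searches anyone actually ran: the build_search_url helper is hypothetical, and the ".gov" and "sec.gov" values are assumed examples of a TLD filter and a single-site filter (the exact queries in the quoted comment did not survive).

```python
from urllib.parse import urlencode


def build_search_url(phrase, site=None):
    """Return a Google search URL for an exact phrase.

    If `site` is given, a site: filter is appended; it can be a full
    host name or just a TLD suffix. Both values used below are
    illustrative assumptions.
    """
    terms = ['"{}"'.format(phrase)]
    if site:
        terms.append("site:{}".format(site))
    query = " ".join(terms)
    return "https://www.google.com/search?" + urlencode({"q": query})


# Restrict to every site under one TLD, or to one specific site.
gov_url = build_search_url("sec filings", site=".gov")
sec_url = build_search_url("sec filings", site="sec.gov")
```

Quoting the phrase and adding the filter in one place makes it simple to loop over a list of keywords and open each result page in turn.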

Finally, a wonderful clearinghouse for government data is; I found the questions for this quiz starting at that site.

The answers

Right, enough dallying; here are the answers. In some cases there were multiple sources found (by readers or by me); I’ve only provided one below:

  1. Nutritional content of food from the USDA.
  2. (Links to) population demographics of every major city in the US, courtesy of the US Census Bureau.
  3. The latest SEC filings (in RSS, no less) straight from the horse’s mouth.
  4. Historical gas prices, from the Energy Information Administration (which I had never heard of until writing this quiz).
  5. Juvenile arrest rates from the Office of Juvenile Justice and Delinquency Prevention (part of the Department of Justice).
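Part of why an RSS feed (as in answer 3) counts as a friendly format is that it can be read with nothing but a standard library. Here is a small sketch of pulling titles and links out of an RSS 2.0 document. The sample feed is invented for the example and its example.com URLs are placeholders; it does not reproduce the real SEC feed, and parse_items is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Invented RSS 2.0 sample; not the contents of any real feed.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example filings feed</title>
    <item>
      <title>Form 10-K: Example Corp</title>
      <link>https://example.com/filings/1</link>
    </item>
    <item>
      <title>Form 8-K: Sample Inc</title>
      <link>https://example.com/filings/2</link>
    </item>
  </channel>
</rss>"""


def parse_items(feed_text):
    """Return a list of (title, link) tuples, one per <item> element."""
    root = ET.fromstring(feed_text)
    return [
        (item.findtext("title", default=""), item.findtext("link", default=""))
        for item in root.iter("item")
    ]


items = parse_items(SAMPLE_FEED)
```

To use this against a live feed you would first fetch the feed text (for example with urllib.request) and pass it to parse_items; the parsing step stays the same.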

Was it good for you, too?

Next time…

Come Tuesday, we’ll tackle a tool that’s perhaps the most powerful text-processing engine known to man: regular expressions. Now you have two problems.

See you then.