Amazon Adds New Service: Public Data Sets

Amazon announced the release of a new web service today that aims to make open, public data sets easier to access. Public Data Sets on Amazon Web Services will attempt to make a wide range of public data available for free use by anyone. Users can interact with data sets via an Amazon EC2 machine image and pay only for their compute time; they won’t have to worry about storing, downloading, or cleaning the actual data.

According to Amazon business development manager Deepak Singh, the new program “significantly lowers the barrier for researchers and data analysts to access and use some of the most commonly used data sets in their communities.”

Previously, using the type of large data sets that Amazon plans to host for research purposes was a tedious, multi-step affair. Researchers needed to locate the data, download it, and then often convert, clean, or customize it into a usable format for their needs. Sometimes just downloading the data is a huge barrier. One of the data sets on Amazon, for example, is a MySQL database from life sciences project Ensembl, which maintains an “automated annotation on a number of eukaryotic genomes.” That data set weighs in at a mammoth 650 gigabytes and contains 31,000 files. The technical logistics of wrangling a database that large would be an insurmountable hurdle for many researchers with limited resources.
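To put that download barrier in perspective, here is a rough back-of-the-envelope sketch in Python. The 650 gigabyte figure comes from the Ensembl example above; the link speeds and the `download_hours` helper are assumptions chosen purely for illustration, and the estimate ignores protocol overhead, per-file latency across the 31,000 files, and any time spent loading the data into MySQL.

```python
def download_hours(size_gb: float, mbit_per_s: float) -> float:
    """Hours needed to transfer size_gb gigabytes at mbit_per_s megabits per second."""
    size_bits = size_gb * 1e9 * 8             # decimal gigabytes -> bits
    seconds = size_bits / (mbit_per_s * 1e6)  # megabits per second -> bits per second
    return seconds / 3600


ENSEMBL_SIZE_GB = 650  # size quoted for the Ensembl data set

# Hypothetical sustained link speeds, not figures from Amazon or Ensembl.
for speed in (10, 100, 1000):
    hours = download_hours(ENSEMBL_SIZE_GB, speed)
    print(f"{speed:>5} Mbit/s: {hours:7.1f} hours ({hours / 24:.1f} days)")
```

On a sustained 10 Mbit/s connection the transfer alone works out to roughly six days, before any conversion or cleanup begins.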

Now, the data will be available for use across the entire ecosystem of Amazon web services with almost no work on the part of researchers to get up and running. Amazon hopes that developers will create public tools to analyze the data and mash it up with other sources, and that by making data more easily available to a wider range of people, the project will help to foster innovation.

Amazon has a wide range of public data sets available now and plans to add more in the future.

At launch, or shortly after, Amazon’s service offers human genome and DNA sequencing data from Ensembl and the National Center for Biotechnology Information; chemistry data from Indiana University; and economic data from the US Census Bureau, the Bureau of Labor Statistics, the Bureau of Transportation Statistics, and the Bureau of Economic Analysis.

How will you use the data Amazon is making available? What types of mashups would you like to see created? And what sort of data would you like to see added? Let us know in the comments.