By one widely quoted estimate, roughly 90% of the data in the world was produced in the last two years, which should give you some idea of just how much data is being accumulated around the world, especially by large companies like Google. The volume is so enormous that traditional methods of linking, searching and retrieving data no longer work. This is Big Data.
The term “Big Data” was popularized by Roger Magoulas of O’Reilly in 2005, although avid net trawlers have found the term in occasional use as far back as 2001. It is a catchall term for the problems created by the massive accumulation of data in disparate forms, most often collated from internet sources.
Big Data presents a number of problems to developers and analysts, not least the need to file and compare information sources with wildly different structures and contents. This contrasts with the old way of doing things with relational or object-based data, where you know what a record looks like and relational links are concrete.
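To make the "wildly different structure" problem concrete, here is a minimal Python sketch of pulling a common shape out of records that do not share a schema. The field names (`author`, `user`, `from`, `text`, `body`, `message`) and the sample records are hypothetical, invented purely for illustration; real sources will need their own mappings.

```python
# Hypothetical field names: different sources label the same idea differently.
AUTHOR_KEYS = ("author", "user", "from")
TEXT_KEYS = ("text", "body", "message")


def first_present(record, keys):
    """Return the value of the first key that is present and not None, else None."""
    for key in keys:
        if record.get(key) is not None:
            return record[key]
    return None


def normalize(record):
    """Reduce a record of unknown shape to a best-effort common shape."""
    return {
        "author": first_present(record, AUTHOR_KEYS),
        "text": first_present(record, TEXT_KEYS),
    }


# Made-up sample records standing in for three unrelated sources.
records = [
    {"user": "alice", "text": "hello"},                    # e.g. a short social post
    {"from": "bob@example.com", "body": "hi", "cc": []},   # e.g. an email
    {"message": "disk full"},                              # e.g. a log line with no author
]

normalized = [normalize(r) for r in records]
for row in normalized:
    print(row)
```

Note that the third record has no author at all. With relational data that would be a schema violation; here it is simply a gap the analysis has to tolerate.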
To work with Big Data effectively, you need to be able to crawl petabytes of information spread across multiple nodes and reduce it to a human-navigable format. Relationships are fuzzy and flexible, and data structures are not always known ahead of time. The parallel processing required is also an entirely new learning curve for many developers.
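The "crawl in parallel, then reduce" idea can be sketched in a few lines of Python. This is a toy illustration only, not how Hadoop itself is implemented: threads stand in for cluster nodes, a handful of strings stand in for petabytes, and the function names (`map_chunk`, `shuffle`, `reduce_counts`, `word_count`) are made up for this example.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_chunk(chunk):
    """Map step: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_chunks):
    """Group emitted values by key, between the map and reduce steps."""
    grouped = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_counts(grouped):
    """Reduce step: collapse each key's list of values into a single total."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(chunks, workers=2):
    """Run the map step over chunks in parallel, then shuffle and reduce."""
    # Threads are a stand-in for separate nodes (an assumption of this sketch).
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_chunk, chunks))
    return reduce_counts(shuffle(mapped))


if __name__ == "__main__":
    chunks = ["big data big", "data is big"]
    print(word_count(chunks))
```

Each chunk is mapped independently of the others, which is what lets the work be spread across many machines. Only the grouping and reduce steps need to see results from more than one chunk.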
With the emergence of enormous quantities of data came the need to analyze it and make use of it. Most companies around the world are sitting on stupendous amounts of information with no intelligent way to benefit from it.
To this end, Hadoop was born. Hadoop is an Apache Software Foundation project that has produced a software library capable of abstracting and simplifying Big Data queries while handling latency, fault tolerance and asynchronous data availability. Best of all, it can analyze unstructured data as well as structured data, so when it comes to Big Data, Hadoop is the killer app of the day.
Although heavily in use around the world, Hadoop was effectively beta software until very recently. On January 4th, 2012, the project announced its first stable release, Hadoop 1.0, which marks a major milestone in the public's ability to handle Big Data. Since it’s an Apache project, anyone can use Hadoop and build their own solutions on top of it, so the realm of Big Data and massive intelligent searching is more open to developers than ever before.
The software has been stable and in production use in many places for a long time, but the official release of a 1.0 version paves the way for easier adoption in corporate environments and gives more assurance to developers hoping to tie Big Data intelligence into their apps.
We will be publishing articles that introduce Big Data and related concepts such as Google’s MapReduce in more depth in the coming weeks, so keep coming back for your intro to the new world of intelligent computing.