The entire quarter-billion-record GDELT Event Database is now available as a public dataset in Google BigQuery.
This is the sentence at the top of the release post, and it’s a really big deal.
The Global Database of Events, Language and Tone is one of the largest datasets on the planet. It is the quantitative database of human society, relying on thousands of news sources from every corner of the globe dating back to 1979.
It was thought up by Kalev Leetaru, who is also the author of the Google release post referenced above. GDELT covers every country in the world over a span of a third of a century, with daily updates throughout that period. It holds hundreds of millions of records, each with 59 fields describing in detail the actors involved and the events that took place. Every record is georeferenced, so you can place it on the globe, and all actors are tagged with the appropriate ethnic and religious affiliation. All this is free and available for your perusal, and you don’t even need the computing power to handle it yourself.
Google BigQuery, “Google’s powerful cloud-based analytical database service”, is basically the world’s fastest SQL engine, and it’s completely free for any and all uses of GDELT. Thanks to the sheer power of BigQuery, you can get results on GDELT queries in near real-time, and almost no permutation of fields and values you can think of will grind it to a halt – unless you really mess things up and go against the grain. If you deal with databases in any capacity and the following paragraph doesn’t send chills down your spine, you’re probably dead:
For us, the most groundbreaking part of having GDELT in BigQuery is that it opens the door not only to fast complex querying and extracting of data, but also allows for the first time real-world analyses to be run entirely in the database. Imagine computing the most significant conflict interaction in the world by month over the past 35 years, or performing cross-tabbed correlation over different classes of relationships between a set of countries. Such queries can be run entirely inside of BigQuery and return in just a handful of seconds. This enables you to try out “what if” hypotheses on global-scale trends in near-real time.
Currently, GDELT on BigQuery is updated daily, but there are plans to move to a near real-time update schedule – updating the dataset every 15 minutes.
Before you get too excited – there is a limit, but it’s not a quota you’ll easily hit. To read more about free quotas, see here and keep in mind you can always pay for more if you actually develop a commercially viable application on top of this data.
Running a sample query
You can start playing around with GDELT on BigQuery by visiting this URL – you might have to make a new project if you don’t have one already. After gaining access, you should see a screen not unlike the following:
To run the sample query from the release post, click the red “Compose Query” button, paste the SQL into the newly opened textarea and click “Run Query”. Mine took 20 seconds, yours may take anywhere from 5 to 30, but you should get a result not unlike this one:
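If you’d rather start with something smaller than the release post’s sample, here is a minimal Python sketch that only assembles a query string you can paste into the query box. It is not the query from the release post. It assumes the public table can be addressed as `gdelt-bq.full.events` and has `Year` and `Actor1CountryCode` columns, and it is written in standard SQL, so you may need to switch dialects if your console defaults to legacy SQL. The helper name `events_per_year_sql` is made up for illustration.

```python
def events_per_year_sql(country_code, table="gdelt-bq.full.events"):
    """Build a standard-SQL query counting events per year for one actor country.

    Assumptions: the table name and the Year / Actor1CountryCode columns
    match the public GDELT events table; adjust them if your copy differs.
    """
    # Accept only plain three-letter ASCII codes so the value is safe to inline.
    if not (len(country_code) == 3 and country_code.isascii() and country_code.isalpha()):
        raise ValueError("expected a three-letter country code such as 'USA'")
    return (
        "SELECT Year, COUNT(*) AS events\n"
        f"FROM `{table}`\n"
        f"WHERE Actor1CountryCode = '{country_code.upper()}'\n"
        "GROUP BY Year\n"
        "ORDER BY Year"
    )


print(events_per_year_sql("usa"))
```

The query touches only two columns, which should keep it well inside the free quota discussed earlier.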
Using it with PHP
To see how you can use BigQuery and PHP, stay tuned on SitePoint for articles that target that specific combination – they’re coming in June. For now, you can check out this excellent Lever.rs post that runs through it in a very approachable manner.
In a nutshell, you need to use the PHP library Google provides and install it with Composer or through alternative means. Once done, you need to include the lib in your project as you normally would, through Composer’s autoload file, and you can start using the API.
For a full introduction on how to get started, obtain API keys and get deep into using Google APIs for access to BigQuery and similar services, please see this guide. You can also RSS subscribe to the Google App Engine tag and you’ll be instantly notified of new posts in that category.
The GDELT project has long been an admirable one, but the advent of its BigQuery release marked a new milestone – a general availability to the public never before seen. Everyone now has the ability to query the world’s history, and we can’t wait to see what you build – judging by Kalev, the author, neither can the GDELT team. They’re inviting you to share your queries and experiments with them and if impressive enough, they just might share them with the world on the official blog. If you do come up with anything stunning, let us know – we’re keen to publish tutorials and analyses on it!
Frequently Asked Questions (FAQs) about Google’s BigQuery and GDELT
What is Google’s BigQuery and how does it work?
Google’s BigQuery is a web service for storing and analyzing big data, and it’s part of the Google Cloud Platform. You query it with SQL, and because it is fully managed, there is no infrastructure to maintain and no need for a database administrator. It allows users to run SQL-like queries across multiple terabytes of data, and it’s designed to be highly scalable and flexible, allowing for fast analysis of large datasets.
What is GDELT and how is it related to BigQuery?
GDELT (Global Database of Events, Language, and Tone) is a massive database that monitors news media from nearly every corner of the world in over 100 languages. Google’s BigQuery provides free access to the GDELT database. This means that anyone can use BigQuery to analyze the entire GDELT database, making it a powerful tool for understanding global human society.
How can I access GDELT through BigQuery?
To access GDELT through BigQuery, you need to have a Google Cloud account. Once you have an account, you can go to the BigQuery console. From there, you can query the GDELT dataset directly using SQL commands. The GDELT data is updated every 15 minutes, so you can always access the most recent data.
What is the cost of using BigQuery?
BigQuery charges for data storage, streaming inserts, and for querying data. However, there is a free tier available. The BigQuery Sandbox allows users to use BigQuery for free, up to a certain limit. The free tier includes 1 TB of querying per month and 10 GB of storage.
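To get a feel for how far the free query allowance goes, here is a rough Python sketch of the arithmetic. The 1 TB monthly allowance comes from the answer above; the price per terabyte is deliberately left as a parameter, the `5.0` in the example is only a placeholder, and treating 1 TB as 2^40 bytes is an assumption, so check the current pricing page before relying on the result.

```python
TB = 1024 ** 4  # one terabyte treated as 2**40 bytes (an assumption)
GB = 1024 ** 3


def billable_query_bytes(bytes_scanned_this_month, free_bytes=1 * TB):
    """Bytes left to pay for once the monthly free allowance is used up."""
    return max(0, bytes_scanned_this_month - free_bytes)


def estimated_query_cost(bytes_scanned_this_month, price_per_tb, free_bytes=1 * TB):
    """Rough monthly query cost; price_per_tb is supplied by the caller."""
    billable = billable_query_bytes(bytes_scanned_this_month, free_bytes)
    return billable / TB * price_per_tb


# Example: 40 queries in a month, each scanning 50 GB.
scanned = 40 * 50 * GB
print("billable TB:", billable_query_bytes(scanned) / TB)
print("estimated cost:", estimated_query_cost(scanned, price_per_tb=5.0))  # placeholder price
```

Anything at or below the allowance comes out as zero, which is why casual exploration of GDELT is unlikely to cost anything.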
How can I optimize my queries in BigQuery?
There are several ways to optimize your queries in BigQuery. One way is to use the SELECT statement to specify only the columns you need. Another way is to use partitioned tables, which can reduce the amount of data read by a query, reducing costs and increasing speed. You can also use the LIMIT clause to limit the amount of data returned by a query, though it generally does not reduce the amount of data the query scans.
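Selecting fewer columns matters because BigQuery stores data by column and reads only the columns a query references. The following Python sketch estimates the difference; the per-row byte sizes are hypothetical values chosen for illustration, not measurements of the real GDELT table, and `estimated_scan_bytes` is a made-up helper.

```python
# Hypothetical average bytes per row for a few GDELT-style columns.
AVG_BYTES_PER_ROW = {
    "GLOBALEVENTID": 8,
    "Year": 8,
    "Actor1Name": 20,
    "Actor2Name": 20,
    "SOURCEURL": 90,
}


def estimated_scan_bytes(columns, row_count, widths=AVG_BYTES_PER_ROW):
    """Estimate bytes read when a query references only `columns`."""
    unknown = [c for c in columns if c not in widths]
    if unknown:
        raise KeyError(f"no size estimate for: {unknown}")
    # A column counts once no matter how often the query mentions it.
    return row_count * sum(widths[c] for c in set(columns))


rows = 250_000_000  # roughly the quarter-billion records in the events table
everything = estimated_scan_bytes(list(AVG_BYTES_PER_ROW), rows)
two_columns = estimated_scan_bytes(["Year", "Actor1Name"], rows)
print(f"all columns: {everything / 1e9:.1f} GB, two columns: {two_columns / 1e9:.1f} GB")
```

With these made-up sizes, the two-column query reads a fraction of what a query over every column would, and the gap grows with wide text columns such as URLs.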
What kind of data does GDELT track?
GDELT tracks a wide range of data from the world’s news media. This includes events, counts, quotes, images, persons, organizations, themes, sources, and tones. It monitors broadcast, print, and web news from across the globe in over 100 languages.
How can I use BigQuery and GDELT for my research?
BigQuery and GDELT can be used for a wide range of research purposes. For example, you can use it to analyze global trends, track the spread of information, study the impact of events on public sentiment, and much more. The possibilities are virtually limitless.
Can I use my own data with BigQuery?
Yes, you can upload your own data to BigQuery and use it for analysis. BigQuery supports CSV, JSON, Avro, and other formats. You can also use BigQuery Data Transfer Service to automate data movement from SaaS applications to BigQuery.
How secure is my data in BigQuery?
BigQuery is designed with multiple layers of security, including secure data transmission, encryption at rest, and identity and access management. Google also uses several robust security measures to protect your data, including independent audits of their data centers and a dedicated security team.
Can I use BigQuery for real-time analytics?
Yes, BigQuery is designed for real-time analytics. It allows you to analyze real-time data by using its streaming functionality. You can insert up to 100,000 rows of data per second per table, and the data is available for querying within a few seconds.
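When streaming, rows are usually sent in batches rather than one at a time. As a minimal sketch, the Python below splits an iterable of rows into batches and computes a lower bound on how long a load would take under the per-table rate quoted above. The batch size of 500 and both function names are assumptions for illustration, and the code does not call any BigQuery API.

```python
from itertools import islice


def batches(rows, batch_size):
    """Yield lists of at most batch_size rows from any iterable."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, batch_size))
        if not chunk:
            return
        yield chunk


def min_seconds_to_stream(row_count, rows_per_second=100_000):
    """Lower bound on streaming time under a per-table rows-per-second limit."""
    return row_count / rows_per_second


events = ({"id": i} for i in range(1_250))
sizes = [len(chunk) for chunk in batches(events, 500)]
print(sizes)
print(min_seconds_to_stream(1_000_000))
```

Each batch would then be handed to whatever client library you use for the actual insert.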