The GDELT Project: the Largest Open-Access Database on Human Society & the Largest Open-Access Database on Worldwide News Media

5/29/2014

On May 29, 2014, Kalev H. Leetaru announced on the Google Cloud Platform Blog that the entire quarter-billion-record GDELT Event Database (Global Data on Events, Location and Tone) was available as a public dataset in Google BigQuery. The database contained records beginning in 1979 and drew on worldwide news media monitored in over 100 languages.

He wrote:

"BigQuery is Google’s powerful cloud-based analytical database service, designed for the largest datasets on the planet. It allows users to run fast, SQL-like queries against multi-terabyte datasets in seconds. Scalable and easy to use, BigQuery gives you real-time insights about your data. With the availability of GDELT in BigQuery, you can now access realtime insights about global human society and the planet itself!

"You can take it for a spin here. (If it's your first time, you'll have to sign-up to create a Google project, but no credit card or commitment is needed).

"The GDELT Project pushes the boundaries of “big data,” weighing in at over a quarter-billion rows with 59 fields for each record, spanning the geography of the entire planet, and covering a time horizon of more than 35 years. The GDELT Project is the largest open-access database on human society in existence. Its archives contain nearly 400M latitude/longitude geographic coordinates spanning over 12,900 days, making it one of the largest open-access spatio-temporal datasets as well.

"From the very beginning, one of the greatest challenges in working with GDELT has been in how to interact with a dataset of this magnitude. Few traditional relational database servers offer realtime querying or analytics on data of this complexity, and even simple queries would normally require enormous attention to data access patterns and intricate multi-column indexing to make them possible. Traditional database servers require the creation of indexes over the most-accessed columns to speed queries, meaning one has to anticipate apriori how users are going to interact with a dataset. 

"One of the things we’ve learned from working with GDELT users is just how differently each of you needs to query and analyze GDELT. The sheer variety of access patterns and the number of permutations of fields that are collected together into queries makes the traditional model of creating a small set of indexes impossible. One of the most exciting aspects of having GDELT available in BigQuery is that it doesn’t have the concept of creating explicit indexes over specific columns – instead you can bring together any ad-hoc combination of columns and query complexity and it still returns in just a few seconds. This means that no matter how you access GDELT, what columns you look across, what kinds of operators you use, or the complexity of your query, you will still see results pretty much in near-realtime. 

"For us, the most groundbreaking part of having GDELT in BigQuery is that it opens the door not only to fast complex querying and extracting of data, but also allows for the first time real-world analyses to be run entirely in the database. Imagine computing the most significant conflict interaction in the world by month over the past 35 years, or performing cross-tabbed correlation over different classes of relationships between a set of countries. Such queries can be run entirely inside of BigQuery and return in just a handful of seconds. This enables you to try out “what if” hypotheses on global-scale trends in near-real time.

"On the technical side, BigQuery is completely turnkey: you just hand it your data and start querying that data – that’s all there is to it. While you could spin up a whole cluster of virtual machines somewhere in the cloud to run your own distributed clustered database service, you would end up spending a good deal of your time being a systems administrator to keep the cluster working and it wouldn’t support BigQuery’s unique capabilities. BigQuery eliminates all of this so all you have to do is focus on using your data, not spending your days running computer servers. 

"We automatically update the public dataset copy of GDELT in BigQuery every morning by 5AM ET, so you don’t even have to worry about updates – the BigQuery copy always has the latest global events. In a few weeks when GDELT unveils its move from daily updates to updating every 15 minutes, we’ll be taking advantage of BigQuery’s new stream updating capability to ensure the data reflects the state of the world moment-by-moment.

"Check out the GDELT blog for future posts where we will showcase how to harness some of BigQuery’s power to perform some pretty incredible analyses, all of them running entirely in the database system itself. For example, we’re particularly excited about the ability to use features like BigQuery’s new Pearson correlation support to be able to search for patterns across the entire quarter-billion-record dataset in just seconds. And we can’t wait to see what you do with it. . . ." 
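
The Pearson correlation mentioned at the end of Leetaru's post measures how closely two series rise and fall together, on a scale from -1 to 1. The following is a minimal Python sketch of that calculation, not BigQuery's implementation or a query from the GDELT blog. The function name `pearson` and the two series of monthly event counts are hypothetical, invented purely for illustration.

```python
import math


def pearson(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length, at least 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the mean (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


# Invented monthly event counts for two hypothetical country pairs.
pair_a = [12, 15, 11, 20, 25, 30]
pair_b = [8, 9, 7, 14, 16, 21]

r = pearson(pair_a, pair_b)
print(f"Pearson r = {r:.3f}")
```

A value near 1 would suggest that the two invented series move together month by month, a value near -1 that they move in opposite directions, and a value near 0 that there is little linear relationship.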

In April 2013 Leetaru and Philip A. Schrodt, co-developer of the project, presented an illustrated paper on GDELT at the International Studies Association meetings held in San Francisco: GDELT: Global Data on Events, Location and Tone, 1979-2012.
