Exploring Github’s Public Events with PHP and Google BigQuery
If you’ve been following along with my previous articles about Github’s API, you know that Github’s developers are doing their best to ease the pain of interacting with Github data. In this article, we’re going to take a look at the Github public events API and we will build a small demo along the way.
What are Github Public Events?
Github events are interactions made by users, like pushing, merging, creating repositories, etc. It’s like storing the history of Github. However, we are limited to the last 300 events, and that’s a problem if you want to search through the data of the whole history.
Ilya Grigorik wanted to keep track of his open source project, but due to the limitation of the Github public events API, he decided to create the GithubArchive website where he queries the Github API and stores the results. You can query the archive with links like http://data.githubarchive.org/2015-01-01-15.json.gz
and you’ll get an archive containing the JSON payload. This solution is not really efficient and doesn’t help with querying the history for data, so, when Google BigQuery was released, Grigorik moved all of his data to the new storage and made it public.
Why BigQuery?
Google BigQuery was created to solve the queries latency problem. When you’re dealing with big data, you need to have the right hardware and infrastructure to handle the load properly. Now, you have the opportunity to move your append-only database to Google’s infrastructure and see millions of data rows processed in seconds.
If you’re not familiar with Google APIs, be sure to check this guide about starting a new Google API project. After that, be sure to enable the BigQuery API on your Google developers console.
Setting Up the Environment
I will be using Laravel as my framework of choice, and Homestead Improved for my VM. We will grab our Google credentials from the config/google.php
file, and to follow Laravel’s security conventions I will store those credentials inside the .env
file.
// config/google.php
return [
'app_name' => env('app_name'),
'client_id' => env('client_id'),
'client_secret' => env('client_secret'),
'api_key' => env('api_key')
];
// .env
APP_DEBUG=true
app_name='Optional app name'
client_id='CHANGE IT'
client_secret='CHANGE IT'
api_key='CHANGE IT'
I will also create a new Service Provider to hold our Google services bindings.
// app/Providers/BigQueryServiceProvider.php
class BigQueryServiceProvider extends ServiceProvider
{
public function register()
{
$this->app->bind('GoogleClient', function () {
$googleClient = new \Google_Client();
$googleClient->setAccessToken(\Session::get("token"));
return $googleClient;
});
$this->app->bind('bigquery', function () {
$googleClient = \App::make('GoogleClient');
$bigquery = new \Google_Service_Bigquery($googleClient);
return $bigquery;
});
}
}
// config/app.php
// ...
'providers' => [
// ...
'App\Providers\BigQueryServiceProvider',
]
Since our application will require the user authorization to query the API, I tried to make the process easier by creating a Middleware for guarding our routes.
// app/Http/Middleware/GoogleLogin.php
class GoogleLogin
{
public function handle($request, Closure $next)
{
$ga = \App::make('\App\Services\GoogleLogin');
if (!$ga->isLoggedIn()) {
return redirect('login');
}
return $next($request);
}
}
// app/Kernel.php
protected $routeMiddleware = [
// ...
'google_login' => '\App\Http\Middleware\GoogleLogin'
];
Google BigQuery Query Console
Before we start working with BigQuery using the PHP library, lets start exploring the BigQuery Query console. The console has a Compose Query
button to let you write your queries. Underneath, we have a Query History
link which shows our queries history. Since our queries will take a while to respond, we can execute a query as a job and ask for its stats after a while – the Job History
link holds the list of the current jobs.
If you want to take a look at the table schema and explore the available fields, you can navigate to the left panel and select the Github timeline table.
Google BigQuery API and Github
As an example, let’s try to retrieve the top ten popular projects on Github. I will define popularity as the number of forks. The Google_Service_Bigquery
class is responsible for all types of interactions with the Google BigQuery API. I previously introduced the term jobs
, which means that we can execute a query in the background and we can fetch its status over time to see if it completed or not.
// app/Http/routes.php
Route::any('/topTen', ['middleware' => 'google_login', 'as' => 'topTen', 'uses' => 'BigQueryAPIController@topTen']);
// app/Http/Controllers/BigQueryAPIController.php
class BigQueryAPIController extends Controller
{
public function __construct()
{
$this->bigqueryService = \App::make("bigquery");
}
public function topTen()
{
$projectID = 'modular-robot-22';
$query_str = "SELECT repository_url, MAX(repository_forks) as max_forks" .
" FROM githubarchive:github.timeline" .
" GROUP EACH BY repository_url" .
" ORDER BY max_forks DESC" .
" LIMIT 10";
$query = new \Google_Service_Bigquery_QueryRequest();
$query->setQuery($query_str);
$result = $this->bigqueryService->jobs->query($projectID, $query);
$fields = $result->getSchema()->getFields();
$rows = $result->getRows();
dump($fields);
foreach ($rows as $row) {
dump($row->getF());
}
}
}
The project ID can be found on the Google Developers Console. The query string is fetching the respository_url
, and MAX(repository_forks)
. The getSchema
method returns a Google_Service_Bigquery_TableSchema
instance which contains the selected column names from the response, and the getRows
method returns an iterable list of Google_Service_Bigquery_TableRow
. The output looks like the following.
The githubarchive:github.timeline
dataset has been deprecated and won’t be updated anymore. The new way to work with data is through a specific time range (year dataset, month dataset, day dataset); you can visit githubarchive to learn more about this. Now we must update our query above to work with last year’s data.
// app/Http/Controllers/BigQueryAPIController.php
public function topTen()
{
$projectID = 'modular-robot-22';
$query_str = "SELECT repository_url, MAX(repository_forks) as max_forks" .
" FROM githubarchive:year.2014" .
" GROUP EACH BY repository_url" .
" ORDER BY max_forks DESC" .
" LIMIT 10";
//...
}
Top Ten Repositories by Language
We will adapt the previous query to pull repositories of a specific language. Our view markup will contain a form and a set of language options.
// resources/views/home.blade.php
<form action="/" method="POST" class="form-inline">
<label for="language">Language: </label>
<select name="language" id="language" class="form-control">
<option value="PHP">PHP</option>
<option value="JavaScript">JavaScript</option>
<option value="CSS">CSS</option>
<option value="Python">Python</option>
</select>
<input type="submit" class="form-control" value="Submit"/>
</form>
When the user submits the form, we retrieve the selected language and fill the condition in our SQL query. Because we will print our query result as a table, we will split the table header part from the table body.
// app/Http/Controllers/BigQueryAPIController.php
public function topTen()
{
if (!\Input::has('language')) {
return \View::make('home', ['repos' => null]);
}
$language = \Input::get('language', 'PHP');
$projectID = 'modular-robot-647';
$query_str = "SELECT repository_url, MAX(repository_forks) as max_forks" .
" FROM githubarchive:year.2014" .
" WHERE repository_language='$language'" .
" GROUP EACH BY repository_url" .
" ORDER BY max_forks DESC" .
" LIMIT 10";
$query = new \Google_Service_Bigquery_QueryRequest();
$query->setQuery($query_str);
$result = $this->bigqueryService->jobs->query($projectID, $query);
// getting the table header
$fields = $result->getSchema()->getFields();
// query response rows
$repos = $result->getRows();
return \View::make('home', ['repos' => $repos, 'tableHeader' => $fields]);
}
The only thing changed from the previous query is the where
condition, which filters languages by user input. When printing the result table I tried to keep things generic for most queries.
// resources/views/home.blade.php
@if($repos)
<table class="table table-hover">
<tr>
@foreach($tableHeader as $item)
<th>{{$item->getName()}}</th>
@endforeach
</tr>
@foreach($repos as $repo)
<tr>
@foreach($repo->getF() as $item)
<td>{{$item->getV()}}</td>
@endforeach
</tr>
@endforeach
</table>
@else
<h2>Nothing to show</h2>
@endif
The getF()
method returns the row content, while the getV()
method returns the cell content. You can inspect result values by dumping the class names and going through properties.
It looks like CodeIgniter was on top when I tested this, but this is not the case at the time of writing – Laravel is the most popular PHP framework on Github. Make sure to check the Github trending repositories page to see almost realtime data.
If you chose to work with data added post 01/01/2015
, the scheme will also contain nested records plus a JSON encoded payload field. This applies to working with the day datasets as well – you can use the Google BigQuery console to inspect the dataset schema. BigQuery has some useful JSON functions to work with JSON encoded data, be sure to check them out.
Conclusion
In this article, we introduced the Github Public Events and how you can use them through BigQuery. The final demo includes a basic usage example, but you may want to experiment with the API and get yourself ready for the annual Github data challenge with some cool Github data facts. What do you think about the Github Events API? Do you have any cool ideas for the data challenge? Let us know!