Ruby and OpenCalais: Transform Data into Information

Fred Heath

A few weeks ago we saw how the OpenCalais semantic web service helped Connor, a budding young journalist, make sense of his data and reduce his workload. In doing so, Connor merely skimmed the surface of the great analytical potential that OpenCalais provides. This week, we’ll explore some of the more intricate capabilities that the service offers. Hold on to your seats for part 2: Data Detection Day!

Setting Up

Once again we’ll be using the DoverToCalais gem, a wrapper round the OpenCalais API with some tricks up its sleeve. You should know the drill by now:

    $ gem install dover_to_calais

You’ll recall that last time we used DoverToCalais, it was with the simple XML format and that was enough to satisfy our tagging needs. DoverToCalais can also be used with the OpenCalais rich JSON Output format which gives us a new wealth of data, mainly inter-entity relations.

When processing data with the rich JSON format, DoverToCalais uses Redis to store a data model of the data it processes. Follow the instructions here and get Redis up and running.

By default, DoverToCalais will use the local Redis instance at 127.0.0.1, port 6379, database #6 (no password). If any of this proves inconvenient, it can easily be changed by modifying the constant DoverToCalais::REDIS.
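
As a rough illustration of what such an override might look like, the sketch below assumes that DoverToCalais::REDIS is a plain options hash of the kind Redis.new accepts (host, port, db). That shape is an assumption, not something confirmed here. DEFAULT_REDIS and redis_options are hypothetical names standing in for the gem's constant, so the snippet runs without the gem or a Redis server.

```ruby
# Hypothetical stand-in for DoverToCalais::REDIS; the real constant's shape
# is assumed, mirroring the defaults described in the article.
DEFAULT_REDIS = { host: '127.0.0.1', port: 6379, db: 6 }.freeze

# Returns a new options hash with any overrides applied on top of the defaults.
# The defaults themselves are never mutated.
def redis_options(overrides = {})
  DEFAULT_REDIS.merge(overrides)
end

# Example: point at another (made-up) host and database, keep the default port.
custom = redis_options(host: 'redis.example.internal', db: 2)
puts custom.inspect
```

Merging into a copy rather than editing the defaults in place keeps the original values available, should you need to switch back.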

Also, once again, don’t forget that DoverToCalais needs a working JRE in order to function properly.

N.B: The rich data analysis functionality is only available from DoverToCalais v0.2.1 onwards, so make sure you have the latest version!

What’s New Pussycat?

As Tom Jones would say, the main difference in DoverToCalais usage when using the rich output is that there’s no longer a need to do our response analysis in the callback (#to_calais method). The callback now only serves to let us know when the response has been processed. Once the callback returns, we know that we can find all our source data nicely modelled in Redis and we can access it outside and independently of our EventMachine create->analyze->callback loop.

The only other difference is that we need to pass a :rich symbol to our #analyze_this method. DoverToCalais will do the rest.

The Model

No, not the classic Kraftwerk track, silly! This is the DoverToCalais data model, a number of Ohm model objects living on a Redis data store. These types of objects will be generated once DoverToCalais processes a data source.

DoverToCalais::EntityModel has the following attributes

  • name – the entity name, e.g. Clark Kent, the Millennium Stadium, etc.
  • type – the entity type, e.g. Person, Location, etc.
  • calais_id – a unique id assigned by OpenCalais
  • relations – a set of generic relations connected to the entity
  • events – a set of events connected to the entity

DoverToCalais::EntityModel::RelationModel has the following attributes

  • subject – the entity applying the action
  • verb – an action
  • object – the entity receiving the action
  • detection – text that captures the essence of the relation
  • calais_id – a unique id assigned by OpenCalais

DoverToCalais::EntityModel::EventModel has the following attributes

  • calais_id – a unique id assigned by OpenCalais
  • info_hash – a Hash created on the fly that incorporates the event’s attributes and values. As the number of attributes depends on the type of event (e.g. MilitaryAction will have very different attributes to MovieRelease), the info_hash is a good way to dynamically encapsulate an event’s attributes.
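
To get a feel for how these three kinds of objects hang together, here is a minimal sketch that uses plain Ruby Structs in place of the real Ohm models. The attribute names follow the lists above, but the Entity, Relation and Event classes and all the sample values are made up for illustration; they are not the gem's classes and nothing here touches Redis.

```ruby
# Plain Structs standing in for the Ohm models (NOT the gem's own classes).
Entity   = Struct.new(:name, :type, :calais_id, :relations, :events, keyword_init: true)
Relation = Struct.new(:subject, :verb, :object, :detection, :calais_id, keyword_init: true)
Event    = Struct.new(:calais_id, :info_hash, keyword_init: true)

# Made-up sample data, for illustration only.
clark = Entity.new(name: 'Clark Kent', type: 'Person', calais_id: 'id-1',
                   relations: [], events: [])

# A relation links a subject to an object through a verb.
clark.relations << Relation.new(subject: { 'name' => 'Clark Kent' },
                                verb: 'visit',
                                object: { 'name' => 'Millennium Stadium' },
                                detection: 'Clark Kent visited the Millennium Stadium',
                                calais_id: 'id-2')

# An event carries a free-form hash, since attributes vary by event type.
clark.events << Event.new(calais_id: 'id-3',
                          info_hash: { 'position' => 'reporter', 'status' => 'current' })

clark.relations.each do |r|
  puts "#{r.subject['name']} -- #{r.verb} -- #{r.object['name']}"
end
```

The subject and object are modelled as hashes with a 'name' key only because the article's later code reads them that way (r.subject['name']).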

Connor’s Comeback

Remember Connor? Well, since he did so well on his last assignment, he has become known round the office as ‘Mr Data Analysis’! His colleague Debbie, an investigative journalist, comes to him with a problem:

“Word on the street is that there are some financial irregularities going on at the Alderwood Housing Authority and that somehow the mayor is involved. But I don’t have any leads on this, so don’t know where to begin. Heck, I don’t even know who Alderwood’s mayor is! You have the data and the skills Connor – can you help?”

“Sure thing!” quips Connor and whips out his trusty editor. “First thing to do is have the data analyzed. I suspect we’ll be needing the rich analysis capability!”

1  require 'dover_to_calais'
2  require 'em/throttled_queue'
3
4  DoverToCalais::flushdb
5   
6  EM.run do
7    # use Control + C to stop the EM
8    Signal.trap('INT')  { EventMachine.stop }
9    Signal.trap('TERM') { EventMachine.stop }
10  
11   DoverToCalais::API_KEY =  'my-opencalais-api-key'
12   data_dir = '/home/connor/data/Alderwood_News/'
13       
14   total_files = Dir[File.join(data_dir, '**', '*')].count { |file| File.file?(file) }
15   queue = EM::ThrottledQueue.new(2, 1)
16
17   dovers = []
18   Dir.foreach(data_dir) do |filename|
19     next if filename == '.' or filename == '..'
20     dover = DoverToCalais::Dover.new(data_dir + filename) 
21     dovers << dover
22     # push the dover on our throttled queue as well
23     queue.push(dover)
24   end
25
26   count = 0
27        
28   dovers.each do |dover|
29     dover.to_calais do |response|
30       if response.error
31         puts "*** Data source #{dover.data_src} error: #{response}"
32       else
33        count += 1
34        puts "finished #{count}"
35        puts "all done!" if (count >= total_files)
36       
37       end #if
38     end #block
39   end #each
40
41   # because we told the queue to pop a maximum of two dovers per second
42   # we're not exceeding the OpenCalais limit so we'll get no errors
43   dovers.length.times { queue.pop  { |dover| dover.analyze_this(:rich)} }
44 end

“That looks just like what you did the first time” says Debbie. “Pretty much”, replies Connor. “The only differences now are that I get the total number of files to be processed (line #14) and then I increment a counter (line #33) every time a file has been analyzed. When my counter reaches the total, I know that everything’s been processed, so DoverToCalais has created its data model and I can start mining the data. Oh, and look how I’m passing an argument to analyze_this too!”

“I see” Debbie nods, “but what’s going on at line #4?” “Oh that”, says Connor, “is something I didn’t really have to do. It’s just that every time you richly analyze something with DoverToCalais, it gets added to the same data store. Me, I like starting with a clean slate, so I told DoverToCalais to clear its data store, that’s all.”

“Now let’s see what’s in all that data!”
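
The counting pattern Connor describes can be tried in isolation. The sketch below uses a hypothetical helper, CompletionTracker (not part of DoverToCalais), that runs a block once the expected number of callbacks has been reported. Plain method calls stand in for the asynchronous to_calais callbacks, so no EventMachine or API key is needed.

```ruby
# Hypothetical helper: runs a block once `total` items have been reported done.
class CompletionTracker
  attr_reader :count

  def initialize(total, &on_complete)
    @total = total
    @count = 0
    @on_complete = on_complete
  end

  # Call once per processed item. The completion block is triggered exactly
  # once, when the running count first reaches the total.
  def done!
    @count += 1
    @on_complete.call if @count == @total
  end
end

finished = false
tracker = CompletionTracker.new(3) { finished = true }

# Simulate three callbacks arriving, one per analyzed file.
3.times do
  tracker.done!
  puts "finished #{tracker.count}"
end
puts 'all done!' if finished
```

Comparing with == rather than >= means the block cannot fire a second time if a stray extra callback arrives, which is a design choice of this sketch rather than of the article's script.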

Debbie the Data Detective

Connor creates a new file:

require 'dover_to_calais'

Ohm.redis = Redis.new(DoverToCalais::REDIS)

“From now on, there’s nothing peculiar about what we do”, says Connor. “We are just manipulating standard Arrays, Hashes and Ohm objects. First, let’s find out who that mayor is.”

all_events = DoverToCalais::EntityModel::EventModel.all.to_a
# keep only the events whose 'position' attribute mentions a mayor
mayors = all_events.select { |v| /[Mm]ayor/.match(v.info_hash['position'].to_s) }

mayors.each do |event|
  puts event.info_hash
end

“Whoa”, cries Debbie when the code is run. “Too many mayors and too many towns! Can we filter a bit more?” “Sure thing”, replies Connor:

mayors = all_events.select { |v| /[Mm]ayor/.match(v.info_hash['position'].to_s) &&
                                 /[Aa]lderwood/.match(v.info_hash['city'].to_s) &&
                                 /current/.match(v.info_hash['status'].to_s) }

mayors.each do |event|
  puts event.info_hash
end

“That’s better”, Debbie exclaims. “I think we can safely assume that Rex Luthor is the Alderwood mayor. Let’s see what we can find on Rex. Can I drive?” With that, Debbie grabs the keyboard and types:

a_set = DoverToCalais::EntityModel.find(name: "Rex Luthor")
if a_set.size == 1
  rex = a_set.first
  # print every relation Rex takes part in
  rex.relations.each do |r|
    puts "#{r.subject['name']} -- #{r.verb} -- #{r.object['name']} --    #{r.detection}"
  end
else
  # reached when there are no matches as well as when there are several
  puts "uh-oh, expected exactly one Rex Luthor but found #{a_set.size}!"
end

“Look at this last line” cries Debbie. “It seems the mayor pushed this Gonzales guy for appointment to the Housing Authority.” “You’re right” says Connor, “but first let me prettify the formatting a bit, this is hurting my eyes”. Connor quickly installs the tabularize gem, adds a require 'tabularize' at the top of the file and then changes the code.

a_set = DoverToCalais::EntityModel.find(name: "Rex Luthor")
if a_set.size == 1
  rex = a_set.first

  # header row, followed by a separator line
  table = Tabularize.new
  table << %w[Subject Verb Object Detection]
  table.separator!

  # one table row per relation
  rex.relations.each do |r|
    table << ["#{r.subject['name']}",
              "#{r.verb}",
              "#{r.object['name']}",
              "#{r.detection}"]
  end
  puts table
else
  # reached when there are no matches as well as when there are several
  puts "uh-oh, expected exactly one Rex Luthor but found #{a_set.size}!"
end

“That looks ready to print” says Debbie, “and I also found my missing link, the lead that connects the mayor to the Housing Authority. I’m sure if I use the same techniques to investigate this guy, I’ll come up with a ton of information about him! Thank you, Connor. I’ll be buying you dinner tonight!”
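
Reusing the same techniques on another lead gets easier if the regular-expression filter Connor wrote for the mayors is wrapped in a small helper. The sketch below is one possible way to do that. filter_info is a hypothetical function, and the sample hashes are made-up stand-ins for the info_hash values that EventModel objects would supply.

```ruby
# Hypothetical helper: keeps the hashes whose values match every pattern.
# `criteria` maps a key to a Regexp; a missing key is treated as an empty string.
def filter_info(hashes, criteria)
  hashes.select do |info|
    criteria.all? { |key, pattern| pattern.match?(info[key].to_s) }
  end
end

# Made-up sample data standing in for EventModel#info_hash values.
events = [
  { 'position' => 'Mayor', 'city' => 'Alderwood',   'status' => 'current' },
  { 'position' => 'mayor', 'city' => 'Springfield', 'status' => 'current' },
  { 'position' => 'Mayor', 'city' => 'Alderwood',   'status' => 'past' },
  { 'position' => 'Clerk', 'city' => 'Alderwood',   'status' => 'current' }
]

criteria = {
  'position' => /mayor/i,
  'city'     => /alderwood/i,
  'status'   => /current/
}

matches = filter_info(events, criteria)
matches.each { |info| puts info }
```

With real data, the first argument could be built with something like all_events.map(&:info_hash), and narrowing or widening a search then only means editing the criteria hash.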

Conclusion

Connor helped out a colleague and got himself a free dinner at the same time. More importantly, he demonstrated how to use appropriate tools to transform senseless, unstructured data into sensible, actionable information. OpenCalais and DoverToCalais are works in progress, improving all the time. Coupled with the power and flexibility of Ruby and its ecosystem, they provide a great toolset for data mining and analysis.
