Code Safari: TWSS and Bayesian Classification of Twitter Searches

Perhaps my favourite usage of Ruby is as plumbing—small scripts to connect larger libraries in a way that does something interesting. Being able to whip these scripts up in an evening is an essential skill.

Today I came across an example of such plumbing in the twss gem.

Because automation knows no bounds… including lowbrow comedy.
TWSS is a simple Bayes classifier trained off of a Twitter #twss search.

twss is exactly the sort of project that gives life and character to an ecosystem. Let’s see how it works.

Finding an entry

The documentation gives us an obvious entry point for exploring the code, since the only public interface is a single method:

require 'twss'
TWSS("hey, did you resolve that ticket?") # => false
TWSS("not yet, it's taking a while") # => false
TWSS("well hurry up, you're not going fast enough") # => true

A capitalized method name looks odd in Ruby code, but it is perfectly valid. It is typically done to “re-use” an existing class or module name to provide default behaviour (Hpricot is another gem that does this). Looking inside lib/twss.rb (the required file is usually a good place to start), we see this cute trick: you can indeed have a method with the same name as a module.

# lib/twss.rb
module TWSS
  # ...
end

def TWSS(str)
  TWSS.classify(str)
end
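
The trick is easy to reproduce in isolation. Below is a minimal sketch using a made-up Shout module (not part of twss). Ruby's own Kernel#Integer and the Integer class coexist in the same way.

```ruby
module Shout
  # The "real" implementation lives on the module.
  def self.call(str)
    "#{str.upcase}!"
  end
end

# A top-level method sharing the module's name; it simply delegates.
def Shout(str)
  Shout.call(str)
end

puts Shout("hello")   # parsed as a method call because of the argument list
puts Shout.class      # a bare Shout is a constant lookup, so this is the module
```

Ruby tells the two apart by syntax: with an argument list the name is resolved as a method, without one it is resolved as a constant.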

That is a distraction from our core mission though. Let’s move on to the guts of the program. Finding the classify method that the TWSS method delegates to (using your favourite find-in-files tool), we uncover the first major subsystem of this plumbing operation: the classifier gem.

# lib/twss/engine.rb
require 'classifier'

def initialize(options = {})
  @data_file = options[:data_file] || DATA_FILE
  @threshold ||= options[:threshold] || 5.0
  @classifier = load_classifier_from_file!(@data_file) || new_classifier
end

def classify(str)
  if basic_conditions_met?(str)
    c = @classifier.classifications(str)
    c[TRUE] - c[FALSE] > threshold
  else
    false
  end
end

def new_classifier
  Classifier::Bayes.new(TRUE, FALSE)
end

From the gem’s home page:

Classifier is a general module to allow Bayesian and
other types of classifications.

This is fairly self-explanatory code. twss hasn’t implemented any classifying logic itself, instead delegating to an existing framework to do the heavy lifting. This is a powerful technique, and should be used as often as possible. The best programmers know how to let others do the programming for them.

This is only one half of the equation though. Given that twss provides a pre-trained data file for classifier, we need to find out how it created this file in the first place. It is fairly obvious where this code resides: the only two code files we haven’t looked at yet are named trainer.rb and tweet_collector.rb.
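
The body of load_classifier_from_file! is not shown above, so how twss actually reads its data file is not covered here. As a rough sketch of the general idea, the snippet below saves and restores a trained model with Ruby's built-in Marshal. The save_model and load_model helpers are hypothetical, and a plain Hash stands in for the classifier, on the assumption that the real object is Marshal-friendly.

```ruby
require 'tmpdir'

# Hypothetical helper: write any Marshal-friendly object to disk.
def save_model(model, path)
  File.open(path, 'wb') { |f| Marshal.dump(model, f) }
end

# Hypothetical helper: read the object back, or return nil if the file
# does not exist, so callers can write `load_model(path) || build_new`.
def load_model(path)
  return nil unless File.exist?(path)
  File.open(path, 'rb') { |f| Marshal.load(f) }
end

# Stand-in for a trained classifier: word counts per category.
model = { "1" => { "hard" => 4 }, "0" => { "ticket" => 2 } }

restored = nil
missing = nil
Dir.mktmpdir do |dir|
  path = File.join(dir, 'classifier.dump')
  missing = load_model(path)   # nothing saved yet
  save_model(model, path)
  restored = load_model(path)
end

puts restored == model
```

Returning nil for a missing file matches the `load_classifier_from_file!(@data_file) || new_classifier` fallback in the initializer. Marshal.load should only be used on files you trust.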

Fetching the data

Let’s start with Twitter. Being able to collect and process tweets is a particularly handy technique to have in your toolbox.

# lib/twss/tweet_collector.rb
require 'twitter'

# ...

def run
  o = File.open(filename, 'a')
  page, per_page = 1, 100
  begin
    Twitter::Search.new(search).per_page(per_page).page(page).each do |tweet|
      puts tweet.text
      o.puts tweet.text
    end
    page += 1
    sleep 2
  end while page * per_page < limit
  o.close
end

This suggests a simple API we can use for our own Twitter queries. Jump into irb and we can have a play around.

$ irb -rtwitter
irb> Twitter::Search.new('#ruby').each {|x| puts x.text }
TypeError: can't convert String into Hash
  from twitter-1.2.0/lib/twitter/api.rb:13:in `merge'

… well that’s awkward. It appears the twss code is broken. This type of thing happens often in the fast-moving world of Ruby, and the first step is not to panic. The last commit to twss was in August 2010, so it is very possible the twitter gem has changed its API since then. In a perfect world twss would have specified a narrower range of versions of this gem that it was known to work with (typically pinned to a major and a minor version), but in this case we’ll have to do some research ourselves.
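
For reference, this is roughly what such a constraint looks like. The "~> 0.9" requirement and the 0.9.12 version below are illustrative assumptions, since we don't know which twitter release twss was written against. The sketch uses RubyGems' own classes to check which versions a pessimistic constraint would admit.

```ruby
require 'rubygems' # loaded by default in modern Ruby; explicit for clarity

# In a gemspec this would be written as something like:
#   spec.add_dependency 'twitter', '~> 0.9'
# "~> 0.9" means ">= 0.9 and < 1.0". The number itself is a guess.
requirement = Gem::Requirement.new('~> 0.9')

%w[0.9.12 1.2.0].each do |v|
  allowed = requirement.satisfied_by?(Gem::Version.new(v))
  puts "twitter #{v}: #{allowed ? 'allowed' : 'rejected'}"
end
```

With a constraint like this, installing twss would have pulled in a compatible twitter release instead of the 1.2.0 that produced the error above.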

I searched GitHub for “twitter” to find the source repository for the gem. The README shows that version 1.0 broke a lot of backwards compatibility with earlier versions. In particular, this change looks directly relevant to our code:

The Twitter::Search class has remained largely the same, however
it no longer accepts a query in its constructor. You can specify
a query using the #containing method, which is aliased to #q.

# Pre-1.0
Twitter::Search.new("query").fetch.first.text
# Post-1.0
Twitter::Search.new.q("query").fetch.first.text

Let’s try our search again with this new information.

irb> Twitter::Search.new.q('#ruby').each {|x| puts x.text }
#Ruby 1.9 is fast than #Python 2.7 so much................
The @bronxzooscobra was a big hit, maybe I'll do one for the snake in my house. Maybe #Zeus and #Ruby need their own twitter accounts too.
Amanhã tem curso de #ruby com os mestres @caironoleto e @cleitonfco, acho que seria melhor manerar no café. #not
RT @haacked: What's the gold standard in OSS project documentation? #Ruby #RoR #Python #PHP #Linux #GSoC2011
ANTIQUE RETRO DIAMOND 1.8ct RUBY WIDE RING 1940 size 10.5 #ring #ruby #diamond #gold #antique http://w.sns.ly/BQc0y4
http://www.pulist.net/max-amp-rubys-four-seasons-max-and-ruby.html #max #and #ruby #different Max &amp;amp; Ruby's Four Seasons (Max and Rub
#hippa #pci #security #ruby do you guys have any ideas for #securesmsvoting. ?  Does @americanidol do it?  Let's scale it to the world.
How many people have access to a cell phone for #voting?  In the world? 75%? SecureSMS voting! #ruby #sms #twitter @twitter #g8 #un #egypt
I'm watching @AJEnglish. Can't we fix world troubles with software?  Secure voting for everyone.  @AJListeningPost #sysadmin #ruby #facebook
Wow, #Ruby'sMoney kicks ass!!!
ICMembers -  Firmen-Network http://bit.ly/bjsPkl #Firma #Network #Skript #Software #Mitgliedschaft #Ruby #On #Rails #2.0.2 #MySQL
#Ruby Engagement Rings : Shop online Ruby Engagement Rings for Low Price. Compare Prices on Ruby Engagement Rings. find Sale items and more.
Watchr – More Than An Automated Test Runner http://tinyurl.com/3vsvklm #ruby
All this @antirez talk re #ruby perf has finally made me play w/ sinatra. Very cool, like #mojolicious which I love. Is mojo a #perl clone?
RT @ruby2itter: georgi/rack_dav - GitHub: HTTPGit Read-OnlyThis URL has Read+Write accessDismissOctotip: You've activated ... http://bit.ly/fDDw7o #ruby

Beautiful. We have learned a dirt-simple way to get tweets from Ruby code.
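
The paging loop in the collector is worth a second look, because the same pattern applies to any paged API. The sketch below is not the twss code: it swaps the Twitter call for a hypothetical in-memory fetch_page so it runs offline, and it stops either when the limit is reached or when a page comes back empty.

```ruby
require 'stringio'

# Hypothetical stand-in for a paged search API.
FAKE_RESULTS = (1..250).map { |i| "tweet number #{i}" }

# Returns one page of results, or an empty array past the last page.
def fetch_page(page, per_page)
  FAKE_RESULTS.each_slice(per_page).to_a[page - 1] || []
end

# Writes up to `limit` results to `out`, one page at a time, and
# returns how many lines were written.
def collect(out, limit:, per_page: 100)
  written = 0
  page = 1
  while written < limit
    results = fetch_page(page, per_page)
    break if results.empty?
    results.first(limit - written).each do |text|
      out.puts text
      written += 1
    end
    page += 1
  end
  written
end

buffer = StringIO.new
count = collect(buffer, limit: 120, per_page: 50)
puts count
```

Against a real API you would keep the pause between pages that the original has (sleep 2), and pass a file opened with the block form of File.open so it is closed even if a request fails.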

Training the classifier

We have a list of tweets, and we have code to classify any given string, but we are still missing a step in the middle: training our classifier. This would make an article in itself, but the basic concept is that we need to give the classifier a set of sample data for each of our categories. The classifier can then use this information to estimate how probable it is that a given text belongs to each of the groups. The code for this is in the remaining file, trainer.rb.

# lib/twss/trainer.rb
# ...
puts "Training NON-TWSS strings..."
File.read(File.join(path, 'non_twss.txt')).each_line do |l|
  engine.train(TWSS::Engine::FALSE, strip_tweet(l))
end

puts "Training TWSS strings..."
File.read(File.join(path, 'twss.txt')).each_line do |l|
  engine.train(TWSS::Engine::TRUE, strip_tweet(l))
end
# ...

In this case there are only two categories (FALSE and TRUE, defined as ‘0’ and ‘1’ in engine.rb). We haven’t seen the files non_twss.txt and twss.txt yet, but if we search for them we find some code in the script/ directory that is populating them with Twitter search results for “:)” and “#twss” respectively.

# script/collect_twss.rb
require File.join(File.dirname(__FILE__), '../lib/twss')
require File.join(File.dirname(__FILE__), '../lib/twss/tweet_collector')

TWSS::TweetCollector.new('#twss', File.join(File.dirname(__FILE__), '../data/twss.txt')).run

Searching for “:)” is an interesting way to generate a known set of non-twss text, but it appears to work adequately for this type of data.
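
To make the training step less of a black box, here is a toy naive Bayes classifier written from scratch. It is an illustration of the concept only, not how the classifier gem is implemented. The TinyBayes class, its method names, and the sample sentences are all made up, and it ignores refinements such as category priors and stemming.

```ruby
# A toy naive Bayes text classifier (illustrative, hypothetical names).
class TinyBayes
  def initialize(*categories)
    # Word counts per category, e.g. { "code" => { "ruby" => 2 } }
    @counts = categories.to_h { |c| [c, Hash.new(0)] }
    @totals = Hash.new(0)
  end

  # Training just counts how often each word appears in each category.
  def train(category, text)
    tokenize(text).each do |word|
      @counts[category][word] += 1
      @totals[category] += 1
    end
  end

  # Returns a log-score per category; closer to zero means more likely.
  def classifications(text)
    vocab = @counts.values.flat_map(&:keys).uniq.size
    @counts.to_h do |category, words|
      score = tokenize(text).sum do |word|
        # Add-one smoothing so unseen words do not produce log(0).
        Math.log((words[word] + 1.0) / (@totals[category] + vocab))
      end
      [category, score]
    end
  end

  def classify(text)
    classifications(text).max_by { |_, score| score }.first
  end

  private

  def tokenize(text)
    text.downcase.scan(/[a-z']+/)
  end
end

bayes = TinyBayes.new("code", "jewellery")
bayes.train("code", "ruby gem install rails")
bayes.train("code", "ruby script and rake task")
bayes.train("jewellery", "ruby ring with diamond and gold")
bayes.train("jewellery", "antique gold ring")

puts bayes.classify("gold diamond ring")
puts bayes.classify("install the gem")
```

The scores are sums of logarithms of probabilities, so they are always negative. A longer text accumulates more terms and so a larger negative score.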

Putting It Together

We fetched a list of tweets featuring the #ruby tag, but many of them were about jewellery rather than code. Let’s combine the two libraries we have just learned about, classifier and twitter, to try to filter out tweets that aren’t about code. The hardest part will be training our classifier with solid data for the “jewelery” and “programming” categories. Let’s try the same technique used by twss as a starting point, and simply use a Twitter search for “#jewelery” and “#programming”.

require 'twitter'
require 'classifier'

file_name = "classifier.dump"
categories = %w(jewelery programming)

classifier = Classifier::Bayes.new(*categories)
categories.each do |category|
  Twitter::Search.new.q("##{category}").per_page(500).each do |x|
    classifier.train(category, x.text)
  end
end

Twitter::Search.new.q("#ruby").per_page(25).each do |x|
  puts x.text
  puts "  => #{classifier.classifications(x.text)}"
  puts "  => #{classifier.classify(x.text)}"
  puts
end

We have combined Twitter and a Bayesian classifier in just twenty lines of code. The results are promising for a first attempt, though clearly not perfect:

RT @haacked: What's the gold standard in OSS project documentation? #Ruby #RoR #Python #PHP #Linux #GSoC2011
  => {"Jewelery"=>-48.675845247282666, "Programming"=>-46.948563939601286}
  => Programming

ANTIQUE RETRO DIAMOND 1.8ct RUBY WIDE RING 1940 size 10.5 #ring #ruby #diamond #gold #antique http://w.sns.ly/BQc0y4
  => {"Jewelery"=>-52.61580762140005, "Programming"=>-55.656377490626184}
  => Jewelery

How many people have access to a cell phone for #voting?  In the world? 75%? SecureSMS voting! #ruby #sms #twitter @twitter #g8 #un #egypt
  => {"Jewelery"=>-70.68382032808533, "Programming"=>-73.42867953661468}
  => Jewelery

I'm watching @AJEnglish. Can't we fix world troubles with software?  Secure voting for everyone.  @AJListeningPost #sysadmin #ruby #facebook
  => {"Jewelery"=>-70.68382032808533, "Programming"=>-66.52092425763254}
  => Programming

#Ruby Engagement Rings : Shop online Ruby Engagement Rings for Low Price. Compare Prices on Ruby Engagement Rings. find Sale items and more.
  => {"Jewelery"=>-94.997002062587, "Programming"=>-86.66119916641394}
  => Programming

All this @antirez talk re #ruby perf has finally made me play w/ sinatra. Very cool, like #mojolicious which I love. Is mojo a #perl clone?
  => {"Jewelery"=>-107.46543897258468, "Programming"=>-101.50632914155935}
  => Programming
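
Note how close some of these scores are. The twss engine only answers true when one score beats the other by more than a threshold (5.0 by default), and the same idea can be applied here. The confident_category helper below is hypothetical. The scores are copied from the output above.

```ruby
# Hypothetical helper: returns the winning category only when it beats
# the runner-up by more than `threshold`, otherwise nil ("not sure").
def confident_category(scores, threshold = 5.0)
  (best, best_score), (_, second_score) = scores.sort_by { |_, s| -s }
  best if best_score - second_score > threshold
end

# Scores taken from the sample output above.
haacked = { "Jewelery" => -48.675845247282666, "Programming" => -46.948563939601286 }
antirez = { "Jewelery" => -107.46543897258468, "Programming" => -101.50632914155935 }

p confident_category(haacked)   # margin of about 1.7, below the threshold
p confident_category(antirez)   # margin of about 6.0, above the threshold
```

A margin is not a cure-all: the engagement rings tweet is labelled Programming with a margin of more than 8, so it would still be confidently wrong.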

From here we could try applying some of the data quality algorithms twss uses, such as stripping out non-words and excluding short tweets, or we could look at improving our underlying data for the training set.
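
As a starting point for the first option, here is one possible clean-up pass. The implementation of twss's strip_tweet is not shown in this article, so the clean_tweet and long_enough? helpers below are guesses at the general idea rather than a copy of it.

```ruby
# Hypothetical cleaner: drops URLs, @mentions, "RT" markers, digits and
# punctuation, keeps letters (including accented ones) and apostrophes.
def clean_tweet(text)
  text.gsub(%r{https?://\S+}, ' ')
      .gsub(/@\w+:?/, ' ')
      .gsub(/\bRT\b/, ' ')
      .gsub(/[^\p{L}' ]/, ' ')
      .squeeze(' ')
      .strip
      .downcase
end

# Hypothetical filter: very short tweets carry little signal.
def long_enough?(text, min_words = 3)
  text.split.size >= min_words
end

raw = "RT @haacked: What's the gold standard? #Ruby http://bit.ly/x"
cleaned = clean_tweet(raw)
puts cleaned
puts long_enough?(cleaned)
```

The same cleaning would need to be applied both when training and when classifying, otherwise the two word lists would not match.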

The ability to quickly and efficiently learn and use new libraries is a valuable skill to cultivate. Practice at combining those libraries, even for trivial applications, will make you a better programmer.

Here are some ideas for further practice:

  • Investigate other sets of training data for our ruby tweet identifier.
  • Fork and patch the twss gem to work with the latest version of the twitter gem.

Let us know how you go in the comments.


  • http://brianthecoder.com brianthecoder

    I’ve found this on GitHub https://github.com/livingsocial/ankusa and like it a bit better than the classifier gem. It has multiple backend persistence options and it’s pretty easy to add your own. It also calculates the priorities in a more proper way.

  • http://www.tweetdynamics.com Abhi

    Hi,

    Interesting article! What is the accuracy of classification of the tweets?
    Did you try it on other classes besides “jewellery” and “programming”?
    I have a site – http://www.tweetdynamics.com that classifies tweets into various topics. However I still need to be able to support contextual analysis in the tweets to improve the accuracy of the classification.

    Abhi