Creating Machine Learning Systems with JRuby


All the different programming languages out there seem to be a better fit for machine learning tasks than Ruby, right? Python has scikit-learn, Java has Weka, and there’s Shogun for machine learning in C++, just to name a few. On the other hand, Ruby has an excellent reputation for fast prototyping.

So, why shouldn’t you prototype machine learning systems with Ruby? Challenge accepted! In this tutorial, we will build a system that can automatically categorize BBC sports articles for you.

Oh, and we’ll do it in Ruby, OK? Well, that’s not entirely true: we will use JRuby and Java’s Weka library via the weka gem.

Preparation

First, install JRuby v9.0.0.0+. Then create an ml_with_jruby directory and put the following Gemfile into it:

source 'https://rubygems.org'

# use your JRuby version here
ruby '2.3.1', engine: 'jruby', engine_version: '9.1.5.0'

gem 'weka'    # this provides us with the weka lib
gem 'scalpel' # used for text processing

In your JRuby environment, run bundle install to install the gems.

Next, download the free dataset of BBC sport articles and move the unpacked article directories into a ./data/training directory.

Finally, move the last two articles of each sports type into a separate ./data/test directory.

Your project structure should look like this:

└── ml_with_jruby
    ├── data
    │   ├── test
    │   │   ├── athletics
    │   │   ├── cricket
    │   │   ├── football
    │   │   ├── rugby
    │   │   └── tennis
    │   └── training
    │       ├── athletics
    │       ├── cricket
    │       ├── football
    │       ├── rugby
    │       └── tennis
    └── Gemfile
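
If you don’t want to move the test articles by hand, the step could be scripted roughly as follows. This is only a sketch: move_test_articles is a hypothetical helper name, and it assumes that “the last two articles” means the last two .txt files of each directory when sorted by file name.

```ruby
require 'fileutils'

# Hypothetical helper: moves the last `count` .txt files (sorted by name)
# of each article type from the training directory into the test directory.
def move_test_articles(training_dir, test_dir, article_types, count = 2)
  article_types.each do |type|
    target = File.join(test_dir, type.to_s)
    FileUtils.mkdir_p(target)

    files = Dir.glob(File.join(training_dir, type.to_s, '*.txt')).sort
    files.last(count).each { |file| FileUtils.mv(file, target) }
  end
end

# Example call from within the ml_with_jruby directory:
# move_test_articles('data/training', 'data/test',
#                    %w(athletics cricket football rugby tennis))
```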

The texts in the test directory will be our test files and will be classified with our trained classifier.

Wait…training, classification, classifier? Lots of terms here. Let’s have a quick look at what they mean.

What is Classification?

Classification means “labeling given data”. An article could, for instance, be labeled as Tennis or Cricket. These labels, Tennis and Cricket, are called classes. The algorithm that chooses the label for one of our articles is called the classifier.

Now, there are different types of classification problems: supervised and unsupervised. The latter is often referred to as “clustering”: you don’t have any labeled example data and you don’t know beforehand into which classes your algorithm will split your data.

Supervised means we have pre-labeled data, like our labeled articles. We use these labeled articles to train our classifier or in other words: to build a model that can decide on how to categorize new data. After the training, we can pass unlabeled articles to our classifier and it will give us a label for each of them.

That said, the three steps to build a system for supervised classification are:

  • Creating a training dataset from raw data
  • Training the classifier with the training dataset
  • Classifying new data with the trained classifier
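
To make these three steps a bit more tangible, here is a toy example in plain Ruby. It is not what Weka does internally: ToyClassifier is a made-up nearest-centroid classifier and the feature values are invented, but the create/train/classify flow is the same one we will follow with Weka.

```ruby
# Toy nearest-centroid classifier, for illustration only.
class ToyClassifier
  # rows: Array of [feature_values, label] pairs
  def train(rows)
    @centroids = {}
    rows.group_by { |_, label| label }.each do |label, group|
      columns = group.map(&:first).transpose
      # The centroid is the per-feature average of all examples of a label.
      @centroids[label] = columns.map { |c| c.inject(:+).to_f / c.size }
    end
    self
  end

  # Returns the label whose centroid is closest to the given features.
  def classify(features)
    @centroids.min_by { |_, centroid| squared_distance(centroid, features) }.first
  end

  private

  def squared_distance(a, b)
    a.zip(b).map { |x, y| (x - y)**2 }.inject(:+)
  end
end

# Step 1: a (made-up) training dataset of [tennis_hints, cricket_hints], label
training = [
  [[4, 0], :tennis],  [[3, 1], :tennis],
  [[0, 5], :cricket], [[1, 4], :cricket]
]

# Step 2: train the classifier
classifier = ToyClassifier.new.train(training)

# Step 3: classify new, unlabeled data
puts classifier.classify([5, 1]) # prints "tennis"
```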

Creating the Dataset

Let’s start with compiling our training data.

We need some example data to tell our classifier what different article types look like. Computers are smart, but we can’t expect them to take text and have a good gut feeling of what sports it is about. So, the first step is to transform our raw text into a representation with which our classifier can work. A computer should be good with numbers, so we will use a set of numbers that describe the properties of the text (the so-called features).

We have to find some features that can best divide our data into the different article types. We could calculate, for example, the total length of the text or the number of certain keywords in the text. At this step you can be creative, choosing whatever comes to your mind and makes sense. There are feature combinations that work well together, whereas others reduce the performance of the classifier. Once you have a pool of features, you can use algorithms to select the most valuable features. To keep it simple we won’t cover the feature selection in this tutorial and just use our good sense to select a small set of features.

Extracting Features From the Text

We’ll do the feature extraction in a FeatureExtractor class that takes a piece of text and returns a Hash of properties and their numeric representations. Let’s directly process the given text into paragraphs, sentences (using Scalpel), and words. We will need these soon enough as we fill up our features Hash:

# feature_extractor.rb

require 'scalpel'

class FeatureExtractor
  attr_reader :text, :paragraphs, :sentences, :words

  def initialize(text)
    # Work on the stripped text everywhere, not on the raw argument.
    @text       = text.strip
    @paragraphs = @text.split(/\n{2,}/)
    @sentences  = Scalpel.cut(@text)
    @words      = @text.scan(/[\w'-]+/)
  end

  def features
    {} # to be implemented next :)
  end
end

First, we will add some obvious features: count the occurrences of words naming the sport itself, such as “tennis” in an article about tennis, “cricket” in an article about cricket, etc. (note that e.g. “athlet” matches “athletes” as well as “athletic”, and so on).

class FeatureExtractor
  # ...

  def features
    {
      athletics_hints_count: match_count('athlet'),
      cricket_hints_count:   match_count('cricket'),
      football_hints_count:  match_count('football'),
      rugby_hints_count:     match_count('rugby'),
      tennis_hints_count:    match_count('tennis')
    }
  end

  private

  def match_count(word)
    text.scan(/#{word}/i).count
  end
end
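
To see what this kind of substring matching actually counts, here is a small standalone experiment. The two-argument match_count mirrors the method above; word_start_count is a hypothetical, stricter variant that is not part of the tutorial code and only counts matches at the start of a word.

```ruby
# Same matching as in FeatureExtractor, with the text passed in explicitly.
def match_count(text, word)
  text.scan(/#{word}/i).count
end

# Hypothetical variant: only count matches at the beginning of a word.
def word_start_count(text, word)
  text.scan(/\b#{Regexp.escape(word)}/i).count
end

sample = 'Athletes love athletics. The Athletic Club biathlete agrees.'

puts match_count(sample, 'athlet')      # prints 4 ("biathlete" counts, too)
puts word_start_count(sample, 'athlet') # prints 3
```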

It might be interesting to know how many proper nouns, like names and teams, appear in the text. So we’ll add a capitalized_words_count feature.

Articles about e.g. tennis and athletics might be more likely to talk about women than, for example, football articles. As such, we’ll cover this in a feature that scans for male and female keywords and says which appear most often. Let’s call it gender_dominance.

Also, add some more generic text features, like text_length, sentences_count, paragraphs_count, and words_per_sentence_average.

When you read through a couple of our training articles, it seems like some people have more to say than others, so let’s count the quotes in the text, too.

You get the idea. Just try to extract properties that are likely to distinguish one type of article from another.

We will add some additional features that indicate whether it’s more a team or individual sport, by counting the number of hints like pronouns (I, my, me vs. we, our, us) and a number_count feature that might indicate sports where scores or times are important.

With this, we are good for now and we can finish up our extractor class:

class FeatureExtractor
  # ...

  def features
    {
      athletics_hints_count:      match_count('athlet'),
      cricket_hints_count:        match_count('cricket'),
      football_hints_count:       match_count('football'),
      rugby_hints_count:          match_count('rugby'),
      tennis_hints_count:         match_count('tennis'),
      capitalized_words_count:    capitalized_words_count,
      gender_dominance:           gender_dominance,
      text_length:                text.length,
      sentences_count:            sentences.count,
      paragraphs_count:           paragraphs.count,
      words_per_sentence_average: words_per_sentence_average,
      quote_count:                quote_count,
      single_sport_hints_count:   terms_count(%w(I me my)),
      team_sport_hints_count:     terms_count(%w(we us our team)),
      number_count:               number_count
    }
  end

  private

  def match_count(word)
    text.scan(/#{word}/i).count
  end

  def capitalized_words_count
    words.count { |word| word.start_with?(word[0].upcase) }
  end

  def gender_dominance
    terms_count(%w(she her)) > terms_count(%w(he his)) ? 1 : 0
  end

  def terms_count(terms)
    words.count { |word| terms.include?(word.downcase) }
  end

  def words_per_sentence_average
    sentences.count.zero? ? 0 : (words.count / sentences.count)
  end

  def quote_count
    text.scan(/"[^"]+"/).count
  end

  def number_count
    text.scan(/\d+[\.,]\d+|\d+/).count
  end
end
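
It can help to try the individual building blocks on a sample sentence before trusting the numbers. The snippet below is a standalone sketch: the helper names (quotes_in, numbers_in, capitalized?, strictly_capitalized?) are made up for this experiment and only reuse the expressions from the class above. It also shows two things to keep in mind: the capitalization check treats words starting with a digit as capitalized, and Integer division truncates the average.

```ruby
def quotes_in(text)
  text.scan(/"[^"]+"/)
end

def numbers_in(text)
  text.scan(/\d+[\.,]\d+|\d+/)
end

# Same check as in capitalized_words_count.
def capitalized?(word)
  word.start_with?(word[0].upcase)
end

# Hypothetical stricter check: the first character must be an uppercase letter.
def strictly_capitalized?(word)
  !(word =~ /\A[[:upper:]]/).nil?
end

sample = 'He ran 9.58 seconds. "Fast," she said. 3 medals, 1,500 metres.'

p quotes_in(sample).count # => 1
p numbers_in(sample)      # => ["9.58", "3", "1,500"]

p capitalized?('2004')          # => true, since '2'.upcase is still '2'
p strictly_capitalized?('2004') # => false

p 7 / 2 # => 3, Integer division drops the fraction
```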

Compiling the Training Dataset

We want to compile a dataset from our text features and save it as a file so that we can load it later on and train our classifier. We could also do it all in memory, but when storing the dataset as a file we can have a look into it and get a better understanding of what’s actually going on in this step. Weka provides a nice way for doing this with its Weka::Core::Instances class.

In a separate script, load our training texts and extract our features. Create an Instances object out of them and finally store our dataset on disk. Before we start with this, let’s create two more classes, FileLoader and Text, that will nicely abstract our file loading and feature extraction from a given file.

The FileLoader will return all text files from the given data directory:

# file_loader.rb

class FileLoader
  attr_reader :data_directory

  def initialize(data_directory)
    @data_directory = File.expand_path("../#{data_directory}", __FILE__)
  end

  def files_for(article_type)
    Dir.glob("#{data_directory}/#{article_type}/*.txt")
  end
end

Our Text class lets us pass in a text file and get its features, using the FeatureExtractor we created above:

# text.rb

require_relative 'feature_extractor'

class Text
  attr_reader :text

  def initialize(file)
    file_path = File.expand_path(file, __FILE__)

    # There seem to be some invalid UTF-8 byte sequences in the texts,
    # so we replace them here.
    @text = File.read(file_path).encode!('UTF-8', 'UTF-8', invalid: :replace)
  end

  def features
    FeatureExtractor.new(text).features
  end
end
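
If you would rather drop invalid bytes entirely instead of replacing them, Ruby’s String#scrub (available since Ruby 2.1) is one alternative. The following is just a sketch and not what the Text class above does; clean_utf8 is a hypothetical helper name.

```ruby
# Hypothetical helper: interprets the bytes as UTF-8 and removes every
# byte sequence that is not valid UTF-8.
def clean_utf8(raw)
  raw.dup.force_encoding('UTF-8').scrub('')
end

# A binary string containing a stray Latin-1 byte (\xE9).
dirty = "Wimbledon \xE9 final".b
clean = clean_utf8(dirty)

puts clean.valid_encoding? # prints true
puts clean
```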

We can now use these classes to write a script for creating the training dataset. Create a new file called create_dataset.rb. First, create an empty Instances object that represents our training dataset. Add a numeric attribute for each feature and a nominal class attribute. We configure the different article types as possible class values:

# create_dataset.rb

require 'weka'
require_relative 'feature_extractor'
require_relative 'file_loader'
require_relative 'text'

article_types   = %i(athletics cricket football rugby tennis)
attribute_names = FeatureExtractor.new('').features.keys

dataset = Weka::Core::Instances.new.with_attributes do
  attribute_names.each do |name|
    numeric(name)
  end

  nominal(:class, values: article_types, class_attribute: true)
end
# ...

Next, calculate the features for all articles and add them to our instances object:

# ...
def feature_list_for(article_type)
  files = FileLoader.new('data/training').files_for(article_type)

  files.map do |file|
    # Remember that Text#features returns a Hash.
    # We only need the feature values.
    # Since the class value is still missing, we append the
    # article_type as the class value.
    Text.new(file).features.values << article_type
  end
end

article_types.each do |article_type|
  feature_list = feature_list_for(article_type)
  dataset.add_instances(feature_list)
end
# ...
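
The rows we add rely on Ruby Hashes keeping their insertion order: the values come out in the same order as the keys we used for the attribute names. A tiny standalone sketch with made-up numbers (row_for is a hypothetical helper, not part of the script):

```ruby
# Builds one dataset row: the feature values followed by the class value.
# Hash#values returns a new Array, so the features Hash itself stays untouched.
def row_for(features, article_type)
  features.values << article_type
end

features        = { tennis_hints_count: 3, text_length: 1237, quote_count: 2 }
attribute_names = features.keys + [:class]
row             = row_for(features, :tennis)

# Pair every attribute name with its value to check the alignment.
attribute_names.zip(row).each do |name, value|
  puts "#{name}: #{value}"
end
```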

Last, we can save all our calculated features to a file in the generated directory (create the directory first if it doesn’t exist yet). Instances allows saving and loading datasets to and from different file formats like CSV, JSON, ARFF, and the less common C4.5 file format. Let’s pick ARFF (Attribute-Relation File Format) here, which was developed specifically for machine learning datasets and is also nicely legible for humans:

# ...
dataset.to_arff('generated/articles.arff')

Run the script in your terminal to create the training dataset:

$ jruby create_dataset.rb # If you're using RVM, this is just `ruby...`

If you have a quick look into the generated .arff file, you’ll see a header with the customizable relation name and the defined attributes, followed by the actual data rows:

@relation Instances

@attribute athletics_hints_count numeric
@attribute cricket_hints_count numeric
# ...
@attribute class {athletics,cricket,football,rugby,tennis}

@data
1,0,0,0,0,47,1,1237,11,3,19,2,1,0,7,athletics
1,0,0,0,0,46,1,901,7,2,20,0,0,2,5,athletics
# ...
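
The format is simple enough to produce by hand. The following sketch renders a tiny ARFF document in plain Ruby, purely to illustrate the structure; to_arff_string is a hypothetical helper and no replacement for Instances#to_arff, and the attribute names and values are made up.

```ruby
# Hypothetical helper that renders a minimal ARFF document as a String:
# a relation name, numeric attributes, a nominal class attribute, and data rows.
def to_arff_string(relation, numeric_attributes, class_values, rows)
  lines = ["@relation #{relation}", '']
  numeric_attributes.each { |name| lines << "@attribute #{name} numeric" }
  lines << "@attribute class {#{class_values.join(',')}}"
  lines << '' << '@data'
  rows.each { |row| lines << row.join(',') }
  lines.join("\n")
end

puts to_arff_string(
  'Instances',
  %i(tennis_hints_count cricket_hints_count),
  %i(tennis cricket),
  [[4, 0, :tennis], [0, 5, :cricket]]
)
```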

With our training dataset compiled we can now go ahead and train our classifier and classify our test articles.

Training the Classifier

There are loads of different built-in classifiers from which we can choose. We could use Bayes classifiers, Neural Networks, Logistic Regression, Decision Trees, and many more. For simplicity we will use the RandomForest classifier. With RandomForest, we get an easy to configure classifier that is based on Decision Trees and performs well for common problems.
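
The rough idea behind combining many trees can be sketched in a few lines of plain Ruby. This is a deliberately simplified illustration and not Weka’s implementation: the “trees” below are hand-written rules instead of learned decision trees, and the labels are combined with a simple majority vote.

```ruby
# Each toy "tree" is a hand-written rule that maps a feature Hash to a label.
# Real decision trees are learned from the training data.
TOY_TREES = [
  ->(f) { f[:tennis_hints_count] > 0 ? :tennis : :cricket },
  ->(f) { f[:cricket_hints_count] > 2 ? :cricket : :tennis },
  ->(f) { f[:team_sport_hints_count] > 3 ? :cricket : :tennis }
].freeze

# Lets every tree vote and returns the label with the most votes.
def majority_vote(trees, features)
  votes = trees.map { |tree| tree.call(features) }
  votes.group_by { |label| label }.max_by { |_, group| group.size }.first
end

article = {
  tennis_hints_count: 2,
  cricket_hints_count: 0,
  team_sport_hints_count: 5
}

puts majority_vote(TOY_TREES, article) # prints "tennis" (2 votes against 1)
```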

It’s time to load the training dataset and train a RandomForest classifier. Let’s do it in a new file called run_classification.rb.

# run_classification.rb

require 'weka'

instances = Weka::Core::Instances.from_arff('generated/articles.arff')
instances.class_attribute = :class

classifier = Weka::Classifiers::Trees::RandomForest.new

# The -I option sets the number of decision trees (iterations) the forest is
# built from. The default is 100; we increase it to 200 here, hoping for
# better accuracy.
classifier.use_options('-I 200')

classifier.train_with_instances(instances)

Note that we have to manually set the class attribute after we loaded our dataset. This is necessary because there is no information about the position of our class attribute in our ARFF file (it doesn’t always have to be the last one!).

That was easy enough. Our test articles are already waiting for us!

Classifying Test Articles

We can now use our trained classifier to classify the (let’s pretend) unlabeled articles in our data/test directory.

Before we can pass our test articles to the classifier, we have to extract the same features from them as we did for our training texts. Luckily we can use our FileLoader and Text classes again:

# run_classification.rb

require 'weka'
require_relative 'file_loader' # <= added!
require_relative 'text'        # <= added!

# ...

article_types = %i(athletics cricket football rugby tennis)

def feature_list_for(article_type)
  files = FileLoader.new('data/test').files_for(article_type)

  files.map do |file|
    # Remember again that Text#features returns a Hash.
    # We only need the feature values.
    # The class value is still missing, but this time, we append a "missing"
    # as class value. You can use nil, '?' or Float::NAN.
    Text.new(file).features.values << '?'
  end
end

article_types.each do |article_type|
  feature_list = feature_list_for(article_type)

  feature_list.each do |features|
    label = classifier.classify(features)
    puts "* article about #{article_type} classified as #{label}"
  end
end

Here, we load our test texts and pass their extracted features to the classify method of our classifier. After classifying, we print the predicted classes to stdout.

Run the script and have a look at the output:

$ jruby run_classification.rb

* article about athletics classified as athletics
* article about athletics classified as athletics
* article about cricket classified as cricket
* article about cricket classified as cricket
* article about football classified as football
* article about football classified as football
* article about rugby classified as rugby
* article about rugby classified as rugby
* article about tennis classified as tennis
* article about tennis classified as tennis

Yay. Looks like all our articles got the right label!
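
With more than a handful of test articles, eyeballing the output gets tedious. If you collect the actual and predicted labels as pairs, the share of correct predictions is a one-liner. The accuracy helper and the pairs below are made up for illustration:

```ruby
# results: Array of [actual_label, predicted_label] pairs.
# Returns the fraction of pairs where both labels match.
def accuracy(results)
  return 0.0 if results.empty?

  correct = results.count { |actual, predicted| actual == predicted }
  correct.to_f / results.size
end

results = [
  [:tennis, :tennis],
  [:rugby, :football], # a made-up misclassification
  [:cricket, :cricket],
  [:football, :football]
]

puts accuracy(results) # prints 0.75
```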

This doesn’t mean that our classification system is perfect, though. A classifier’s performance can be evaluated with an approach called k-fold cross-validation. Weka gives us a cross_validate method on our classifier for exactly this.

Cross validation splits up the training dataset into N parts (folds) with a roughly equal number of instances. By default, it uses 10 folds. It then trains the classifier on 9 of the folds and classifies the leftover one. This is repeated until each fold has been classified by a classifier trained on the other 9. With this procedure, you get an idea of how well your classifier performs, because you already know all the labels and can calculate certain measures.
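
The splitting itself can be sketched in plain Ruby. This is a simplified illustration with a hypothetical k_folds helper: it deals the instances out round-robin and does no shuffling or stratification, which a real implementation would typically take care of.

```ruby
# Splits instances into k folds and returns k [training, test] pairs.
# Each fold serves as the test set once; the other folds form the training set.
def k_folds(instances, k)
  folds = Array.new(k) { [] }
  instances.each_with_index { |instance, i| folds[i % k] << instance }

  folds.each_index.map do |i|
    test     = folds[i]
    training = (folds[0...i] + folds[(i + 1)..-1]).flatten(1)
    [training, test]
  end
end

k_folds((1..10).to_a, 5).each do |training, test|
  puts "train on #{training.inspect}, test on #{test.inspect}"
end
```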

Let’s look at the 10-fold cross validation for our classifier:

evaluation = classifier.cross_validate(folds: 10)
puts evaluation.summary

# Correctly Classified Instances         602               82.8061 %
# Incorrectly Classified Instances       125               17.1939 %
# Kappa statistic                          0.7708
# Mean absolute error                      0.1223
# Root mean squared error                  0.2281
# Relative absolute error                 39.9808 %
# Root relative squared error             58.3231 %
# Coverage of cases (0.95 level)          97.9367 %
# Mean rel. region size (0.95 level)      52.7373 %
# Total Number of Instances              727

In the first two lines we see that our classifier classified only about 83% of our articles correctly. That’s actually not too bad for our small, contrived feature set. You can expect the performance to improve with a set of carefully selected features. It’s up to you now. Let the hunt for the best features begin!

Conclusion

In this article, we used JRuby to automatically categorize sports articles. We went through three basic steps for building a classification system: extracting features from raw texts, building a training dataset, and training a classifier. With our trained classifier, we classified unlabeled articles.

It looks like Ruby can also be your best friend for machine learning tasks and I really encourage you to check out the Weka framework and play around with it a bit. It’s not only a good exercise but also lets you discover that basic machine learning is actually not rocket science! Give it a try and let me know how it goes.

Paul Götze
Comics addict and fan of bad jokes. Aiming at polyglotism to find the right tools for all the small problems in the world. Currently solving some of them with Ruby, Python, and Elixir.
