Natural Language Processing with Ruby: n-grams

Blackboard "NLP"

Natural Language Processing (NLP for short) is the process of processing written dialect with a computer. The processing could be for anything – language modelling, sentiment analysis, question answering, relationship extraction, and much more. In this series, we’re going to look at methods for performing some basic and some more advanced NLP techniques on various forms of input data.

One of the most basic techniques in NLP is n-gram analysis, which is what we’ll start with in this article!

What Is n-gram Analysis

Given a very simple sentence:

The quick brown fox jumped over the lazy dog.

For some grammatical analysis, it would be very useful to split this sentence up into consecutive pairs of words:

(The, quick), (quick, brown), (brown, fox), …

In the above example, we’ve split the sentence up into consecutive tuples of words. The examples above are 2-grams, more commonly known as “bigrams”. A 1-gram is called a “unigram”, and a 3-gram is called a “trigram”. For n-grams with 4 or more members, we generally just stick to calling it a 4-gram, 5-gram, etc. Some examples are in order:

  • Unigram: (The), (quick), (brown), ...
  • Bigram: (The, quick), (quick, brown), (brown, fox), ...
  • Trigram: (The, quick, brown), (quick, brown, fox), (brown, fox, jumps), ...
  • 4-gram: (The, quick, brown, fox), (quick, brown, fox, jumps), (brown, fox, jumps, over), ...

How is this useful? Let’s say we have a huge collection of sentences (one of which is our “quick brown fox” sentence from above), and we have created a giant array of all the possible bigrams in it. We are then fed some user input that looks like this:

I really like brown quick foxes

If we also split this sentence up into bigrams, we could say with reasonable certainty that the bigram (brown, quick) is very likely grammatically incorrect because we normally encounter those two words the other way around. Applying n-gram analysis to text is a very simple and powerful technique used frequently in language modelling problems like the one we just showed, and as such is often the foundation of more advanced NLP applications (some of which we’ll explore in this series).

In the examples we’ve shown so far, the “grams” part of “n-grams” could be taken to mean “word”, but that doesn’t necessarily have to be the case. For example, in DNA sequencing, “grams” could mean one character in a base-pair sequence. Take the base-pair sequence “ATCGATTGAGCTCTAGCG” – we could make bigrams from this sequence to calculate which pairs are seen most often together, allowing us to predict what might come next in the sequence. For example, “A” is most often followed by “G”, so if we see an “A” next time, we could hypothesise that the next item in the sequence will be a “G”.

Choosing a Good Data Source

The one thing that the majority of n-gram analysis relies on is a data source. In the examples above, we needed that huge collection of sentences in order to assert the user-provided sentence had a mistake in it, We also needed enough data in our base-pair sequence in order to make our future DNA predictions.

What we need more generally is some good “training data”. We need some data we can trust that we can use to assert that our analysis is doing the right thing before we can take any old user-data and working with it.

For tasks specific to language analysis, it turns out building huge training data sets for n-gram analysis tasks like this have been undertaken by many academic institutions. These huge collections of textual data are called “text corpora” (that’s the plural, one of these collections is called a “text corpus”). These corpora are effectively tagged sets of written text.

In 1967, a seminal collection of American English was published, which is now known today as the “Brown Corpus”. This collection was compiled with the aim of creating an exhaustive record of as much American English as possible in use at the time. The result was a giant dataset of sentences which reflects, with reasonable accuracy, the frequency one might encounter certain words in normal American English sentences in 1967.

Following it’s initial publishing, it has since had “Part-of-Speech tagging” (often abbreviated as “POS tagging”) applied (hence being called a corpus now). POS tagging is a way of tagging each and every word in the entire corpus with it’s role (verb, noun). Wikipedia maintains a list of the tags the Brown Corpus uses, which is exhaustive to say the least!

The Brown Corpus remains one of the most popular corpora for analysis, as it’s free for non-commercial purposes and easy to obtain. There are lots of other corpora available, including Google’s n-grams Corpus which contains upwards of 150 billion words!

Obtaining the Brown Corpus

In this first part of the series, we’re going to obtain a copy of the Brown Corpus and do some simple n-gram analysis with it. The tagged version of the Brown Corpus is available in a ZIP file from http://nltk.googlecode.com/svn/trunk/nltk_data/packages/corpora/brown.zip.

Download and extract this ZIP file so that you are left with a brown directory containing a lot of .txt files:

sh
$ ll brown
total 10588
-rw-r--r-- 1 nathan nathan 20187 Dec 3 2008 ca01
-rw-r--r-- 1 nathan nathan 20357 Dec 3 2008 ca02
-rw-r--r-- 1 nathan nathan 20214 Dec 3 2008 ca03
(...)

Writing an n-grams Class

We need a way to generate a set of n-grams from a sentence. It should take a sentence, split it into chunks, and return consecutive members in groups of n long. Thankfully, this is trivial in Ruby:

def ngrams(n, string)
  string.split(' ').each_cons(n).to_a
end

The each_cons method is defined in the Enumerable module, and does exactly what we need by returning every consecutive possible set of n elements. It’s effectively a built-in method to do what we want once we have the input data into an Array format.

Let’s wrap this up in a really simple class:

class Ngram
  attr_accessor :options

  def initialize(target, options = { regex: / / })
    @target = target
    @options = options
  end

  def ngrams(n)
    @target.split(@options[:regex]).each_cons(n).to_a
  end

  def unigrams
    ngrams(1)
  end

  def bigrams
    ngrams(2)
  end

  def trigrams
    ngrams(3)
  end
end

We add a constructor method that takes the target string to work on, and allow people to optionally define how we break apart the given target before generating the n-grams (It defaults to splitting it up into words, but you could change it to splitting into characters, as an example.)

We then define three helper methods for our most common n-grams, namely unigrams, bigrams and trigrams.

Let’s see if it works:

bigrams = Ngram.new("The quick brown fox jumped over the lazy dog").bigrams
puts bigrams.inspect # = >[["The", "quick"], ["quick", "brown"], ["brown", "fox"], ["fox", "jumped"], ["jumped", "over"], ["over", "the"], ["the", "lazy"], ["lazy", "dog"]]

Success! We got exactly the result we were after and wrote barely any code to do it! Now let’s use this class to do something cool with the Brown Corpus we’ve downloaded.

Extracting Sentences From the Corpus

In its raw form, the corpus we obtained contains tagged sentences. In order to apply any sort of n-gram analysis on the contents of the corpus, we need to extract the raw sentences by removing the tags and keeping only the raw words.

A basic sentence in the tagged corpus looks something like this:

The/at Fulton/np-tl County/nn-tl Grand/jj-tl Jury/nn-tl said/vbd Friday/nr an/at investigation/nn of/in Atlanta’s/np$ recent/jj primary/nn election/nn produced/vbd / no/at evidence/nn ”/” that/cs any/dti irregularities/nns took/vbd place/nn ./.

Words and tags are separated with a /. The tags describe the sort of words we are looking at, for example “noun” or a “verb”, and often the tense and the role of the word, for example “past tense” and “plural”.

Given a line, we know the first thing we need to do is get an array of all the words:

words = sentence.strip.split(' ')

We call strip just to clean off any preceding or trailing white-space we may have, and then we split the sentence up into words by using the space character as a delimiter.

Once we have that array of words, we need to split each word and keep only the first part:

words.map do |word|
  word.split('/').first
end.join(' ')

This will, on lines which contain words and tags, give us back a sentence without the tags and any trailing white-space removed.

One other thing we need to take care of is blank lines, which are littered throughout the corpus. Presuming we have a path to a corpus file, here’s a complete script to turn it into an array of sentences:

def sentences(path)
  File.open(@path) do |file|
    file.each_line.each_with_object([]) do |line, acc|
      stripped_line = line.strip

      unless stripped_line.nil? || stripped_line.empty?
        acc << line.split(' ').map do |word|
          word.split('/').first
        end.join(' ')
      end
    end
  end
end

One interesting method we call in here is each_with_object. This comes courtesy of the Enumerable module, and is effectively a specialized version of inject. Here’s an example without using either of those methods:

sentences = []
file.each_line do |line|
  # ...
  unless stripped_line.blank?
    # ...
    sentences << words.map do |word|
      # ...
    end
  end
end

You can see we now have to manage the sentences variable ourselves. The inject method gets us one step closer:

file.each_line.inject([]) do |acc, line|
  # ...
  unless stripped_line.blank?
    # ...
    acc << words.map do |word|
      # ...
    end
  end
  acc
end

However, we have to return the acc (short for “accumulator” which is a common nomenclature for the object one is building when using inject or other fold style variants.) If we don’t, the acc variable on the next loop run will become whatever we returned in the last loop. Thus, each_with_object was born, which takes care of always returning the object you are building without you having to worry about it.

Tip: One change between inject and each_with_object is that the acc parameter is passed to the block in the opposite order. Be sure to check the documentation before using these methods to be sure you’ve got it the right way around!

Now that we have a method written to get the sentences from a single file from the corpus, let’s wrap it up in a class:

class BrownCorpusFile
  def initialize(path)
    @path = path
  end

  def sentences
    @sentences ||= File.open(@path) do |file|
      file.each_line.each_with_object([]) do |line, acc|
        stripped_line = line.strip

        unless stripped_line.nil? || stripped_line.empty?
          acc << line.split(' ').map do |word|
            word.split('/').first
          end.join(' ')
        end
      end
    end

  end
end

Supercharging the Corpus Class

In order to do useful analysis, we need to write one final class to allow us to do the n-gram analysis on one or many of the files in the corpus.

class Corpus
  def initialize(glob, klass)
    @glob = glob
    @klass = klass
  end

  def files
    @files ||= Dir[@glob].map do |file|
      @klass.new(file)
    end
  end

  def sentences
    files.map do |file|
      file.sentences
    end.flatten
  end

  def ngrams(n)
    sentences.map do |sentence|
      Ngram.new(sentence).ngrams(n)
    end.flatten(1)
  end

  def unigrams
    ngrams(1)
  end

  def bigrams
    ngrams(2)
  end

  def trigrams
    ngrams(3)
  end
end

The class is super simple. We have a constructor that takes a glob pattern to select files in our corpus folder (the brown folder that you extracted previously), and a class to pass each file that is found to. If you’re not familiar with glob patterns, they’re often used in the terminal to wildcard things, eg. *.txt is a glob pattern that will find all files ending in .txt.

We then define files and sentences methods. The former uses the glob to find all matching files, loops over them, and creates a new instance of the class we passed to the constructor. The later calls the files method, and loops over those created classes, calling the sentences method on the class, and flattening the result into a single level deep Array.

Now for some convenience methods that wrap our Ngram class. The ngrams method calls sentences and returns an array of n-grams for those sentences. Note that when we call flatten, we ask it to only flatten one level, otherwise we’ll lose the n-gram Array. The unigrams, bigrams and trigrams methods are just helper methods to make things look nicer.

Doing Some n-gram Analysis

We have a basic tool ready, so let’s give it a go and do some analysis as a test on the corpus by trying to see how many proper nouns we can extract in a naïve way. This technique is called “named-entity recognition” in NLP lingo.

corpus = Corpus.new('brown/c*', BrownCorpusFile)

We start by creating a new instance of the Corpus class we wrote, and tell it the corpus we are looking at uses the BrownCorpusFile format.

Opening the very first corpus file in a text editor (ca01), there are lots of proper nouns (including places and names) mentioned with the word “of” beforehand. An example is the sentence “Janet Jossy of North Plains [...]“. Most proper nouns will likely be 2 words or less, so let’s loop over all of the sentences in trigrams, looking for anything that has the first member “of” and the second member starting with a capital letter. We’ll include the third member of the trigram in the result if it also starts with a capital letter.

The output format we’re looking for is a Hash that looks something like the following format:

{
  "Some Noun" => 2
}

Where 2 in this example is the number of times the given proper noun occurred.

Here’s the solution to this problem in full:

capitals = ('A'..'Z')
results = Hash.new(0)

corpus.trigrams.each do |trigram|
  if trigram.first == "of" && capitals.include?(trigram[1].chars.first)
    result = [trigram[1]]

    if capitals.include?(trigram[2].chars.first)
      result << trigram[2]
    end

    results[result] += 1

  end
end

The example starts by defining a range which contains all the capital letters. This range is used later on to check whether the trigram members begin with a capital. We also create a results Hash, although we use Hash.new(0) to do so which sets all values to 0 by default. This allows us to increment values of each key without ever having to worry if the key was created and set before.

Then, the code starts building the results by looping over all of the trigrams in the corpus. For each trigram:

  • Check whether the first member is “of”.
  • Check whether the first character of the second member is a capital letter.
  • If both of these are true, prepare to store just the second member by creating an Array with just the second member in it.
  • If the third member also starts with a capital letter, add the third member to the Array.
  • Join the Array with a space character and store it in the results Hash, incrementing the value of that key by one.

Eventually we end up with the results Hash populated with key/value pairs. The keys are the proper nouns we’ve found and the values are the number of times we’ve seen that proper noun in the corpus. Here’s the top 10 after running that code over the entire Brown Corpus:

English: 22
Washington: 23
India: 23
New York: 26
Europe: 26
State: 35
American: 46
America: 61
God: 100
Af: 104

Barring “Af” (which if you delve into the corpus, you would discover is the name given to a function for accelerometer scale functions in physics – how thrilling!), the rest of the results are very interesting – though what conclusions you can draw from them is entirely up to you!

Source Code

The source code for everything we wrote in this article is available in the repository on GitHub: https://github.com/nathankleyn/rubynlp. Specifically, see the file for this part of the series at https://github.com/nathankleyn/rubynlp/blob/master/examples/part_one.rb.

Next Time

In the next part of this series, we’re going to look into being more intelligent with our n-grams by exploring Markov chaining. This fascinating application of mathematical probability allows us to generate pseudo-random text by approximating a language model.

Free book: Jump Start HTML5 Basics

Grab a free copy of one our latest ebooks! Packed with hints and tips on HTML5's most powerful new features.

  • Anonymous

    https://github.com/seamusabshere/fuzzy_match uses bigrams! (shameless plug)

  • Elad Meidar

    Awesome. simply awesome

  • MadBomber

    Nicely done; looking forward to reading the rest. fyi: article sample code does not match repo code.

  • Kendall Gifford

    In one of the earlier listings where you’re still building to the final file parsing code, you have a typo:

    def sentences(path)
    File.open(@path) do |file|

    The second line should be referencing the “path” method argument, not the “@path” instance variable.

  • Kendall Gifford

    I’d also propose that the Ngram class’s #initialize method optionally take a block which, if present, is used to split the target text string into an array of words (instead of the String#split method). This refactor would allow you to simplify the code that reads sentences from corpus text files, deferring the removal of the tags off individual words by using the proposed block form of Ngram#initialize.

    The advantage here is that you’re not splitting all the text into words to remove the tags, reassembling them into sentences, and then re-splitting them again later.