Natural Language Processing with Ruby: n-grams


Natural Language Processing (NLP for short) is the use of computers to process written language. The processing could be for almost anything: language modelling, sentiment analysis, question answering, relationship extraction, and much more. In this series, we’re going to look at methods for performing some basic, and some more advanced, NLP techniques on various forms of input data.

One of the most basic techniques in NLP is n-gram analysis, which is what we’ll start with in this article!

What Is n-gram Analysis?

Given a very simple sentence:

The quick brown fox jumped over the lazy dog.

For some grammatical analysis, it would be very useful to split this sentence up into consecutive pairs of words:

(The, quick), (quick, brown), (brown, fox), …

In the above example, we’ve split the sentence up into consecutive tuples of words. The examples above are 2-grams, more commonly known as “bigrams”. A 1-gram is called a “unigram”, and a 3-gram is called a “trigram”. For n-grams with 4 or more members, we generally just call them 4-grams, 5-grams, and so on. Some examples are in order:

  • Unigram: (The), (quick), (brown), ...
  • Bigram: (The, quick), (quick, brown), (brown, fox), ...
  • Trigram: (The, quick, brown), (quick, brown, fox), (brown, fox, jumped), ...
  • 4-gram: (The, quick, brown, fox), (quick, brown, fox, jumped), (brown, fox, jumped, over), ...

How is this useful? Let’s say we have a huge collection of sentences (one of which is our “quick brown fox” sentence from above), and we have created a giant array of all the possible bigrams in it. We are then fed some user input that looks like this:

I really like brown quick foxes

If we also split this sentence up into bigrams, we could say with reasonable certainty that the bigram (brown, quick) is very likely grammatically incorrect because we normally encounter those two words the other way around. Applying n-gram analysis to text is a very simple and powerful technique used frequently in language modelling problems like the one we just showed, and as such is often the foundation of more advanced NLP applications (some of which we’ll explore in this series).
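To make that concrete, here’s a minimal sketch of the idea, using a tiny, hypothetical collection of trusted sentences in place of a real corpus:

# A tiny, hypothetical stand-in for a large collection of trusted sentences.
trusted = [
  "The quick brown fox jumped over the lazy dog",
  "I really like brown foxes"
]

known_bigrams = trusted.flat_map { |sentence| sentence.split(' ').each_cons(2).to_a }

input = "I really like brown quick foxes"

# Flag any bigram in the input that never appears in the trusted data.
suspicious = input.split(' ').each_cons(2).reject { |pair| known_bigrams.include?(pair) }
puts suspicious.inspect # => [["brown", "quick"], ["quick", "foxes"]]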

In the examples we’ve shown so far, the “gram” part of “n-gram” could be taken to mean “word”, but that doesn’t necessarily have to be the case. For example, in DNA sequencing, a “gram” could be a single character in a base-pair sequence. Take the base-pair sequence “ATCGATTGAGCTCTAGCG”: we could make bigrams from this sequence to calculate which pairs are seen together most often, allowing us to predict what might come next in the sequence. If “A” turned out to be most often followed by “G”, then the next time we saw an “A” we could hypothesise that the next item in the sequence will be a “G”.
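As a quick sketch of what that character-level counting could look like in Ruby (the sequence here is just the short example above, so the counts are far too small to predict anything reliably):

sequence = "ATCGATTGAGCTCTAGCG"

# Count how often each pair of consecutive characters appears.
counts = Hash.new(0)
sequence.chars.each_cons(2) { |pair| counts[pair.join] += 1 }

puts counts.sort_by { |_pair, count| -count }.inspect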

Choosing a Good Data Source

The one thing that the majority of n-gram analysis relies on is a data source. In the examples above, we needed that huge collection of sentences in order to assert that the user-provided sentence had a mistake in it. We also needed enough data in our base-pair sequence in order to make our DNA predictions.

What we need, more generally, is some good “training data”: data we can trust, which we can use to confirm that our analysis is doing the right thing before we take any old user data and work with it.

For tasks specific to language analysis, it turns out that many academic institutions have already undertaken the work of building huge training data sets for n-gram analysis tasks like this. These huge collections of textual data are called “text corpora” (that’s the plural; a single collection is called a “text corpus”). These corpora are effectively tagged sets of written text.

In 1967, a seminal collection of American English was published, known today as the “Brown Corpus”. This collection was compiled with the aim of creating an exhaustive record of as much American English in use at the time as possible. The result was a giant dataset of sentences which reflects, with reasonable accuracy, the frequency with which one might have encountered certain words in normal American English sentences of the era.

Following its initial publication, it has since had “Part-of-Speech tagging” (often abbreviated as “POS tagging”) applied, hence it now being called a corpus. POS tagging is a way of tagging each and every word in the entire corpus with its grammatical role (verb, noun, and so on). Wikipedia maintains a list of the tags the Brown Corpus uses, which is exhaustive to say the least!

The Brown Corpus remains one of the most popular corpora for analysis, as it’s free for non-commercial purposes and easy to obtain. There are lots of other corpora available, including Google’s n-grams Corpus which contains upwards of 150 billion words!

Obtaining the Brown Corpus

In this first part of the series, we’re going to obtain a copy of the Brown Corpus and do some simple n-gram analysis with it. The tagged version of the Brown Corpus is available in a ZIP file from http://nltk.googlecode.com/svn/trunk/nltk_data/packages/corpora/brown.zip.

Download and extract this ZIP file so that you are left with a brown directory containing a lot of .txt files:

$ ll brown
total 10588
-rw-r--r-- 1 nathan nathan 20187 Dec 3 2008 ca01
-rw-r--r-- 1 nathan nathan 20357 Dec 3 2008 ca02
-rw-r--r-- 1 nathan nathan 20214 Dec 3 2008 ca03
(...)

Writing an n-grams Class

We need a way to generate a set of n-grams from a sentence. It should take a sentence, split it into chunks, and return consecutive members in groups of n long. Thankfully, this is trivial in Ruby:

def ngrams(n, string)
  string.split(' ').each_cons(n).to_a
end

The each_cons method is defined in the Enumerable module, and does exactly what we need: it returns every possible set of n consecutive elements. It’s effectively a built-in method to do what we want once we have the input data in Array form.
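For example:

ngrams(2, "The quick brown fox")
# => [["The", "quick"], ["quick", "brown"], ["brown", "fox"]]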

Let’s wrap this up in a really simple class:

class Ngram
  attr_accessor :options

  def initialize(target, options = { regex: / / })
    @target = target
    @options = options
  end

  def ngrams(n)
    @target.split(@options[:regex]).each_cons(n).to_a
  end

  def unigrams
    ngrams(1)
  end

  def bigrams
    ngrams(2)
  end

  def trigrams
    ngrams(3)
  end
end

We add a constructor that takes the target string to work on, and allows callers to optionally define how we break apart the given target before generating the n-grams. (It defaults to splitting the string into words, but you could change it to split into characters instead, as an example.)
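For instance, splitting on an empty regex gives character-level n-grams, which suits the DNA example from earlier:

dna = Ngram.new("ATCGATTGAGCTCTAGCG", regex: //)
puts dna.bigrams.first(3).inspect
# => [["A", "T"], ["T", "C"], ["C", "G"]]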

We then define three helper methods for our most common n-grams, namely unigrams, bigrams and trigrams.

Let’s see if it works:

bigrams = Ngram.new("The quick brown fox jumped over the lazy dog").bigrams
puts bigrams.inspect
# => [["The", "quick"], ["quick", "brown"], ["brown", "fox"], ["fox", "jumped"], ["jumped", "over"], ["over", "the"], ["the", "lazy"], ["lazy", "dog"]]

Success! We got exactly the result we were after and wrote barely any code to do it! Now let’s use this class to do something cool with the Brown Corpus we’ve downloaded.

Extracting Sentences From the Corpus

In its raw form, the corpus we obtained contains tagged sentences. In order to apply any sort of n-gram analysis on the contents of the corpus, we need to extract the raw sentences by removing the tags and keeping only the raw words.

A basic sentence in the tagged corpus looks something like this:

The/at Fulton/np-tl County/nn-tl Grand/jj-tl Jury/nn-tl said/vbd Friday/nr an/at investigation/nn of/in Atlanta's/np$ recent/jj primary/nn election/nn produced/vbd ``/`` no/at evidence/nn ''/'' that/cs any/dti irregularities/nns took/vbd place/nn ./.

Words and tags are separated by a /. The tags describe the sort of word we are looking at, for example a noun or a verb, and often the tense and role of the word, for example past tense or plural.

Given a line, we know the first thing we need to do is get an array of all the words:

words = sentence.strip.split(' ')

We call strip just to clean off any leading or trailing white-space, and then we split the sentence up into words using the space character as a delimiter.

Once we have that array of words, we need to split each word and keep only the first part:

words.map do |word|
  word.split('/').first
end.join(' ')

This will, on lines which contain words and tags, give us back a sentence with the tags stripped out.
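Putting those two steps together on a shortened, hypothetical tagged line:

line = "The/at Fulton/np-tl County/nn-tl said/vbd Friday/nr"
words = line.strip.split(' ')
puts words.map { |word| word.split('/').first }.join(' ')
# => "The Fulton County said Friday"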

One other thing we need to take care of is blank lines, which are littered throughout the corpus. Presuming we have a path to a corpus file, here’s a complete script to turn it into an array of sentences:

def sentences(path)
  File.open(path) do |file|
    file.each_line.each_with_object([]) do |line, acc|
      stripped_line = line.strip

      unless stripped_line.nil? || stripped_line.empty?
        acc << line.split(' ').map do |word|
          word.split('/').first
        end.join(' ')
      end
    end
  end
end

One interesting method we call in here is each_with_object. This comes courtesy of the Enumerable module, and is effectively a specialized version of inject. Here’s an example without using either of those methods:

sentences = []
file.each_line do |line|
  # ...
  unless stripped_line.blank?
    # ...
    sentences << words.map do |word|
      # ...
    end
  end
end

You can see we now have to manage the sentences variable ourselves. The inject method gets us one step closer:

file.each_line.inject([]) do |acc, line|
  # ...
  unless stripped_line.blank?
    # ...
    acc << words.map do |word|
      # ...
    end
  end
  acc
end

However, we have to return acc ourselves (it’s short for “accumulator”, a common name for the object being built up when using inject or other fold-style methods). If we don’t, then on the next iteration acc will be whatever the block returned the last time around (for example nil when the unless branch isn’t taken), which breaks the accumulation. This is why each_with_object was born: it takes care of always passing along the object you are building, without you having to worry about it.

Tip: One difference between inject and each_with_object is that the block parameters are passed in the opposite order: inject yields (accumulator, element), while each_with_object yields (element, accumulator). Be sure to check the documentation before using these methods to make sure you’ve got it the right way around!
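A quick toy comparison (not related to the corpus code) makes the ordering obvious:

[1, 2, 3].inject([]) { |acc, n| acc << n * 2 }            # block receives (accumulator, element)
[1, 2, 3].each_with_object([]) { |n, acc| acc << n * 2 }  # block receives (element, accumulator)
# Both expressions return [2, 4, 6]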

Now that we have a method written to get the sentences from a single file from the corpus, let’s wrap it up in a class:

class BrownCorpusFile
  def initialize(path)
    @path = path
  end

  def sentences
    @sentences ||= File.open(@path) do |file|
      file.each_line.each_with_object([]) do |line, acc|
        stripped_line = line.strip

        unless stripped_line.nil? || stripped_line.empty?
          acc << line.split(' ').map do |word|
            word.split('/').first
          end.join(' ')
        end
      end
    end
  end
end
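Assuming you extracted the corpus into a brown directory as shown earlier, you can try the class out on a single file (ca01 is simply the first file in the corpus):

ca01 = BrownCorpusFile.new('brown/ca01')
puts ca01.sentences.length
puts ca01.sentences.first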

Supercharging the Corpus Class

In order to do useful analysis, we need to write one final class to allow us to do the n-gram analysis on one or many of the files in the corpus.

class Corpus
  def initialize(glob, klass)
    @glob = glob
    @klass = klass
  end

  def files
    @files ||= Dir[@glob].map do |file|
      @klass.new(file)
    end
  end

  def sentences
    files.map do |file|
      file.sentences
    end.flatten
  end

  def ngrams(n)
    sentences.map do |sentence|
      Ngram.new(sentence).ngrams(n)
    end.flatten(1)
  end

  def unigrams
    ngrams(1)
  end

  def bigrams
    ngrams(2)
  end

  def trigrams
    ngrams(3)
  end
end

The class is super simple. We have a constructor that takes a glob pattern to select files in our corpus folder (the brown folder that you extracted previously), and a class to which each file that is found will be passed. If you’re not familiar with glob patterns, they’re often used in the terminal to match files with wildcards, e.g. *.txt is a glob pattern that will find all files ending in .txt.

We then define files and sentences methods. The former uses the glob to find all matching files, loops over them, and creates a new instance of the given class for each one. The latter calls the files method, loops over the resulting instances calling sentences on each, and flattens the result into a single flat Array.

Now for some convenience methods that wrap our Ngram class. The ngrams method calls sentences and returns an array of n-grams for those sentences. Note that when we call flatten here, we ask it to flatten only one level, otherwise we’d lose the n-gram Arrays themselves. The unigrams, bigrams and trigrams methods are just helpers to make things read more nicely.
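As a quick sanity check (the brown/ca* glob here is just an arbitrary subset of the corpus files):

corpus = Corpus.new('brown/ca*', BrownCorpusFile)
puts corpus.files.length
puts corpus.bigrams.length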

Doing Some n-gram Analysis

We have a basic tool ready, so let’s give it a go by doing some test analysis on the corpus: we’ll try to see how many proper nouns we can extract in a naïve way. This kind of task is called “named-entity recognition” in NLP lingo.

corpus = Corpus.new('brown/c*', BrownCorpusFile)

We start by creating a new instance of the Corpus class we wrote, and tell it the corpus we are looking at uses the BrownCorpusFile format.

If you open the very first corpus file (ca01) in a text editor, you’ll see lots of proper nouns (including places and names) preceded by the word “of”. An example is the sentence “Janet Jossy of North Plains […]”. Most proper nouns will likely be two words or fewer, so let’s loop over all of the sentences as trigrams, looking for anything whose first member is “of” and whose second member starts with a capital letter. We’ll include the third member of the trigram in the result if it also starts with a capital letter.

The output we’re looking for is a Hash that looks something like this:

{
  "Some Noun" => 2
}

Where 2 in this example is the number of times the given proper noun occurred.

Here’s the solution to this problem in full:

capitals = ('A'..'Z')
results = Hash.new(0)

corpus.trigrams.each do |trigram|
  if trigram.first == "of" && capitals.include?(trigram[1].chars.first)
    result = [trigram[1]]

    if capitals.include?(trigram[2].chars.first)
      result << trigram[2]
    end

    results[result.join(' ')] += 1
  end
end

The example starts by defining a range containing all the capital letters, which is used later to check whether trigram members begin with a capital. We also create a results Hash using Hash.new(0), which makes 0 the default value for missing keys. This allows us to increment the value for any key without having to worry about whether that key has been set before.

Then, the code starts building the results by looping over all of the trigrams in the corpus. For each trigram:

  • Check whether the first member is “of”.
  • Check whether the first character of the second member is a capital letter.
  • If both of these are true, prepare to store just the second member by creating an Array with just the second member in it.
  • If the third member also starts with a capital letter, add the third member to the Array.
  • Join the Array with a space character and store it in the results Hash, incrementing the value of that key by one.

Eventually we end up with the results Hash populated with key/value pairs. The keys are the proper nouns we’ve found and the values are the number of times we’ve seen that proper noun in the corpus. Here’s the top 10 after running that code over the entire Brown Corpus:

English: 22
Washington: 23
India: 23
New York: 26
Europe: 26
State: 35
American: 46
America: 61
God: 100
Af: 104

Barring “Af” (which, if you delve into the corpus, you’ll discover is the name given to a function for accelerometer scale functions in physics – how thrilling!), the rest of the results are very interesting – though what conclusions you can draw from them is entirely up to you!
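The snippet we wrote only builds the results Hash; one minimal way to print a top-10 listing like the one above is to sort it by value:

results.sort_by { |_noun, count| count }.last(10).each do |noun, count|
  puts "#{noun}: #{count}"
end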

Source Code

The source code for everything we wrote in this article is available in the repository on GitHub: https://github.com/nathankleyn/rubynlp. Specifically, see the file for this part of the series at https://github.com/nathankleyn/rubynlp/blob/master/examples/part_one.rb.

Next Time

In the next part of this series, we’re going to look into being more intelligent with our n-grams by exploring Markov chaining. This fascinating application of mathematical probability allows us to generate pseudo-random text by approximating a language model.

Frequently Asked Questions (FAQs) on Natural Language Processing with Ruby and N-grams

What are the practical applications of N-grams in Natural Language Processing (NLP)?

N-grams are extensively used in NLP for various applications. They are used in text prediction where the next word in a sentence is predicted based on the previous ‘n’ words. This is commonly seen in predictive typing on smartphones. N-grams are also used in machine translation to maintain the context and semantics of the source language in the translated text. They are used in information retrieval systems to improve search accuracy. In speech recognition, N-grams help in understanding the context and reducing errors.

How does Ruby handle N-grams in NLP?

Ruby, with its powerful and flexible syntax, handles N-grams quite efficiently. It uses the ‘ngram’ gem to generate N-grams from the given text. The ‘ngram’ gem provides methods to generate unigrams, bigrams, trigrams, and so on. It also provides methods to calculate the frequency of each N-gram in the text. This makes it easier to analyze and process the text.

Can I use Ruby for large scale NLP projects?

Yes, Ruby can be used for large scale NLP projects. Although Python is more popular for NLP due to its extensive libraries, Ruby has its own set of powerful libraries and gems for NLP. Ruby’s syntax is clean and easy to understand, which makes it a good choice for large scale projects. However, for very large datasets, you might need to consider performance issues and choose the language and tools accordingly.

How can I improve the accuracy of my NLP model in Ruby?

The accuracy of an NLP model depends on various factors such as the quality of the dataset, the choice of NLP techniques, and the tuning of the model parameters. In Ruby, you can use various preprocessing techniques such as tokenization, stemming, and lemmatization to improve the quality of your dataset. You can also experiment with different NLP techniques and tune the model parameters to improve the accuracy.

What are some good resources to learn NLP with Ruby?

There are several good resources to learn NLP with Ruby. The Ruby NLP GitHub repository is a good starting point. It provides a list of Ruby libraries for NLP and text processing. The ‘nlp-with-ruby’ GitHub repository provides examples of various NLP tasks using Ruby. There are also several online tutorials and courses that cover NLP with Ruby.

How do N-grams help in text prediction?

N-grams help in text prediction by providing the context. In text prediction, the next word is predicted based on the previous ‘n’ words. N-grams provide a way to capture this context. For example, in a bigram model (where n=2), the next word is predicted based on the previous word. The prediction is done based on the probability of a word occurring after the previous word, which is calculated from the training data.
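As a rough illustration (this isn’t code from the article; it’s a sketch that assumes you have an Array of [word, next_word] pairs, such as the one returned by corpus.bigrams earlier):

# Predict the most likely next word given the previous word, based on raw bigram counts.
def predict_next(bigrams, word)
  followers = Hash.new(0)
  bigrams.each { |first, second| followers[second] += 1 if first == word }
  followers.max_by { |_follower, count| count }&.first
end

puts predict_next(corpus.bigrams, "the")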

How can I use N-grams for text classification in Ruby?

In Ruby, you can use the ‘ngram’ gem to generate N-grams from the text. These N-grams can be used as features for text classification. The frequency of each N-gram in the text can be used as the feature value. You can then use any classification algorithm such as Naive Bayes, SVM, or logistic regression for the classification task.

What are the limitations of using N-grams in NLP?

While N-grams are a powerful tool in NLP, they have some limitations. They fail to capture long-range dependencies between words as they only consider ‘n’ previous words. They also suffer from the problem of data sparsity. As the value of ‘n’ increases, the number of possible N-grams increases exponentially, but the frequency of most of the N-grams in the training data will be very low. This leads to poor estimates of the probabilities of these N-grams.

Can I use N-grams for sentiment analysis in Ruby?

Yes, N-grams can be used for sentiment analysis in Ruby. The ‘ngram’ gem can be used to generate N-grams from the text. These N-grams can be used as features for sentiment analysis. The sentiment of a text can be determined based on the sentiment scores of the N-grams in the text.

How can I handle out-of-vocabulary words in N-grams?

Out-of-vocabulary words are words that are not present in the training data but appear in the test data. One way to handle out-of-vocabulary words in N-grams is to use a special token to represent all out-of-vocabulary words. Another way is to use smoothing techniques such as Laplace smoothing or Good-Turing smoothing, which assign a non-zero probability to out-of-vocabulary words.
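For example, Laplace (add-one) smoothing estimates the bigram probability P(w2 | w1) as (count(w1, w2) + 1) / (count(w1) + V), where V is the vocabulary size. A minimal sketch, assuming hypothetical bigram_counts and unigram_counts Hashes (created with Hash.new(0)) built from your training data:

# Laplace-smoothed bigram probability: (count(w1, w2) + 1) / (count(w1) + vocabulary size).
def smoothed_probability(bigram_counts, unigram_counts, w1, w2, vocab_size)
  (bigram_counts[[w1, w2]] + 1.0) / (unigram_counts[w1] + vocab_size)
end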

Nathan Kleyn

Hi! I currently work for Intent HQ as Director of Engineering, writing functional Scala by day and Rust by night. I write for SitePoint about Ruby, and I’m on Twitter and on GitHub. You can get in touch with me at mail/at/nathankleyn.com.
