Counting Real Words with Ruby

By Abder-Rahman Ali

Ruby on Medicine

Wait a minute, do you mean this blog post is just about counting words in a document? First, it is easy enough, and second, this can be done by any word processor on the fly.

You are right that it is about counting words. However, the aim of this blog post is to show how flexible Ruby can be in meeting our requirements when we decide for ourselves what counts as a word. This is in contrast to the word processors we use, which most likely cannot count based on such criteria.

Let me clarify this point a bit further. When a word processor is counting words, it takes the white space as a delimiter. As a result, what comes after that will be considered a new word and included in the word count.

What if you have a number, a standalone letter, an email address, etc.? Do you consider those words? I don’t. Word processors will not give you the option to filter the counted words.

For instance, in a small experiment I did with Microsoft Word, when I entered the following text:

Ruby 1 2 3 " email@email.com

The word count was 6!
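
The same whitespace-delimited count is easy to reproduce in Ruby. This is a minimal sketch, assuming the sample text above is held in a string (the variable names sample and tokens are only for illustration):

```ruby
# String#split with no arguments splits on runs of whitespace,
# which mirrors the delimiter-based count described above.
sample = 'Ruby 1 2 3 " email@email.com'
tokens = sample.split

puts tokens.inspect
puts tokens.size # prints 6
```

Every token is counted, whether it is a number, a lone quotation mark, or an email address.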

Well, what do we mean by a word anyway? As defined on the Oxford Dictionaries website, a word is:

A single distinct meaningful element of speech or writing, used with others (or sometimes alone) to form a sentence and typically shown with a space on either side when written or printed.

Looking at this definition, the word count of the above text should evaluate to 1 and not 6. What should we do in this case? This is where the power and flexibility of Ruby can save the day.

Let’s dive into the tutorial and see how we can tell Ruby what and what not to consider as words in practice.

OMIM® – Online Mendelian Inheritance in Man®

In the early 1960s, Dr. Victor A. McKusick initiated a database that served as a catalog of Mendelian traits and disorders. At the time, it was called Mendelian Inheritance in Man (MIM). The online version, OMIM®, which is a comprehensive compendium of human genes and genetic phenotypes, is updated on a daily basis and is available for free. It was created in 1985 and made available on the internet starting in 1987. The OMIM text contains information on all known Mendelian disorders and over 15,000 genes.

Well, this valuable data is what we will be working with in this tutorial!

Get the OMIM® File

In this step, we will be downloading the OMIM® text file, which can be obtained using the following steps:

  1. Go to this anonymous FTP address: ftp://ftp.ncbi.nih.gov. A connection dialog box should appear.

  2. Choose Guest beside Connect as:, and then click the Connect button. You will then see the server's directory listing.

  3. Download the omim.txt.Z file (66.3 MB), which can be found in the /repository/OMIM/ARCHIVE directory.

  4. Uncompress the file to get omim.txt (151.2 MB).

Counting the Number of Words

Having the text file we want to work with, let’s write a Ruby script that will return the number of words (the traditional way of counting). The script for performing this task can be written as follows:

# Count whitespace-delimited tokens, one line at a time.
number_of_words = 0
# File.foreach reads the file line by line and closes it when done.
File.foreach('omim.txt') { |line| number_of_words += line.split.size }
puts number_of_words

You should get this large number: 22451516

Counting Only What You Consider Words

In this section, I will demonstrate some scenarios on how we can tell Ruby what to and not to consider as words when counting.

Scenario 1: Don’t Treat Standalone Numbers as Words

As mentioned earlier, Microsoft Word returned 6 as the word count for: Ruby 1 2 3 " email@email.com

It thus considered the numbers as words. Let’s fix that with Ruby. Regular Expressions are very handy for instructing Ruby on what we mean by standalone numbers. I discuss regular expressions a bit in one of my other blog posts: Hunting For The Gene Sequence.

Let’s take this step-by-step. The first thing we want to do is anchor the pattern to the start and the end of the string (i.e., the whole token must be a number, not just part of it). In this case, we can use \A and \Z to refer to the start and the end of the string, respectively.

After that, we want to specify that a number may be preceded by a minus (-) or a plus sign (+). This can be written as [-+].

A nice symbol we can use in regular expressions is the question mark symbol ?. The question mark simply says to match zero or one of the previous element. For instance, if we write [-+]?, this means that the value can be preceded by either -, +, or nothing.

We would now like to tell the regular expression to allow zero or more digits. This can be written as [0-9]*. So, we have digits in the range 0-9, and the asterisk * means to match zero or more of the previous element. Thus, if we don’t have any digit at this point, that’s fine. We can also have values such as 01, 6, 9, 54, 565346, and so on.

Since we can have floating point numbers, we may encounter a dot . in some values (e.g., 5.43). Adding \.? to the regular expression says that a literal . is optional (zero or one occurrence). The backslash is needed because an unescaped dot matches any character.

Finally, since . will be followed by a value, we can use: [0-9]+. The + symbol here means one or more (but not zero) of the previous characters.

The final regular expression to check if we have a numeric value now looks as follows: \A[-+]?[0-9]*\.?[0-9]+\Z
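
To see how the pattern behaves, it can be tried against a handful of tokens. This is a small sketch; the token list is made up for illustration and is not taken from the OMIM file:

```ruby
standalone_number = /\A[-+]?[0-9]*\.?[0-9]+\Z/

# Illustrative tokens: the first four are numbers the pattern accepts,
# the last three are rejected.
%w[42 -7 +3.14 .5 5. 1a Ruby].each do |token|
  verdict = token.match?(standalone_number) ? 'number' : 'not a number'
  puts "#{token.inspect} is #{verdict}"
end
```

Note that 5. is rejected because the pattern requires at least one digit at the end, while .5 is accepted because the leading digits are optional.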

Scenario 2: Don’t Treat Standalone Letters as Words

The next scenario we want to look at is the case when we have standalone letters in the document (e.g., A, b, c, D).

These can be matched with this regular expression: ^[a-zA-Z]$. The caret ^ means the beginning of the line, and the dollar sign $ means the end of the line.
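
A quick check of this pattern, sketched with made-up tokens. The second pattern, strict_letter, is an assumption added here for comparison and is not part of the original script; it uses the string anchors \A and \z instead of the line anchors:

```ruby
standalone_letter = /^[a-zA-Z]$/
# Hypothetical stricter variant using string anchors (not in the original).
strict_letter = /\A[a-zA-Z]\z/

%w[A b I Ab 7].each do |token|
  puts "#{token.inspect}: #{token.match?(standalone_letter)}"
end

# ^ and $ match at line boundaries, so a multi-line string whose first
# line is a single letter also matches the original pattern.
multi_line = "a\nbc"
puts multi_line.match?(standalone_letter) # true
puts multi_line.match?(strict_letter)     # false
```

Because the tokens produced by split never contain a newline, both patterns behave the same in this tutorial. The check also shows that the English words a and I would be filtered out, which may or may not be what you want.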

Scenario 3: Don’t Treat Email Addresses as Words

This may be a bit tricky, but, let’s take it step-by-step.

Let me introduce you to \w+, which matches one or more word characters. It is equivalent to [a-zA-Z0-9_]+, which matches any combination of letters, digits, and underscores. We need this since part of the email address could contain such a pattern.

The pattern above can be followed by any character. In regular expressions, the dot . means any character. Note that inside square brackets the + is a literal plus sign rather than a quantifier, so [\w+\-] matches a single word character, plus sign, or dash. Following it with .? allows one optional character of any kind, which gives us [\w+\-].?.

The entire portion of the regular expression that checks if we have an email address is:

\A([\w+\-].?)+@[a-z0-9\-]+(\.[a-z]+)*\.[a-z]+\Z
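
The pattern can be exercised in the same way. This is a sketch with invented addresses, chosen only to show what the expression accepts and rejects:

```ruby
email_address = /\A([\w+\-].?)+@[a-z0-9\-]+(\.[a-z]+)*\.[a-z]+\Z/

# Illustrative tokens, not taken from the OMIM file.
tokens = ['email@email.com', 'first.last@example.co.uk',
          'user@Example.com', '@example.com', 'Ruby']

tokens.each do |token|
  puts "#{token.inspect}: #{token.match?(email_address)}"
end
```

The first two tokens match. The address with an uppercase letter in the domain does not, because the domain part of the pattern only allows lowercase letters, digits, and dashes. If that matters for your data, adding the i (case-insensitive) flag to the regular expression is one possible refinement.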

Putting It All Together

Let’s now see how our Ruby script, including the above three scenarios, plays out:

number_of_words = 0
standalone_number = /\A[-+]?[0-9]*\.?[0-9]+\Z/
standalone_letter = /^[a-zA-Z]$/
email_address = /\A([\w+\-].?)+@[a-z0-9\-]+(\.[a-z]+)*\.[a-z]+\Z/
# Count only the tokens that match none of the three patterns.
File.foreach('omim.txt') do |line|
  number_of_words += line.split.count { |word|
    word !~ standalone_number && word !~ standalone_letter && word !~ email_address
  }
end
puts number_of_words

Running the script (it takes some time), the number of words we have is: 21636153

Did you notice the difference between counting the words generally and using our Ruby script? It’s a difference of 815,363 words! Wow!

Are the filters too aggressive, so that some legitimate words are being dropped? Refine the regular expressions above to make the script work for you and what you consider words.
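
As one example of such a refinement, consider the sample text from the introduction. With the three patterns above, the lone quotation mark still counts as a word, so the result is 2 rather than 1. The sketch below adds a fourth pattern for punctuation-only tokens. Both punctuation_only and the count_words helper are hypothetical names introduced here for illustration and are not part of the original script:

```ruby
standalone_number = /\A[-+]?[0-9]*\.?[0-9]+\Z/
standalone_letter = /^[a-zA-Z]$/
email_address     = /\A([\w+\-].?)+@[a-z0-9\-]+(\.[a-z]+)*\.[a-z]+\Z/
# Hypothetical extra filter: tokens made up entirely of punctuation,
# such as a lone quotation mark.
punctuation_only  = /\A[[:punct:]]+\z/

# Count the whitespace-delimited tokens that match none of the filters.
def count_words(text, filters)
  text.split.count { |token| filters.none? { |pattern| token.match?(pattern) } }
end

sample = 'Ruby 1 2 3 " email@email.com'
original_filters = [standalone_number, standalone_letter, email_address]

puts count_words(sample, original_filters)                      # prints 2
puts count_words(sample, original_filters + [punctuation_only]) # prints 1
```

Keeping the patterns in a list makes it easy to add or remove a filter without touching the counting logic.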

Good luck!

  • g

    For every regular expression that you claim appropriately filters your sample text, I can come up with a “word” that breaks it. The point is, this is an area where new and solid research is being done in the field of Natural Language Processing, and the correct answer is not to simply reduce word-filtering down to something as benign as regular expressions. Rather, the best way is to go toward more sophisticated techniques involving machine learning algorithms and training data.
