Counting Real Words with Ruby

Key Takeaways

Ruby’s flexibility allows for more nuanced word counting than typical word processors, enabling users to set their own criteria for what constitutes a word, such as excluding standalone numbers or letters, and email addresses.
The tutorial uses the OMIM® text file, a comprehensive compendium of human genes and genetic phenotypes, to demonstrate how to write a Ruby script that counts words based on user-defined criteria. The script uses regular expressions to specify what to consider as a word.
The difference between traditional word counting and using a Ruby script with specific criteria can be significant. In the tutorial, the difference amounted to 815,363 words, demonstrating the importance of refining the regular expressions to accurately reflect what the user considers as words.

Wait a minute, do you mean this blog post is just about counting words in a document? First, it is easy enough, and second, this can be done by any word processor on the fly.

You are right that it is about counting words. However, the aim of this blog post is to show how flexible Ruby can be in letting us decide what counts as a word. This is in contrast to the word processors we use, which are unlikely to let us count based on such criteria.

Let me clarify this point a bit further. When a word processor is counting words, it takes the white space as a delimiter. As a result, what comes after that will be considered a new word and included in the word count.

What if you have a number, a standalone letter, an email address, etc.? Do you consider those words? I don’t. Word processors will not give you the option to filter the counted words.

For instance, in a small experiment I did with Microsoft Word, when I entered the following text:


Ruby 1 2 3 " email@email.com

The word count was 6!
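
The same figure can be reproduced in Ruby with a plain whitespace split. This is only a rough stand-in for what a word processor does (Word's actual counting rules are an assumption here), but it shows where the 6 comes from:

```ruby
# The sample text from the Microsoft Word experiment above.
sample = 'Ruby 1 2 3 " email@email.com'

# String#split with no arguments splits on runs of whitespace.
tokens = sample.split

p tokens          # the six whitespace-separated tokens
puts tokens.size  # 6
```

Every token is counted, including the three numbers, the lone quotation mark, and the email address.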

Well, what do we mean by a word anyway? As defined on the Oxford Dictionaries website, a word is:

A single distinct meaningful element of speech or writing, used with others (or sometimes alone) to form a sentence and typically shown with a space on either side when written or printed.

Looking at this definition, the word count of the above text should evaluate to 1 and not 6. What should we do in this case? This is where the power and flexibility of Ruby come into play and save the day.

Let’s dive into the tutorial and see how we can tell Ruby what and what not to consider as words in practice.

OMIM® – Online Mendelian Inheritance in Man®

In the early 1960s, Dr. Victor A. McKusick initiated a database that served as a catalog of Mendelian traits and disorders. At the time, it was called the Mendelian Inheritance in Man (MIM). The online version, OMIM®, which is a comprehensive compendium of human genes and genetic phenotypes, is updated on a daily basis and is available for free. It was created in 1985 and made available on the internet starting in 1987. The OMIM text contains information on all known Mendelian disorders and over 15,000 genes.

Well, this valuable data is what we will be working with in this tutorial!

Get the OMIM® File

In this step, we will download the OMIM® text file, which can be obtained using the following steps:

Go to this anonymous ftp address: ftp://ftp.ncbi.nih.gov. A connection dialogue box should show up.

Choose Guest beside Connect as:, and then click the Connect button, after which you’ll see the FTP directory listing.

Counting the Number of Words

Having the text file we want to work with, let’s write a Ruby script that will return the number of words (the traditional way of counting). The script for performing this task can be written as follows:


number_of_words = 0
# File.foreach reads the file one line at a time and closes it afterwards
File.foreach('omim.txt') { |line| number_of_words += line.split.size }
puts number_of_words

You should get this large number: 22451516

Counting Only What You Consider Words

In this section, I will demonstrate some scenarios on how we can tell Ruby what to and not to consider as words when counting.

Scenario 1: Don’t Treat Standalone Numbers as Words

As mentioned earlier, Microsoft Word returned 6 as the word count for: Ruby 1 2 3 " email@email.com

It thus considered the numbers as words. Let’s fix that with Ruby. Regular Expressions are very handy for instructing Ruby on what we mean by standalone numbers. I discuss regular expressions a bit in one of my other blog posts: Hunting For The Gene Sequence.

Let’s take this step-by-step. The first thing we want to do is anchor the pattern to the start and the end of the string, so that the entire token, and not just part of it, must look like a number. In this case, we can use \A and \Z to refer to the start and the end of the string, respectively.

After that, we want to specify that a number may be preceded by a minus (-) or a plus sign (+). This can be written as [-+].

A nice symbol we can use in regular expressions is the question mark symbol ?. The question mark simply tells the regular expression to match zero or one of the preceding element. For instance, if we write: [-+]?, this means that the value can be preceded by either -, +, or nothing.

We would now like to tell the regular expression to accept zero or more digits. This can be written as [0-9]*. So, we have digits in the range 0-9, and the asterisk * means to match zero or more of the preceding element. Thus, if we don’t have any digit at this point, that’s fine. We can also have values such as 01, 6, 9, 54, 565346, and so on.

Since we can have floating point numbers, we may encounter a dot . in some values (e.g., 5.43). Adding \.? to the regular expression says that a . is optional, but would be taken into consideration should it appear (zero or one of the previous character). The backslash is needed because an unescaped dot matches any character.

Finally, since the . will be followed by at least one digit, we can use: [0-9]+. The + symbol here means one or more (but not zero) of the preceding element.

The final regular expression to check if we have a numeric value now looks as follows: \A[-+]?[0-9]*\.?[0-9]+\Z
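
To get a feel for what this expression accepts, it can be tried on a handful of tokens. The sample tokens below are made up for illustration and do not come from the OMIM file:

```ruby
standalone_number = /\A[-+]?[0-9]*\.?[0-9]+\Z/

# Hypothetical sample tokens, chosen to probe the edges of the pattern.
samples = %w[42 -7 +3.14 .5 5. 1a Ruby 1,000]

samples.each do |token|
  verdict = token.match?(standalone_number) ? 'number' : 'not a number'
  puts "#{token}: #{verdict}"
end
```

With this pattern, 42, -7, +3.14 and .5 are treated as numbers. The token 5. is not, because at least one digit must follow the dot, and 1,000 is not, because the comma is not part of the pattern.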

Scenario 2: Don’t Treat Standalone Letters as Words

The next scenario we want to look at is the case when we have standalone letters in the document (e.g., A, b, c, D).

This can be matched with the regular expression ^[a-zA-Z]$. The caret ^ means the beginning of the line, and the dollar sign $ means the end of the line.
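
A short sketch of how this pattern behaves, using made-up tokens. It also shows one thing worth knowing: in Ruby, ^ and $ anchor at line boundaries rather than string boundaries. For tokens produced by split this makes no difference, since they never contain newlines.

```ruby
standalone_letter = /^[a-zA-Z]$/

# Hypothetical sample tokens.
%w[A b Ab 7 I].each do |token|
  puts "#{token}: #{token.match?(standalone_letter)}"
end

# ^ and $ match at line boundaries, so a multi-line string that
# contains a single-letter line also matches.
multi_line = "word\nA"
puts multi_line.match?(standalone_letter)  # true
puts multi_line.match?(/\A[a-zA-Z]\z/)     # false
```

Note that the pattern also excludes I and A, which are legitimate English words on their own. Whether that is acceptable depends on what you want to count.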

Scenario 3: Don’t Treat Email Addresses as Words

This may be a bit tricky, but, let’s take it step-by-step.

Let me introduce you to \w+. \w+ tells us to match one or more word characters. It is equivalent to [a-zA-Z0-9_]+, which matches any combination of letters, numbers, and underscores. We need this since part of the email could contain such a pattern.

The pattern above can be followed by any character. In regular expressions, the dot . means any character. Inside a character class, + is a literal plus sign, so [\w+\-] matches one word character, plus sign, or dash. Following it with .? allows zero or one additional character of any kind, giving [\w+\-].?.

The entire portion of the regular expression that checks if we have an email address is:


\A([\w+\-].?)+@[a-z0-9\-]+(\.[a-z]+)*\.[a-z]+\Z
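
The expression can be checked against a few addresses before using it on the full file. The addresses below are invented for illustration:

```ruby
email_address = /\A([\w+\-].?)+@[a-z0-9\-]+(\.[a-z]+)*\.[a-z]+\Z/

# Hypothetical sample tokens.
samples = %w[
  email@email.com
  first.last+tag@example.co.uk
  Email@Email.com
  user@localhost
  not-an-email
]

samples.each do |token|
  puts "#{token}: #{token.match?(email_address)}"
end
```

Only the first two match. Email@Email.com is rejected because the domain part of the pattern allows lowercase letters only (adding the i flag to the expression is one possible refinement), and user@localhost is rejected because the pattern requires at least one dot after the @.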

Putting It All Together

Let’s now see how our Ruby script, including the above three scenarios, plays out:


# Patterns for tokens that should not be counted as words
standalone_number = /\A[-+]?[0-9]*\.?[0-9]+\Z/
standalone_letter = /^[a-zA-Z]$/
email_address = /\A([\w+\-].?)+@[a-z0-9\-]+(\.[a-z]+)*\.[a-z]+\Z/
number_of_words = 0
# Read the file line by line; count only tokens that match none of the patterns
File.foreach('omim.txt') do |line|
  number_of_words += line.split.count do |word|
    word !~ standalone_number && word !~ standalone_letter && word !~ email_address
  end
end
puts number_of_words

Running the script (it takes some time), the number of words we have is: 21636153

Did you notice the difference between counting the words generally and using our Ruby script? It’s a difference of 815,363 words! Wow!

Are the exceptions too much, meaning, are some legitimate words getting chopped? Refine the regular expressions above to make the script work for you and what you consider words.
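
One way to experiment with such refinements is to keep the patterns in a list and try them on a short string before running against the whole file. This is a minimal sketch, not part of the original script: the names patterns, count_real_words and no_letters are hypothetical.

```ruby
# Patterns from the three scenarios above; add or remove entries to taste.
patterns = [
  /\A[-+]?[0-9]*\.?[0-9]+\Z/,                        # standalone numbers
  /^[a-zA-Z]$/,                                      # standalone letters
  /\A([\w+\-].?)+@[a-z0-9\-]+(\.[a-z]+)*\.[a-z]+\Z/  # email addresses
]

# Hypothetical helper: counts the whitespace-separated tokens that
# match none of the excluded patterns.
def count_real_words(text, excluded)
  text.split.count { |token| excluded.none? { |pattern| token.match?(pattern) } }
end

sample = 'Ruby 1 2 3 " email@email.com'
puts count_real_words(sample, patterns)  # 2: Ruby and the lone quotation mark

# A possible refinement: also skip tokens that contain no letters at all.
no_letters = /\A[^a-zA-Z]+\z/
puts count_real_words(sample, patterns + [no_letters])  # 1
```

With only the three original patterns, the quotation mark from the Word experiment still counts as a word. The extra pattern is one way to bring the result down to the single real word.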

Good luck!

Frequently Asked Questions (FAQs) about Counting Real Words with Ruby

How does Ruby count words in a string?

Ruby has no dedicated word-counting method, so a common approach is to use split. With no arguments, split breaks a string into an array of substrings at runs of whitespace. The length or size method is then used to count the number of elements in the array, which corresponds to the number of words in the string. For example, "Hello, world!".split.size would return 2.

What is the difference between counting words and counting real words in Ruby?

Counting words in Ruby simply involves splitting a string into substrings based on spaces and counting the number of substrings. However, this method does not account for punctuation or special characters. Counting real words involves additional steps to remove or ignore punctuation and special characters, ensuring that only actual words are counted.

How can I count real words in a string that contains punctuation?

You can use the gsub method in Ruby to replace punctuation with spaces before splitting the string into words. For example, "Hello, world!".gsub(/[.,!?]/, ' ').split.size would return 2. This ensures that punctuation does not interfere with the word count.

How can I count words in a string ignoring case?

Case does not change how many words split returns, so "Hello, World!".downcase.split.size would return 2, the same as without downcase. Converting with the downcase or upcase method matters when you compare or tally words, so that Hello and hello are treated as the same word.

How can I count the frequency of each word in a string?

You can use the each_with_object method in Ruby to create a hash where the keys are the words in the string and the values are the frequencies of each word. For example, "Hello, world! Hello, Ruby!".downcase.split.each_with_object(Hash.new(0)) { |word, count| count[word] += 1 } would return {"hello,"=>2, "world!"=>1, "ruby!"=>1}. Note that the punctuation stays attached to each key because split only separates on whitespace.
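
If the punctuation should not be part of the keys, one option is to scan for runs of letters instead of splitting on whitespace. The sketch below assumes ASCII text and Ruby 2.7 or later (for tally):

```ruby
text = 'Hello, world! Hello, Ruby!'

# Splitting on whitespace keeps punctuation attached to each token.
raw = text.downcase.split.each_with_object(Hash.new(0)) do |word, count|
  count[word] += 1
end
p raw    # keys are "hello,", "world!" and "ruby!"

# Scanning for runs of letters drops the punctuation;
# tally then counts how often each word occurs.
clean = text.downcase.scan(/[a-z]+/).tally
p clean  # keys are "hello", "world" and "ruby"
```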

How can I count words in a string that contains special characters?

You can use the gsub method in Ruby to replace special characters with spaces before splitting the string into words. For example, "Hello@world!".gsub(/[@]/, ' ').split.size would return 2. This ensures that special characters do not interfere with the word count.

How can I count words in a string that contains numbers?

You can use the gsub method in Ruby to replace numbers with spaces before splitting the string into words. For example, "Hello1world!".gsub(/[0-9]/, ' ').split.size would return 2. This ensures that numbers do not interfere with the word count.

How can I count words in a string ignoring whitespace?

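Calling split with no arguments already treats any run of whitespace as a single delimiter and ignores leading and trailing whitespace, so extra spaces do not inflate the word count. For example, "Hello,   world!".split.size would return 2. The squeeze method is only needed when you split on an explicit single-space pattern, as in "Hello,   world!".squeeze(' ').split(/ /).size, which also returns 2.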

How can I count words in a string that contains hyphenated words?

You can use the split method in Ruby with a regular expression to split the string into words at spaces or hyphens. For example, "Hello-world!".split(/[\s-]/).size would return 2. This ensures that hyphenated words are counted as two separate words.

How can I count words in a string that contains contractions?

Because an apostrophe is not whitespace, the split method with no arguments already keeps a contraction together as a single word. For example, "I'm a Ruby programmer.".split.size would return 4. Avoid patterns that split on apostrophes: "I'm a Ruby programmer.".split(/[\s'](?=[a-z])/i) also splits at the apostrophe in I'm and returns 5 pieces instead of 4.
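
An alternative that keeps contractions intact while also dropping trailing punctuation is to scan for words rather than split on delimiters. The pattern below is an assumption for illustration and handles ASCII letters only:

```ruby
sentence = "I'm a Ruby programmer."

# Splitting on whitespace leaves the contraction in one piece,
# but the final token keeps its trailing period.
p sentence.split

# Assumed pattern: a run of letters, optionally followed by
# apostrophe-plus-letters groups (as in I'm or don't).
word_with_apostrophe = /[a-zA-Z]+(?:'[a-zA-Z]+)*/
words = sentence.scan(word_with_apostrophe)

p words          # the four words, without the trailing period
puts words.size  # 4
```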