Ruby on Medicine: Counting Word Frequency in a File
Welcome to a new article in SitePoint’s Ruby on Medicine series, where I show how the Ruby programming language can be applied to tasks related to the medical domain.
Newbie, moderate, and advanced Ruby programmers can benefit from this series, in addition to health/medical researchers and practitioners looking forward to learning the Ruby programming language, and apply it to their domain.
In this article, I’m going to show you how we can use Ruby to check the frequency of occurrence of words in some file. Counting the frequency of occurrence of words can come in handy in large text files, such as the OMIM^®^ – Online Mendelian Inheritance in Man^®^ file, which we worked with in the last article.. The reason counting the frequency of words could be beneficial is that going through the list of words and their frequencies will give us a more sense of what the document is about. It can also pinpoint any mistakes and misspellings, especially when we have a dictionary to compare against. When a word from the dictionary is not listed in your output, for instance, you can simply conclude that a misspelling has occurred to that word, or abbreviations were used rather than the words themselves.
In this article, the file we will be working with is the OMIM® – Online Mendelian Inheritance in Man® file. Let’s get started.
Get the OMIM® File
I talked about how to obtain the OMIM® file in my last article: Ruby on Medicine: Counting Real Words with Ruby, but I will repeat it here, just in case you don’t want to go back and forth between articles.
The OMIM^®^ can be obtained following those steps:
Go to this anonymous ftp address: ftp://ftp.ncbi.nih.gov. You should have a dialogue box show up that looks something like the following:
Choose Guest beside Connect as:, and then click the Connect button, in which case you’ll see the following directory:
The text file we need is the omim.txt.Z file (66.3 MB), which can be found in the /repository/OMIM/ARCHIVE directory.
Unzip the file, in which case you will get omim.txt (151.2 MB).
Hashes
Since we will be checking the words in a file, and counting the frequency of occurrence of each word, hashes will be very helpful.
But, what are Ruby hashes?
When I started this section, I mentioned that we will be checking words in a file, and the frequency of occurrence of each word. Hashes represent a collection of key-value pairs. In our case, this would look as follows:
"word" => frequency
Where word
is some word in the file, and frequency
is a number representing the number of occurrences of the word in that file.
Thus, a hash looks pretty much like an array but, instead of an index with integer values, the indices are keys, which can be of any type.
Creating a hash in Ruby is very simple and can be done as follows, which will create an empty hash:
Hash.new
If you want to create a hash with a default value, you can do the following:
Hash.new(0)
The above hash ha a default value of 0
. This means that, if you tried to access a key that is not present in a hash created with a default value, the default value is what will be returned. To clarify what I mean here, look at the example below, where the two puts
in the following script will return 0
:
frequency = Hash.new(0)
puts "#{frequency[0]}"
puts "#{frequency[653]}"
For more information about Ruby hashes, you can read this article and the docs.
Counting the Number of Occurrences
In this section, we will be building the Ruby program that will check the frequency of occurrence of each word in the file, and store the results in an output file.
The first thing we want to do is create a new hash, and assign it the default value 0
:
frequency = Hash.new(0)
We want to store the results in a file, so the Ruby script will write
the results to an output file. We can use File.open() with the w
(write) mode:
output_file = File.open('wordfrequency.txt', 'w')
The front-end of the operation is reading in the OMIM® file, which, in our case, is omim.txt. The statement for this operation will look as follows:
input_file = File.open('omim.txt', 'r')
As we are looking through the file for words, the Ruby function we need is scan(pattern). Ruby documentation mentions the following about this function:
scan iterates through *str*, matching the pattern (which may be a Regexp or a String). For each match, a result is generated and either added to the result array or passed to the block. If the pattern contains no groups, each individual result consists of the matched string, $&. If the pattern contains groups, each individual result is itself an array containing one entry per group.
When we talk about pattern matching, in this case matching the words in the file, regular expressions (regex) come in handy. I have talked about regular expression in another article: Ruby on Medicine: Hunting For The Gene Sequence.
The main regex we need in this tutorial is \b
, that is, the word boundary. As you may have guessed from its name, it helps us match the words.
See the following examples to see how the word boundary works:
\bhat\b
This regex matches hat
in red hat
.
If we take only one boundary as follows:
\bhat
This will match a word with hat
at the beginning, such is in hatrat
.
And, if we wanted to match hat
at the end, such as rathat
, we can do this using this regex:
hat\b
Here is script that will return the frequency of the occurrence of each word in the file together:
frequency = Hash.new(0)
input_file = File.open('omim.txt', 'r')
output_file = File.open('wordfrequency.txt', 'w')
input_file.read.downcase.scan(/\b[a-z]{3,16}\b/) {|word| frequency[word] = frequency[word] + 1}
frequency.keys.sort.each{|key| output_file.print key,' => ',frequency[key], "\n"}
exit
Before I show you the output, I just want to quickly clarify some points in the above script.
The downcase
function returns all the uppercase letters replaced with their lowercase counterparts. We use it here so word matches occur regardless of their case.
The regular expression /\b[a-z]{3,16}\b/
is basically telling us to return all the words which length (number of characters) is between 3 and 16. For instance, we may skip those words with no particular interest such as if, or, on. Words with length larger than 16 may be a gene sequence, which we will not consider a word.
Finally, sort
will sort the keys (i.e. words) listed in the output file alphabetically.
Output
The list of words and their frequency of occurrence in the OMIM® file, using the Ruby script above, will generate this output file: wordfrequency.txt (2.37 MB)
In the list, you may notice some words with very little frequency. In a large text file like OMIM^®^, what would that indicate? Are they words with no meaning? Misspelled words? What do you think?
The last two articles in this series covered how to use regular expressions to count specific words and all words, respectively. My hope is that these two examples give you tools to apply this technique in your own domain.