Code Safari: Linguistic Analysis with Lingua

Welcome to Code Safari. In this series, Xavier Shay guides you through the source code of popular gems and libraries to discover new techniques and idioms.

On another blog I write for, I was curious about calculating some statistics on the readability of my writing. How does it stack up on the Flesch-Kincaid scale? How long were my articles? How did I compare to my co-author? I love this sort of problem—not necessarily that interesting, but possibly achievable within a small enough time frame to make it worthwhile. Even without a useful output, it’s a perfect training problem to flex your programming muscle on.

There are two steps to the process: massaging the data into a suitable format for processing, then generating statistics from that data. I have tackled similar problems in the past, so I had a good idea of where to start. Building this kind of familiarity with the existing libraries of your language(s) is reason enough to try any crazy experiment that comes into your head.

I use nanoc to compile the blog, and my source data was in the form of Markdown files with a YAML preamble. Here is a sample document:

---
title: My Blog Post
created_at: 2011-04-01 10:00
---

Indubitably I am writing a blog!

    puts "This is code and shouldn't be included"

This is the conclusion of my fascinating blog.

Ruby has a great library for parsing Markdown (actually it has a few!) called kramdown. To use it, though, I first had to extract the metadata (title, created_at) and strip it from the document. This may not have been too hard, but why write your own parsing algorithm when someone else has done it for you? Nanoc must have some code somewhere to do this already…

I unpacked the nanoc source tree and went spelunking. Nanoc doesn’t use YAML for much, so I reasoned that it might be a good thing to search for.

$ gem unpack nanoc
$ cd nanoc3-3.1.3
$ ack YAML
lib/nanoc3/base/ordered_hash.rb
172:              YAML::quick_emit(object_id, opts) {|emitter|

lib/nanoc3/base/site.rb
369:        @config = DEFAULT_CONFIG.merge(YAML.load_file(config_path).symbolize_keys)

lib/nanoc3/cli/commands/create_site.rb
11:      # Converts the given array to YAML format

lib/nanoc3/data_sources/filesystem.rb
86:          meta                = (meta_filename && YAML.load_file(meta_filename)) || {}
233:        meta    = YAML.load_file(meta_filename) || {}
255:      meta    = YAML.load(pieces[2]) || {}

lib/nanoc3/data_sources/filesystem_unified.rb
85:          io.write(YAML.dump(meta).strip + "\n")

lib/nanoc3/data_sources/filesystem_verbose.rb
18:  # or the layout’s metadata, formatted as YAML.
61:      File.open(meta_filename,    'w') { |io| io.write(YAML.dump(attributes.stringify_keys)) }
$

YAML.load_file(meta_filename) seems like a good candidate to me, and the file name (data_sources/filesystem.rb) is even more promising. Cracking open the file we find a method that does exactly what we want. It’s a bit long with all the error checking, so I’ll only include an edited version here:

# nanoc/lib/nanoc3/data_sources/filesystem.rb

# Parses the file named `filename` and returns an array with its first
# element a hash with the file's metadata, and with its second element the
# file content itself.
def parse(content_filename, meta_filename, kind)
  data = File.read(content_filename)
  pieces = data.split(/^(-{5}|-{3})\s*$/)

  meta    = YAML.load(pieces[2]) || {}
  content = pieces[4..-1].join.strip

  [ meta, content ]
end

I copied this entire method verbatim into my script, with a reference to its source location in case I needed to go back to it. For a quick prototype script like this one, my goal is to get it working as quickly as possible. Any thoughts on code reuse or architecture can be suspended for the moment.
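
To see why pieces[2] and pieces[4..-1] are the right indexes, it can help to run the same split on the sample document in memory. What follows is a minimal sketch rather than nanoc's own code: split_front_matter is a hypothetical name, and it takes a string instead of a filename so that it needs nothing beyond Ruby's bundled YAML library.

```ruby
require 'yaml'

# Hypothetical helper (not part of nanoc): the same splitting logic as the
# edited parse method, applied to a string instead of a file on disk.
def split_front_matter(data)
  # The capture group makes split keep the separators, so the layout is:
  # ["", "---", <yaml text>, "---", <body text>]
  pieces = data.split(/^(-{5}|-{3})\s*$/)

  meta    = YAML.load(pieces[2]) || {}
  content = pieces[4..-1].join.strip

  [meta, content]
end

SAMPLE = <<~DOC
  ---
  title: My Blog Post
  created_at: 2011-04-01 10:00
  ---

  Indubitably I am writing a blog!

      puts "This is code and shouldn't be included"

  This is the conclusion of my fascinating blog.
DOC

meta, content = split_front_matter(SAMPLE)
puts meta['title']
puts content
```

Because the regular expression contains a capture group, split returns the matched separators along with the text between them. That is why the YAML sits at index 2 and the body starts at index 4.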

Using this method, we can take the first step in constructing our parser:

require 'kramdown'
files = Dir["content/articles/*.md"]

files.each do |file_name|
  meta, content = parse(file_name, nil, nil)
  doc = Kramdown::Document.new(content)
  puts doc.inspect
end

The inspect output, while dense, gives valuable clues as to the next step. We want to exclude any non-text elements (such as the code block) from our statistics.

<KD:Document: options={:template=>"", :auto_ids=>true, :auto_id_prefix=>"",
:parse_block_html=>false, :parse_span_html=>true, :html_to_native=>false,
:footnote_nr=>1, :coderay_wrap=>:div, :coderay_line_numbers=>:inline,
:coderay_line_number_start=>1, :coderay_tab_width=>8, :coderay_bold_every=>10,
:coderay_css=>:style, :entity_output=>:as_char, :toc_levels=>[1, 2, 3,
4, 5, 6], :line_width=>72, :latex_headers=>["section", "subsection", "subsubsection",
"paragraph", "subparagraph", "subparagraph"], :smart_quotes=>["lsquo",
"rsquo", "ldquo", "rdquo"]} root=<kd:root nil {:encoding=>#<Encoding:UTF-8>,
:abbrev_defs=>{}} [<kd:p nil [<kd:text "Indubitably I am writing a blog!"
nil>]>, <kd:blank "\n" nil>, <kd:codeblock "puts \"This is code and shouldn't
be included\"\n" nil>, <kd:blank "\n" nil>, <kd:p nil [<kd:text "This
is the conclusion of my fascinating blog." nil>]>]> warnings=[]>

We can see kramdown has created different types of nodes for the content, and the only ones we are interested in are kd:text. All nodes appear to be in a tree structure descended from kd:root, so a recursive filtering function should be sufficient to extract all of the text nodes. You can consult the kramdown documentation for the exact API of Document, but you can also get a long way just by guessing. root, type and children are common enough names for this type of tree structure, and kramdown is no exception.

def extract_text(elem)
  value = elem.type == :text ? [elem.value] : []
  value + elem.children.map {|x| extract_text(x) }.flatten
end

extract_text(doc.root).join(' ')
# => "Indubitably I am writing a blog! This is the conclusion of my fascinating blog."
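
If you want to experiment with the recursion without kramdown installed, the same function runs against any object that responds to type, value and children. The sketch below uses a hypothetical Node struct as a stand-in for kramdown's elements. The tree is hand-built for illustration and is not real kramdown output.

```ruby
# Hypothetical stand-in for a kramdown element: it only needs to respond
# to type, value and children for extract_text to work.
Node = Struct.new(:type, :value, :children)

# Collect the value of every :text node, depth first.
def extract_text(elem)
  value = elem.type == :text ? [elem.value] : []
  value + elem.children.flat_map { |child| extract_text(child) }
end

# A hand-built tree shaped like the inspect output above.
TREE = Node.new(:root, nil, [
  Node.new(:p, nil, [Node.new(:text, "Indubitably I am writing a blog!", [])]),
  Node.new(:blank, "\n", []),
  Node.new(:codeblock, "puts \"This is code\"\n", []),
  Node.new(:p, nil, [Node.new(:text, "This is the conclusion.", [])])
])

puts extract_text(TREE).join(' ')
```

The code block node carries a value too, but its type is :codeblock rather than :text, so the filter skips it.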

Excellent. Let’s move on to the analysis of the text.

Part Two

From a past project, I already knew about the Lingua library.

Lingua::EN::Readability is a Ruby module which calculates statistics on English text. It can supply counts of words, sentences and syllables. It can also calculate several readability measures, such as a Fog Index and a Flesch-Kincaid level.

It hails from a time before RubyGems, and suggests a tar.gz download to install. This isn’t so difficult, but ideally we would stay within our dependency system of choice. Many of these old projects have been forked by people who wanted to package them properly or make them work with the latest versions of Ruby. GitHub is the best place to find these forks.

Searching for ‘Lingua’ yields a few results, with the top one being a winner. It has a gemspec and some bug fixes on top of the original library. We can install it the same as all our other Ruby libraries.

gem install lingua

Usage is trivial, and completes our report:

require 'lingua'
require 'kramdown'
files = Dir["content/articles/*.md"]

def parse(content_filename, meta_filename, kind)
  # ... from above
end

def extract_text(elem)
  # ... from above
end

files.each do |file_name|
  meta, content = parse(file_name, nil, nil)
  doc = Kramdown::Document.new(content)
  text = extract_text(doc.root).join(" ")
  report = Lingua::EN::Readability.new(text)

  puts "%s: %.2f" % [meta['title'], report.kincaid]
end

# My Blog Post: 7.37

Readable by your average seventh grader. Not too shabby! This post itself scores 8.11, which I trust is accessible to the majority of the audience.
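
For a feel of where a number like that comes from, the standard Flesch-Kincaid grade level formula is 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59. The sketch below applies that formula with a deliberately naive syllable estimate that counts runs of vowels. It is not Lingua's implementation, the helper names are hypothetical, and its scores will not match Lingua's exactly.

```ruby
# Naive syllable estimate (an assumption for illustration): count runs of
# vowels, with a minimum of one syllable per word.
def estimate_syllables(word)
  [word.downcase.scan(/[aeiouy]+/).size, 1].max
end

# Flesch-Kincaid grade level:
#   0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
def flesch_kincaid_grade(text)
  sentences = text.split(/[.!?]+/).map(&:strip).reject(&:empty?)
  words     = text.scan(/[A-Za-z']+/)
  syllables = words.sum { |word| estimate_syllables(word) }

  0.39 * (words.size.to_f / sentences.size) +
    11.8 * (syllables.to_f / words.size) - 15.59
end

puts flesch_kincaid_grade("The cat sat. The dog ran.").round(2)
puts flesch_kincaid_grade("Indubitably I am writing a blog!").round(2)
```

Short sentences made of one-syllable words can score below zero, while long words push the grade up quickly, since the syllable term carries the larger weight.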

Wrapping Up

Attempting small, semi-practical projects like this one is a great way to learn about your programming ecosystem and improve your algorithmic chops. They are the programmer’s equivalent of a musician’s scales. Here are some extra problems you can try out for more practice:

extract_text strips out punctuation, meaning contractions come out incorrectly in the resulting text (“I’m” becomes “I m”). This is fine for this analysis, but how would you fix it for feeding into a text-to-speech converter? (Protip if you are on a Mac: try say hello at the command line.)
created_at is still a text value in the meta-data. Convert it to an appropriate Time format.
If you keep a blog yourself, try running the above analysis over it.
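
As a starting point for the created_at problem, here is one possible approach using Time.strptime from Ruby's standard library. It is a sketch that assumes every post uses the same "YYYY-MM-DD HH:MM" layout as the sample document, and parse_created_at is a hypothetical helper name.

```ruby
require 'time'

# Hypothetical helper: convert the created_at string from the YAML preamble
# into a Time, assuming the "YYYY-MM-DD HH:MM" layout of the sample document.
# The result is in the local time zone, since the string carries no offset.
def parse_created_at(value)
  Time.strptime(value, '%Y-%m-%d %H:%M')
end

created = parse_created_at('2011-04-01 10:00')
puts created.year
puts created.strftime('%H:%M')
```

An explicit format string is stricter than a general-purpose parser, which is useful here: a post with a malformed date fails loudly instead of being silently misread.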

Let us know how you go in the comments. Tune in next week for more exciting adventures in the code jungle.