Seeking Lovecraft, Part 1: An introduction to NLP and the Treat Gem

We just started Unknowable Horror LLC. Our plan is to sift through the vast ocean of bytes that is the internet in order to find the next H.P. Lovecraft so that we can make a fortune selling masterful cosmic horror.

Alternatively, if we can’t find the next H.P. Lovecraft, we might just write our own stories and keep running our program against them, tweaking until our writing cannot be distinguished from Lovecraft’s. Then make barrels of money.

In this project, we will do the following:

  • Part 1
    • Learn how to analyze natural language using the treat gem
  • Part 2
    • Visualize the differences between different authors and stories to see how we might fingerprint authors
    • Build a system that will determine whether a story was likely written by H.P. Lovecraft

A thinking machine? Detecting his work? If only he knew.

Following along

The data for this project is available at http://github.com/rlqualls/author-identification-tutorial.

Obtaining Lovecraft Stories

All non-Lovecraft stories used in this tutorial are available in the GitHub repository. They are in the public domain and were obtained from Project Gutenberg. They have been pre-formatted for the best results, stripped of special formatting that could throw off Treat’s text processors.

However, there is debate as to whether many of H.P. Lovecraft’s stories fall under copyright in the United States, so these have been left out, save for “The Shunned House.” You will need to get your own copies of the rest of his stories, but they are easily obtained online.

I recommend copying and pasting the HTML stories into text files, removing titles and any special formatting that may be present. Treat currently considers every newline character a new paragraph, so check that each paragraph sits on a single line. Of course, make sure that Lovecraft’s stories are in the public domain in your country first.
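If your copies come hard-wrapped, with one paragraph spread across several lines, a small cleanup script can join them back up. Here is a minimal sketch that assumes blank lines separate paragraphs in the raw file; the file names are placeholders:

# Join hard-wrapped lines so each paragraph sits on a single line.
# Assumes paragraphs in the raw file are separated by blank lines.
text = File.read('the_shunned_house_raw.txt')
paragraphs = text.split(/\n\s*\n/).map { |p| p.gsub(/\s*\n\s*/, ' ').strip }
File.write('the_shunned_house.txt', paragraphs.join("\n"))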

Natural Language Processing

Its voice was a kind of loathsome titter, and it could speak all languages

NLP is a combined field of artificial intelligence and linguistics concerned with enabling computers to understand natural language. It represents one of the most formidable barriers that computers face when navigating the human world. As such, any NLP task, including this one, is sure to be a nontrivial undertaking. Before we go any further, it’s important that we understand what is involved.

Take this string, for example:

"the world is indeed comic, but the joke is on mankind."

At the lowest level, a computer sees the string as an array of bytes. Next, it associates those bytes with some kind of character representation, like ASCII or UTF-8. For example, here is the string encoded in ASCII, written out in hexadecimal. You can use asciitohex to see for yourself.

74 68 65 20 77 6F 72 6C 64 20 69 73 20 69 6E 64 65 65 64 20 63 6F 6D 69
63 2C 20 62 75 74 20 74 68 65 20 6A 6F 6B 65 20 69 73 20 6F 6E 20 6D 61
6E 6B 69 6E 64 2E
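You can reproduce this dump in Ruby itself:

# Print each character's byte value in hexadecimal
str = "the world is indeed comic, but the joke is on mankind."
puts str.bytes.map { |b| format('%02X', b) }.join(' ')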

At this point, the computer sees characters, but not words. To get words out of the string, we need to tell the computer which symbols make up its words. This process is called tokenization. If we were to break the string into an array of tokens, they would look like this:

['the', 'world', 'is', 'indeed', 'comic', ',', 'but', 'the', 'joke', 'is',
 'on', 'mankind', '.']
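Treat ships with proper tokenizers, which we will use shortly, but for illustration, a naive tokenizer can be approximated in plain Ruby with a regular expression:

# A crude tokenizer: runs of word characters, or single punctuation marks
text = "the world is indeed comic, but the joke is on mankind."
tokens = text.scan(/[\w']+|[[:punct:]]/)
# => the same array shown above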

After tokenization, the computer knows which characters are words and which are punctuation, but it doesn’t know what any of them mean. This is where things go from being fairly simple to extremely complex.

If we were writing a programming language, this is the point where we would build a parser. We could come up with straightforward rules like “all statements end with a newline character” or “anything between the ‘do’ and ‘end’ tokens is a block.”

Now, building a parser can hardly be called trivial, but this is where we start to appreciate the difference between a programming language and a natural language.

Look at what problems we run into with the first two words of our sentence: “the world.” Which world? The whole world? A world referenced a few sentences ago? What if we’ve been talking about a particular world all along, but we’ve avoided the token “world” until now? Next, what does it mean for a world to be comic? Perhaps the author meant “a comic?” Also, “the joke” has not been mentioned previously, and now it appears to be physically on top of “mankind”.

At this point, our system sees “comic” and “joke”, determines the sentence to be humorous and lighthearted, and informs us that it has understood everything.

What does this mean for our project? Well, we can’t possibly hope to teach the computer to “get” Lovecraft. It’s tough to give computers the ability to understand every idiomatic aspect of language. But computers can do some things that we cannot do.

For example, it would not be a trivial task for a human to manually count the number of nouns, adjectives, verbs, and adverbs in a story. A computer, on the other hand, can perform such a feat with little effort. As we will see later, a simple count like that can tell us a lot about a document.

Finding hidden patterns in numbers has been the most successful way of approaching artificial intelligence to date, and our approach will be no different here.

Treat – The Ruby NLP Toolkit

According to the project page,

The Treat project aims to build a language- and algorithm-agnostic framework for Ruby with support for document retrieval, text chunking, segmentation and tokenization, natural language parsing, part-of-speech tagging, keyword extraction and named entity recognition.

Did you get all that?

The number of things it can do may be rather overwhelming, but we’re going to keep things simple for this tutorial.

Objectives

  • Successfully install the treat gem
  • Do some basic NLP
  • Learn about treat’s tree organization
  • Learn how to get some sweet metrics using treat

Installation

Installing treat can be tricky. To start off, we’ll need to install the treat gem with:

gem install treat --version 2.0.7

Treat breaks up the rest of its dependencies into language packages. Since we’ll be working in English for this tutorial, we’ll need to install the English language package.

Before that, however, make sure Java is installed and your $JAVA_HOME environment variable is set to the Java installation folder (the one with bin and include).
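A quick sanity check from irb before installing the language pack (the path shown is only an example):

# $JAVA_HOME should point at your JDK directory and contain bin/
ENV['JAVA_HOME']
# => e.g. "/usr/lib/jvm/java-8-openjdk" (nil means it is not set)
Dir.exist?(File.join(ENV['JAVA_HOME'].to_s, 'bin'))
# => true if it points at a real Java installation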

Next, open up an irb shell or write a Ruby script and run the following:

require 'treat'
Treat::Core::Installer.install 'english'

This could take a while, depending on your system. Watch to make sure it installs the stanford-core-nlp package.

If Java is not installed or $JAVA_HOME is not configured properly, that step will fail, but the installation will otherwise appear to succeed. The effect is that some of Treat’s functionality will error out when that package is needed.

If you are still having problems with this step, try installing the most up-to-date version of the gem, since the language pack installer could conceivably contain outdated code. However, that would mean that some aspects of this tutorial may not function as described.

Note: Some of treat’s functionality requires other binaries to be installed on your system, like Ocropus for reading images. You should not need to install anything else for this tutorial, but if you want to use all of treat, check The Treat Manual for more information.

Once treat is installed, we can start playing with it. Let’s look at some examples.

require 'treat'
include Treat::Core::DSL

'darkness'.category       # => "noun"
'abyss'.plural            # => "abysses"
'dreaming'.stem           # => "dream"
'think'.present_participle # => "thinking"
'towering'.synonyms       # => ["eminent", "lofty", "soaring", "towering"]
'perfection'.hypernyms    # => ["state", "ideal", "improvement"]

Navigating a Treat Tree

At one end of that tomb, its curious roots displacing the time-stained blocks of Pentelic marble, grows an unnaturally large olive tree of oddly repellent shape; so like to some grotesque man, or death-distorted body of a man, that the country folk fear to pass it at night when the moon shines faintly through the crooked boughs.

Let’s go ahead and process our first story. Treat provides us with the ‘document’ method.

story = document('collections/h_p_lovecraft/the_shunned_house.txt')

The story is now in a document object, but it has not been processed. If you try story.paragraphs or story.sentences, you will get an empty array. This is because, although the document tree exists, it doesn’t have nodes. Treat provides different textual processors for creating nodes.

  • Chunkers – break a document into sections and paragraphs
  • Segmenters – break paragraphs and sections into sentences and titles
  • Tokenizers – break sentences and titles into words

For example, if we wanted to break our story into paragraphs, we would use a chunker like this:

# Run the chunking processor
story.chunk

# Now we can access an array of all of the story's paragraphs
story.paragraphs

# Or the text of any paragraph
story.paragraphs.first.to_s
story.paragraphs[3].to_s

TIP

It’s important that the chunker does its job correctly because every other processor depends on its results. If you chunk a document named story, and story.paragraphs[0].to_s does not stop on a sentence boundary (period), it is possible that sentence and phrase nodes will not contain the correct text.

If you are having trouble getting correct results, double check that newline characters only delineate paragraphs and nothing else. An easy way to make this check is to simply enable line numbers in your editor.
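As a rough sanity check, you can scan the chunked paragraphs for any that don’t end on sentence punctuation. This is only a heuristic sketch; dialogue and quoted endings can trip it up:

# Flag paragraphs that don't end with ., !, ?, or a closing quote
story.paragraphs.each_with_index do |para, i|
  text = para.to_s.strip
  puts "Paragraph #{i} may be mis-chunked" unless text =~ /[.!?"']\z/
end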

Now that our story is chunked into paragraphs, we can segment the first paragraph into sentences. Note that since we are selectively segmenting the first paragraph, story.sentences will only return sentences in the first paragraph for now.

# Run the segmenter on the first paragraph
story.paragraphs.first.segment

#=> An array of sentences in the first paragraph
story.paragraphs.first.sentences
story.sentences

Using Apply

We can also use the apply method to fill the tree. This time, in addition to chunking and segmenting, we’ll tokenize the document into words.

# Instantiate a fresh document (we can't re-process documents)
story = document '/path/to/story'

# Run the chunker, segmenter, and tokenizer in succession.
story.apply(:chunk, :segment, :tokenize)

#=> An array containing all of the story's Word objects
story.words

#=> The first Word object in the first paragraph
story.paragraphs.first.words.first

#=> The number of words in the first sentence of the first paragraph
story.paragraphs[0].sentences[0].word_count

TIP

You will need to use to_s in order to get string values from nodes.

# This gets an object, not a string
story.paragraphs.first

# This gets the actual text of the entire paragraph
story.paragraphs.first.to_s
# This will print nothing
puts story.words.first

# This prints the first word
puts story.words.first.to_s

For Lovecraft’s The Shunned House, story.sentences[8].to_s returns:

"The house was -- and for that matter still is -- of a kind to attract the attention of the curious."

Adding Parts of Speech Nodes

We have access to paragraphs, sentences, and words so far. But wouldn’t it be nice to know how many nouns, verbs, and adjectives are in each paragraph, for example? We could write a helper function that does something like this:

node_nouns = node.words.select { |word| word.category == "noun" }

However, this isn’t necessary. By feeding :category to #apply, we can decorate the tree with parts-of-speech nodes (Note: this assumes we previously applied :chunk, :segment, and :tokenize).

story.apply(:category)

Now it’s possible to do things like the following:

# Get an array of all Word objects that are nouns
story.nouns

# Get an array of the lengths of all the verbs
story.verbs.map { |v| v.to_s.length }

# Get the number of conjunctions used
story.conjunction_count
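Since word.category is available on every Word object, we can also tally parts of speech paragraph by paragraph. A small sketch:

# Tally nouns and adjectives in each paragraph using word.category
story.paragraphs.each_with_index do |para, i|
  nouns = para.words.count { |w| w.category == "noun" }
  adjectives = para.words.count { |w| w.category == "adjective" }
  puts "Paragraph #{i}: #{nouns} nouns, #{adjectives} adjectives"
end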

More Than One Story – Collections

Treat collections serve two purposes. First, they make it possible to access nodes from different documents via the same tree. Second, they allow us to access documents in a directory without knowing their file names. All you need to do is pass a directory of stories to Kernel#collection (assuming the treat DSL has already been included).

stories = collection 'collections/edgar_a_poe'

# Process the entire collection
stories.apply(:chunk, :segment, :tokenize, :category)

# Gets the proportion of nouns across all the stories, to two decimal places
(stories.noun_count.to_f / stories.word_count).round(2)

# Prints out the path of every story in the collection
stories.each_document do |story|
  puts story.file
end

Metrics – Word Popularity

Various metrics can be used in an attempt to fingerprint authors. We might use the average number of sentences per paragraph, for example.

Unfortunately, with simple metrics, we run the risk of confusing two authors who write in similar styles. Lovecraft was influenced by Nathaniel Hawthorne and Edgar Allan Poe, and some of his simple metrics show this. To avoid confusing similar authors, we need to compare the actual content of authors’ stories.

A content metric that is easily available to us is word popularity. If a story’s top 100 words are mostly within another story’s bottom 100 words, then those stories are probably not very similar.

Treat provides Countable#frequency_of, which returns the number of times a word occurs. If we create a hash where every key is a word and every value is the number of times that word appears, we can sort it to get the word popularity.

# Start off with an empty hash object
word_hash = {}

# Assign the word's frequency to its key in word_hash
# Note: frequency_of downcases words internally, but iterating over
# a uniq, downcased array avoids visiting the same word twice
downcased_words = story.words.map { |word| word.to_s.downcase }.uniq
downcased_words.each do |w|
  word_hash[w] = story.frequency_of(w)
end

# Create an array of [word, count] pairs, sorted greatest-to-least
word_popularity = word_hash.sort_by { |_word, count| count }.reverse

This will get the word popularity of all words in the story, but we might not actually be interested in all of them. Words like “the” or “and” are not specific to any type of story, and they take up a lot of space in the upper rankings. Since adjectives help set the tone of a story, we can get a lot of information by ranking them alone.

adjective_popularity = word_popularity.select do |word, _count|
  word.category == "adjective"
end
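Alternatively, we could filter common stop words out of the full ranking before inspecting it. Here is a sketch; the stop-word list is illustrative, not exhaustive:

# Drop very common function words from the ranking
STOP_WORDS = %w[the and of to a in that it was his with as at on i]
filtered_popularity = word_popularity.reject { |word, _count| STOP_WORDS.include?(word) }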

Here are the top adjectives that I got from The Shunned House:

[["one", 35], ["more", 30], ["other", 19], ["two", 17], ["most", 16],
["old", 15], ["new", 14], ["many", 14], ["first", 12], ["great", 12],
["certain", 11], ["last", 11], ["strange", 10], ["such", 10],
["french", 9], ["same", 7], ["white", 7], ["next", 7], ["hideous", 7],
["light", 7], ["few", 6], ["ancient", 6], ["own", 6], ["sinister", 5],
["broken", 5], ["proper", 5], ["evil", 5], ["thin", 5], ["horrible", 5],
["terrible", 5], ["peculiar", 5]]

It’s obvious from looking at these words that the piece is most likely a work of horror, with words like “strange” and “sinister” in the upper rankings.

Quick Analysis – Noun Percentages

Before we wrap up this introduction to treat, let’s see if we can find any distinguishing information between a news article and a work of fiction. We will compare the noun percentages of the collections. There is a small script in the scripts folder called noun_percentages.rb.

#!/usr/bin/env ruby

require 'pathname'
require 'treat'
include Treat::Core::DSL

def process_collection(path)
  puts "Author: #{Pathname.new(path).basename}"
  paths = Dir.glob(path + "/*")
  paths.each do |story_path|
    story = document story_path
    story.apply(:chunk, :segment, :tokenize, :category)
    noun_percentage = (story.noun_count / story.word_count.to_f).round(2)
    puts "#{Pathname.new(story_path).basename}: #{noun_percentage}"
  end
  puts ""
end

process_collection 'collections/edgar_a_poe'
process_collection 'collections/nathaniel_hawthorne'
process_collection 'collections/h_p_lovecraft'
process_collection 'collections/philip_k_dick'
process_collection 'collections/news'

This script goes inside each author folder, processes each story, and prints the percentage of words that are nouns in that story. Run it with:

ruby scripts/noun_percentages.rb

You should get results like this:

Author: edgar_a_poe
the_masque_of_red_death.txt: 0.2
the_fall_of_the_house_of_usher.txt: 0.2
the_pit_and_the_pendulum.txt: 0.19
the_black_cat.txt: 0.2
the_tell_tale_heart.txt: 0.16
the_premature_burial.txt: 0.19

Author: nathaniel_hawthorne
the_man_of_adamant.txt: 0.21
the_maypole_of_merry_mount.txt: 0.24
the_birth_mark.txt: 0.22
young_goodman_brown.txt: 0.24
the_minister's_black_veil.txt: 0.22

Author: h_p_lovecraft
the_shunned_house.txt: 0.24

Author: philip_k_dick
beyond_the_door.txt: 0.19
beyond_lies_the_wub.txt: 0.22
second_variety.txt: 0.24
the_variable_man.txt: 0.27

Author: news
obama_egypt.txt: 0.34
lab_mice.txt: 0.3
cambodian_vote.txt: 0.41
syria_war.txt: 0.37

Notice anything? The news articles have significantly higher noun percentages than the works of fiction. This is because fiction tends to be highly descriptive, with a higher number of stylistic words taking up the word space. News, on the other hand, is usually written to be quick and easy to read, so superfluous words are avoided.

It appears that, while they might tell us whether a story is likely news, noun percentages are probably not enough to fingerprint Lovecraft or any author. In Part 2, we will build a more detailed analysis system which will help us get to the bottom of what makes a story truly Lovecraftian.
