Awesome Autocomplete: Trigram Search in Rails and PostgreSQL

Vinoth
Share

Screenshot 2015-11-07 15.28.34

PostgreSQL mostly known as Postgres, is one of the most mature and robust databases out there. It’s a multi-platform, open source and the second most used open source DBMS.

Today we will see about how to implement a basic autocomplete search using a trigram in Postgres and Rails. The tutorial is split into three parts:

  1. What is a trigram?
  2. Trigram in Postgres
  3. Implementing trigram in a sample rails app

What is a Trigram?

A trigram is nothing but an n-gram with three letter sequence. So, what is an n-gram? From wikipedia,

In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech.

Well, what does that mean, exactly? It means finding matching words by maintaining variable sequences of characters in a word.

In a trigram the variable sequence in n-gram is 3. For finding the similarity between two words, wordA and wordB, wordA is split into three letter sequences and compared with the three letter sequence combinations computed from wordB. The comparison aims to find the number of shared sets between the two words. The more number of sequence matches means the high similarity between the words. This becomes very useful in terms of autocompletion.

Each word is treated with two spaces prefixed and one space suffixed to balance the number of trigrams for a n-character word. That’s confusing, so let’s have an example.

Assume we have a word group that consists of three words [google, toddle, beagle] and the search term is googlr. We need to find the best matching word from the batch for the search term. First the batch words are split with three letter groups:

google - g, go, goo, oog, ogl, gle, le
toddle - t, to, tod, odd, ddl, dle, le
beagle - b, be, bea, eag, agl, gle, le

The trigram groups of three letters will be calculated for the search term and compared to the words in for the batch for the sequences they share:

g, go, goo, oog, ogl, glr, lr

google - 5
toddle - 0
beagle - 0

The similarity is calculated using the number of trigrams they share, which in our case is quite trivial. So google will top the similarity index.

For the second use case, let’s say the search term is just gle. The trigram is:

g, gl, gle, le

Matches -
    google - 3
    toddle - 1
    beagle - 2

For this, google will top the result followed by beagle. Once you are comfortable with this concept, we can move on to how trigrams are implemented in Postgres.

Trigram in PostgreSQL

Postgres supports trigrams through an extension called pg_trgm which is an officially supported extension. It’s worth noting that pgtrgm ignores special characters while calculating trigram.

The following list consists of the features that the extension comes with, which helps in doing trigram searches:

  • similarity(text1, text2) – Calculates the similarity index between text1 and text2 in the scale of 0 to 1, with 0 being the least similar.

  • show_trgm(text)– Lists the possible trigrams that could be calculated from the given text, like we did above.

  • show_limit() – The set filter used by the % method (look below). Similarity indices above this limit are only shown while performing a trigram search. The default limit is 0.3.

  • set_limit(real) – Sets the limit to be used by % method.

  • text1 % text2 – Returns true if the similarity between text1 and text2 is above the set limit.

  • text1 <-> text2 – Distance operator, which is the inverse of similarity. Returns the distance between text1 and text2.

  • gist\_trgm\_ops and gin\_trgm\_ops– Builds the GiST or GIN index, respectively, over a text column for faster similarity search.

Let’s get started with implementing a trigram search in a Rails app.

Implementing a Trigram in Rails

Our sample app is going to be very simple with only one model, Post, which has two columns title and content. Let’s quickly create the app, model, and controller using the commands below. I am using Rails 4.2:

rails new app_name -d postgresql
cd app_name
rails generate model post title content
rake db:create && rake db:migrate
rails generate controller posts index

Seed the database with some fake data. I’m using the Faker gem. Below is the seed file:

(0..100).each do |p|
  Post.create(title: Faker::Commerce.product_name, content: Faker::Company.catch_phrase)
  puts "Created #{p}"
end

Let’s also add some basic content to the controller:

def index
  if params[:q].present?
    @posts = Post.where(title: params[:q])
  else
    @posts = Post.all
  end
end

In the app/views/post/index.html.erb file, add the below lines, which includes a basic search box along with the list of all the posts:

<form method="GET" action="/">
  <input placeholder="Search" id="search" type="text" name="q" />
  <input type="submit">
</form>

<table>
  <tr>
    <th>Title</th>
    <th>Content</th>
  </tr>
  <% @posts.each do |post| %>
    <tr>
      <td><%= post.title %></td>
      <td><%= post.content %></td>
    </tr>
  <% end %>
</table>

We now have a basic application with a single model, 100 rows of posts, an index page, and a search option that matches only the full title of the post. Let’s plug a trigram search into it.

Install the pg_trgm Extension

As mentioned before, Postgres provides trigram functionality via the pg_trgm extension . Install it in the app using a migration instead of doing it directly in the psql console. Create a migration using the below command:

rails generate migration AddPgTrgmExtensionToDB

Add the following to the migration:

execute "create extension pg_trgm;"

Run the migration. This will install the pg_trgm extension in Postgres.

Add a Search Index

When we’re at it, let’s also add an index to the column that we are going to search. GiST (Generalized Search Tree) and GIN (Generalized Inverted Index) are two kinds of indices in Postgres. Adding an index is not mandatory, but desirable to speed up queries. At this point, I really can’t recommend GiST or GIN since I’ve had varying performance differences between them in the past. Primary differences between the two indexes and how to choose one can be found here. Add whichever works for you best.

Create a migration and add the below line to it:

add_index :posts, :title

Run the migration and that’s it! We’re all set on the database side. Quickly add the search query to make use of the trigram similarity.

Search Method

To add the search option, add a method in our Post model:

class Post < ActiveRecord::Base
  def self.text_search(query)
    self.where("similarity(title, ?) > 0.3", query).order("similarity(title, #{ActiveRecord::Base.connection.quote(query)}) DESC")
  end
end

Let’s also replace the search line in our controller from

@posts = Post.where(title: params[:q])

to

@posts = Post.text_search(params[:q])

That’s it. Start the server, then go and search with typos and see the magic of similar words showing up.

app

In the text_search method in post.rb, the threshold score is set to 0.3. Feel free to tune this to meet your requirements. The more the threshold, the less results and the stricter the searches.

One way of improving the speed of the search is by having a separate column that will hold all the trigram sequences of the title column. Then, we can perform the search against the pre-populated column. Or we can make use of ts_vector, but that would become useless with fuzzy words.

Conclusion

There is a gem called pg_search that provides the trigram search functionality out of the box, but for some reason the trigram search from this gem is slower for me than the raw SQL. I may cover this gem in the future.

All the code used for sample in this is hosted on GitHub. Feel free to fork an play with it.

Thank you for reading. I hope you find this useful in your Rails application development.

CSS Master, 3rd Edition