Awesome Autocomplete: Trigram Search in Rails and PostgreSQL
PostgreSQL mostly known as Postgres, is one of the most mature and robust databases out there. It’s a multi-platform, open source and the second most used open source DBMS.
Today we will see about how to implement a basic autocomplete search using a trigram in Postgres and Rails. The tutorial is split into three parts:
- What is a trigram?
- Trigram in Postgres
- Implementing trigram in a sample rails app
What is a Trigram?
A trigram is nothing but an n-gram with three letter sequence. So, what is an n-gram? From wikipedia,
In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech.
Well, what does that mean, exactly? It means finding matching words by maintaining variable sequences of characters in a word.
In a trigram the variable sequence in n-gram is 3. For finding the similarity between two words, wordA and wordB, wordA is split into three letter sequences and compared with the three letter sequence combinations computed from wordB. The comparison aims to find the number of shared sets between the two words. The more number of sequence matches means the high similarity between the words. This becomes very useful in terms of autocompletion.
Each word is treated with two spaces prefixed and one space suffixed to balance the number of trigrams for a n-character word. That’s confusing, so let’s have an example.
Assume we have a word group that consists of three words [google, toddle, beagle]
and the search term is googlr. We need to find the best matching word from the batch for the search term. First the batch words are split with three letter groups:
google - g, go, goo, oog, ogl, gle, le
toddle - t, to, tod, odd, ddl, dle, le
beagle - b, be, bea, eag, agl, gle, le
The trigram groups of three letters will be calculated for the search term and compared to the words in for the batch for the sequences they share:
g, go, goo, oog, ogl, glr, lr
google - 5
toddle - 0
beagle - 0
The similarity is calculated using the number of trigrams they share, which in our case is quite trivial. So google
will top the similarity index.
For the second use case, let’s say the search term is just gle
. The trigram is:
g, gl, gle, le
Matches -
google - 3
toddle - 1
beagle - 2
For this, google
will top the result followed by beagle
. Once you are comfortable with this concept, we can move on to how trigrams are implemented in Postgres.
Trigram in PostgreSQL
Postgres supports trigrams through an extension called pg_trgm which is an officially supported extension. It’s worth noting that pgtrgm ignores special characters while calculating trigram.
The following list consists of the features that the extension comes with, which helps in doing trigram searches:
-
similarity(text1, text2)
– Calculates the similarity index betweentext1
andtext2
in the scale of 0 to 1, with 0 being the least similar. -
show_trgm(text)
– Lists the possible trigrams that could be calculated from the giventext
, like we did above. -
show_limit()
– The set filter used by the%
method (look below). Similarity indices above this limit are only shown while performing a trigram search. The default limit is 0.3. -
set_limit(real)
– Sets the limit to be used by%
method. -
text1 % text2
– Returns true if the similarity betweentext1
andtext2
is above the set limit. -
text1 <-> text2
– Distance operator, which is the inverse of similarity. Returns the distance betweentext1
andtext2
. -
gist\_trgm\_ops
andgin\_trgm\_ops
– Builds the GiST or GIN index, respectively, over a text column for faster similarity search.
Let’s get started with implementing a trigram search in a Rails app.
Implementing a Trigram in Rails
Our sample app is going to be very simple with only one model, Post
, which has two columns title
and content
. Let’s quickly create the app, model, and controller using the commands below. I am using Rails 4.2:
rails new app_name -d postgresql
cd app_name
rails generate model post title content
rake db:create && rake db:migrate
rails generate controller posts index
Seed the database with some fake data. I’m using the Faker gem. Below is the seed file:
(0..100).each do |p|
Post.create(title: Faker::Commerce.product_name, content: Faker::Company.catch_phrase)
puts "Created #{p}"
end
Let’s also add some basic content to the controller:
def index
if params[:q].present?
@posts = Post.where(title: params[:q])
else
@posts = Post.all
end
end
In the app/views/post/index.html.erb file, add the below lines, which includes a basic search box along with the list of all the posts:
<form method="GET" action="/">
<input placeholder="Search" id="search" type="text" name="q" />
<input type="submit">
</form>
<table>
<tr>
<th>Title</th>
<th>Content</th>
</tr>
<% @posts.each do |post| %>
<tr>
<td><%= post.title %></td>
<td><%= post.content %></td>
</tr>
<% end %>
</table>
We now have a basic application with a single model, 100 rows of posts, an index page, and a search option that matches only the full title of the post. Let’s plug a trigram search into it.
Install the pg_trgm Extension
As mentioned before, Postgres provides trigram functionality via the pg_trgm
extension . Install it in the app using a migration instead of doing it directly in the psql console. Create a migration using the below command:
rails generate migration AddPgTrgmExtensionToDB
Add the following to the migration:
execute "create extension pg_trgm;"
Run the migration. This will install the pg_trgm extension in Postgres.
Add a Search Index
When we’re at it, let’s also add an index to the column that we are going to search. GiST (Generalized Search Tree) and GIN (Generalized Inverted Index) are two kinds of indices in Postgres. Adding an index is not mandatory, but desirable to speed up queries. At this point, I really can’t recommend GiST or GIN since I’ve had varying performance differences between them in the past. Primary differences between the two indexes and how to choose one can be found here. Add whichever works for you best.
Create a migration and add the below line to it:
add_index :posts, :title
Run the migration and that’s it! We’re all set on the database side. Quickly add the search query to make use of the trigram similarity.
Search Method
To add the search option, add a method in our Post
model:
class Post < ActiveRecord::Base
def self.text_search(query)
self.where("similarity(title, ?) > 0.3", query).order("similarity(title, #{ActiveRecord::Base.connection.quote(query)}) DESC")
end
end
Let’s also replace the search line in our controller from
@posts = Post.where(title: params[:q])
to
@posts = Post.text_search(params[:q])
That’s it. Start the server, then go and search with typos and see the magic of similar words showing up.
In the text_search
method in post.rb, the threshold score is set to 0.3
. Feel free to tune this to meet your requirements. The more the threshold, the less results and the stricter the searches.
One way of improving the speed of the search is by having a separate column that will hold all the trigram sequences of the title column. Then, we can perform the search against the pre-populated column. Or we can make use of ts_vector
, but that would become useless with fuzzy words.
Conclusion
There is a gem called pg_search that provides the trigram search functionality out of the box, but for some reason the trigram search from this gem is slower for me than the raw SQL. I may cover this gem in the future.
All the code used for sample in this is hosted on GitHub. Feel free to fork an play with it.
Thank you for reading. I hope you find this useful in your Rails application development.