Flexible Searching with Solr and Sunspot

Just about every type of datastore has some form of indexing. A typical relational database, such as MySQL or PostgreSQL, can index fields for efficient querying. Most document databases, like MongoDB, contain indexing as well. Indexing in a relational database is almost always done for one reason: speed. However, sometimes you need more than just speed, you need flexibility. That’s where Solr comes in.

In this article, I want to outline how Solr can benefit your project’s indexing capabilities. I’ll start by introducing indexing and expand to show how Solr can be used within a Rails application.

What is an Index?

Indexing data is a very old concept. It far predates relational databases and computers entirely. Index cards have been used in a wide variety of situations, especially in library catalogs. Librarians index their books using a number of techniques, one of which is alphabetization. A simple index could be contrived by listing all the books that begin with A, then all the books that begin with B, and so on. When searching for a book, say “A Tale of Two Cities”, you look at the first letter of the title, “A”, and then jump directly to the “A” section in your library index.

The purpose of indexing, no matter where it’s applied, always pertains to organizing data so it can be extracted quickly. You can imagine organizing the library catalog by genres, in which case “A Tale of Two Cities” might fall into “Fiction”. The librarian would then jump directly to the “Fiction” section.

Relational Database Indexing

All production-ready relational database systems contain indices. Frequently you want to index by a foreign key so that querying for that foreign key can be done efficiently. This can dramatically increase performance when performing joins, for instance. Both MySQL and PostgreSQL support “full-text indexing” which allows you to query against a large block of text for bits and pieces contained therein.

If you’re just looking to have a simple search box on your site, full-text indexing using your already-existing relational database might be the way to go. It has two major advantages: you’re working within the same tool and your indices are always up-to-date. It has one major drawback: it’s not flexible enough to handle “outside the box” indexing situations.

Solr – When Your Relational Database Isn’t Enough

If you stick to a relational database for all your searching needs, you’ll often find yourself creating awkward and inefficient queries. This is a good sign that you’ve reached the limits of what the database can provide. That’s where Solr comes in. It’s designed to augment your existing relational database and provide additional means of querying the data.

Solr is a very mature technology, originally created in 2004 and used by Big Dogs like Netflix and the Internet Archive. Built on Lucene, Solr provides you with a different way to define your indices. Effectively, Solr helps redefine your relational data into a more document-oriented structure efficient for querying.

Solr fits into a Rails application as a separate service: the app sends index updates and search queries to Solr over HTTP, while the relational database remains the canonical data store.

Sunspot – The Ruby/Solr Love Child

Sunspot is the king of integration between Ruby and Solr. It provides a clean DSL (Domain Specific Language) to define how you want your relational data indexed with Solr. Let’s contrive an example.

Say we have a Product model with the following attributes: a name (string), a color (string), and a used flag (boolean).

We have a normal ActiveRecord model to define our database with Ruby:
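A minimal version might look like this (the specific validations here are illustrative assumptions, not the article’s exact listing):

```ruby
class Product < ActiveRecord::Base
  # Assumed validations for the example
  validates :name,  :presence => true
  validates :color, :presence => true
end
```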

If we wanted to search for products with a name containing “cities” and expect to see “A Tale of Two Cities”, we might construct the following query:
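With plain ActiveRecord, that query might look like:

```ruby
# Case-sensitivity here depends on your database's collation
Product.where("name LIKE ?", "%cities%")
```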

This isn’t nearly flexible enough in most cases. What if we wanted a search for “cities” to also include “New York: A Big City”? This technique wouldn’t provide us with the results we desire. Let’s introduce Sunspot.

To get started with Sunspot, put it in your Gemfile:
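The standard pairing is the `sunspot_rails` and `sunspot_solr` gems:

```ruby
gem 'sunspot_rails'
gem 'sunspot_solr'  # bundles a local copy of Solr for development
```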

Grab the gems:
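```shell
bundle install
```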

Generate a Sunspot configuration file so your Rails app knows where to find the server:
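The `sunspot_rails` gem provides a generator for this; it creates a `config/sunspot.yml` file:

```shell
rails generate sunspot_rails:install
```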

By including the ‘sunspot_solr’ gem in your Gemfile, Sunspot will provide you with a copy of Solr. Start the Solr server by running:
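```shell
bundle exec rake sunspot:solr:start
```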

At this point, you should be up and running with Solr. Let’s redefine our Product model. I won’t include the validations again; just pretend they still exist.
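A sketch of the `searchable` definition, using the field names from our Product model:

```ruby
class Product < ActiveRecord::Base
  searchable do
    text :name, :boost => 2.0  # matches on name are twice as relevant
    text :color
    boolean :used
  end
end
```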

What we’ve done in the above model is tell Solr how we want it to index our Product model. We’ve told it to treat the name and color fields as text and the used field as a boolean. Defining something as “text” in Sunspot means that it’s full-text searchable. In other words, when you search against the name and color fields, Solr will find partial matches (e.g. a search for “bread” will return “bread and butter”). We’ve also given the name field twice the weight (a boost of 2.0) of the color field: if a search matches on both name and color, the results matching the name field will rank as more relevant.

Searching for data in a Solr index usually happens at the controller level. Let’s take a look at how we can query Solr for products matching “cities” that are not used:
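Using Sunspot’s search DSL, that controller code might look like:

```ruby
# In a controller action: full-text search for "cities",
# restricted to products that are not used
@search = Product.search do
  fulltext 'cities'
  with :used, false
end
@products = @search.results
```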

This search would return new copies of “A Tale of Two Cities” and any other books with “cities”. The search is case-insensitive by default. Searches can also contain ranges, date comparisons, set includes, greater than and less than queries, and more.

As you can see from the model and search definition, Sunspot provides us with a clean and readable DSL. This is one of my favorite aspects of the library.

Digging Deeper – Solr Configuration

You can accomplish a lot without ever touching the Solr configuration. By default, Solr will break down text fields into their individual words and then convert them to lowercase. This allows the full-text queries to be case-insensitive. The ‘sunspot_solr’ gem gives us a default schema.xml file to use. schema.xml is usually the place you’ll go when you want to configure Solr at a lower level (you might also touch solrconfig.xml). This file usually lives at {RAILS ROOT}/solr/conf/schema.xml. Let’s take a look at how our text fields are being defined:
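In the default schema shipped with Sunspot, the “text” field type is defined roughly like this:

```xml
<fieldType name="text" class="solr.TextField" omitNorms="false">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```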

We have three interesting definitions in the default Solr configuration shipped with Sunspot. Here’s what they do:

  • StandardTokenizerFactory tokenizes our text. In other words, it breaks down our text field into its individual words.
  • StandardFilterFactory performs standard token normalization, such as stripping the periods from acronyms and the ’s from possessives.
  • LowerCaseFilterFactory converts all the tokenized words into their lowercase form.

New filters can be appended to the end of this text field definition. The filters run sequentially, so each filter operates on the output of the filters before it.

If we wanted our search for “cities” to return the book titled “New York: A Big City”, we would use a stemmer. The goal of any stemmer is to reduce a word to its “stem”. So, the stem of “walked”, “walking” and “walker” would be “walk”. Let’s make our search more robust by defining a Solr stemmer on our text fields:
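Appending a stemming filter to the analyzer might look like this (the Porter stemmer shown here is one common choice):

```xml
<fieldType name="text" class="solr.TextField" omitNorms="false">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>
```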

We’ve now told Solr that we want to stem any text we index after it’s first been tokenized and then converted to lowercase. We have a problem, however. Not all stemmers are intelligent enough to replace the “ies” with a “y” in the case of stemming “cities”. In fact, this job is usually left to a lemmatizer. Stemmers and lemmatizers are both language specific. That is, stemming an English word is much different than stemming a Romanian word, for obvious reasons.

If we tried to stem the word “cities”, what we would actually get is the word “citi”, which is clearly incorrect. Try stemming some different words with an online stemming tool. It feels like we’ve hit a rough spot with Solr, and we truly have. Solr doesn’t have a lemmatizer built in. We could write such a filter, but it would be a painstaking task. A better option may be to use the SynonymFilterFactory.

Digging Even Deeper – Solr Synonyms

Solr has an understanding of synonyms and allows us to define our own. You can configure Solr to return matches on different words, based on its synonyms. Such synonyms are defined in a synonyms.txt file like so:
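For our two problem stems, the file would contain:

```
citi => city
copi => copy
```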

The above synonyms.txt file tells Solr that we would like to treat the word “citi” as though it is the word “city” and the word “copi” as “copy”.

We now need to put the correct filter in our schema.xml file:
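The synonym filter goes after the stemming filter, so it sees the stemmed tokens:

```xml
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
        ignoreCase="true" expand="true"/>
```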

We’ve now told Solr that we would like to consider the stems “citi” and “copi” as their rightful lemmatization, “city” and “copy”. At this point, when a book with a name of “New York: A Big City” is indexed, the following steps happen:

  1. “New York: A Big City” is broken into its tokens: ["New", "York", "A", "Big", "City"]
  2. Each token is converted to lowercase: ["new", "york", "a", "big", "city"]

When a search for “cities” is performed, the following steps happen:

  1. “cities” is broken into tokens (only one token in this case): ["cities"]
  2. Each token is converted to lowercase (no effect in this case): ["cities"]
    • Books matching “cities” are found
  3. Each token is stemmed: ["citi"]
    • Books matching “citi” are found
  4. Each token is checked for synonyms: ["city"]
    • Books matching “city” are found – returning “New York: A Big City”

Wrapping Up

Solr is a phenomenal technology that provides powerful search capabilities. We’ve touched on some history behind indexing and the pain points of relational database searching. We’ve also looked at how we can utilize Solr in a Rails app using Sunspot and dug deep into Solr configuration to show how to handle a tough edge case. But we’ve only scratched the surface. One of the most powerful features of Solr is faceting: breaking your index into hierarchical chunks that you can drill into to find relevant results. Usually, as you drill into a category (facet), more categories are exposed, revealing deeper layers of facets. Sunspot handles faceting with finesse. Newegg and Amazon both exhibit great uses of faceting in their left-hand category navigation.

I hope this article has intrigued you by exposing some of the deeper features of Solr. There’s a lot to learn and taking it step-by-step is always the best approach. I encourage you to get comfortable with Solr so you can handle complex search queries with ease.


  • Brutuscat

    I would recommend using WordDelimiterFilterFactory when indexing to avoid issues with apostrophes https://github.com/sunspot/sunspot/pull/184

  • lda

    Did you try more accurate stemmers like solr.EnglishMinimalStemFilterFactory or solr.HunspellStemFilterFactory (they’re all available in sunspot ~> 2.0.pre)? I think that maintaining a huge file of synonyms (with incorrect words) is completely the wrong idea.

    • http://www.mikepackdev.com/ Mike Pack

      Hey lda. You’re entirely correct. Perhaps this article should have had a disclaimer. My goal was to introduce Solr config to Rubyists, not Ruby to Solrists. Discussing the difference between stemmers to handle lemmatization would have not only been dry but tangential to the topic. I introduced the synonym filter to show another Solr feature at a high level in lieu of diving deep into stemmers. I could have also discussed how to adjust the rules of the stemmers to handle this situation.

      Thanks for bringing this up. I share your concerns around unwieldy synonym files (it’s the wrong way to handle this problem, but that’s not the point). Anyone looking to solve this problem explicitly should look at the various stemmers that attempt to solve lemmatization in English.

  • http://blog.changebox.me blackanger

    Good Post.

    Could you tell us more about solr and sunspot ? like auto_index and auto_commit.

  • Brian

    Would you still recommend Solr in cases where the simpler search query (like your example of “name LIKE ‘%?%'”) would be sufficient? Or would Solr be overkill in that case? The reason I ask is that I’m wondering about how Solr affects performance, and if there is much of an effect, at what point is it wise to move to using Solr. Thanks.

    • http://www.mikepackdev.com/ Mike Pack

      Hey Brian. My rule of thumb is that Solr should be introduced when either a) your relational performance is degrading or b) you need a single denormalized table to handle some special situation.

      I *highly* recommend starting with “name LIKE ‘%?%'”. If you have 1 million records in your table, you’ll likely notice how slow the query is. If you have 300 records, this query will probably be quite fast.

      Solr isn’t cheap to run, keep that in mind. You’ll either need to set up and configure your own server or use something like WebSolr. While it won’t break the bank, it’s significant enough to be a consideration.

      In just about every single case, Solr will outperform your database for non-trivial applications. While Solr doesn’t quite provide constant performance O(1), it’s much more scalable than a relational database. So as your app grows, either introduce Solr late in the game when performance is a factor or ahead of time knowing your data needs to scale.

  • swathi

    how can we search by single letter in sunspot