Using Solarium with SOLR for Search – Setup
Apache’s SOLR is an enterprise-level search platform based on Apache Lucene. It provides a powerful full-text search along with advanced features such as faceted search, result highlighting and geospatial search. It’s extremely scalable and fault tolerant.
Well-known websites said to use SOLR to power their search functions include Digg, Netflix, Instagram and Whitehouse.gov (source).
While SOLR is written in Java, it’s accessible via HTTP, making it possible to integrate with whatever programming language you prefer. If you’re using PHP then the Solarium Project makes integration even easier, providing a level of abstraction over the underlying requests which enables you to use SOLR as if it were a native implementation running within your application.
In this series, I’m going to introduce both SOLR and Solarium side-by-side. We’ll begin by installing and configuring SOLR and creating a search index. Then, we’ll look at how to index documents. Next, we’ll implement a basic search and then expand it with some more advanced features such as faceted search, result highlighting and suggestions.
As we go along, we’re going to build a simple application for searching a collection of movies. You can grab the source code here, or see an online demo here.
Basic Concepts and Operation
Before we delve into the implementation, it’s worth looking at a few basic concepts and an overall view of what will happen.
SOLR is a Java application which runs as a web service, typically in a servlet container such as Tomcat, Glassfish or JBoss. You can manipulate and query it over HTTP using XML, JSON, CSV or a binary format, so you can use any programming language for your application. However, the Solarium library provides a level of abstraction, allowing you to call methods as if SOLR were a native implementation. For the purposes of this tutorial we’ll be running SOLR on the same machine as our application, but in practice it could be located on a separate server.
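Because SOLR speaks plain HTTP, a query ultimately boils down to a URL. The following is a minimal sketch in Python (used here purely for illustration; it isn't part of the application we're building) of how such a request might be assembled. The host, port 8983, the core name collection1 and the /select handler with its q, rows and wt parameters are assumptions based on a default example installation, so adjust them to match your own setup.

```python
from urllib.parse import urlencode


def build_select_url(query, core="collection1", rows=10,
                     base="http://localhost:8983/solr"):
    """Build the URL for a basic search request against one core.

    The base URL, core name and /select handler are assumptions that match
    a default example installation; change them for your own server.
    """
    params = urlencode({"q": query, "rows": rows, "wt": "json"})
    return f"{base}/{core}/select?{params}"


# Search the (hypothetical) title field and ask for a JSON response.
url = build_select_url("title:introduction")
print(url)
```

A library such as Solarium builds and sends requests of this kind for you, which is what lets you work with method calls rather than URLs.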
SOLR creates a search index of documents. Often that mirrors what we might consider a document in real-life; an article, blog post or even a full book. However a document can also represent any object applicable to your application – a product, a place, an event – or in our example application, a movie.
At its most basic, SOLR allows you to perform full-text searches on documents. Think search engines; you’ll typically search for a keyword, a phrase or a full title. You can only get so far with SQL’s LIKE clause; that’s where full-text search comes in.
You can also attach additional information to an indexed search document that doesn’t necessarily get picked up by a text-based search; for example, you can incorporate the price of a product, the number of rooms in a property or the date an item was added to the database.
Facets are one of the most useful features of SOLR. You’ll probably have seen faceted search if you’ve ever shopped online; facets allow you to “drill down” search results by applying “filters”. For example, having searched an online bookstore you might use filters to limit the results to those books by a particular author, in a particular genre or in a particular format.
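To make the idea concrete, here's a small sketch of what faceting amounts to: counting the values of a field across a result set, then narrowing the results when a value is picked. It runs against a made-up, in-memory list of books rather than SOLR itself, and the field names and helper functions are hypothetical; SOLR computes these counts for you as part of a search.

```python
from collections import Counter

# Hypothetical search results; in practice these would come back from SOLR.
results = [
    {"title": "Book A", "genre": "Crime", "format": "Paperback"},
    {"title": "Book B", "genre": "Crime", "format": "Hardback"},
    {"title": "Book C", "genre": "Travel", "format": "Paperback"},
]


def facet_counts(docs, field):
    """Count how many documents carry each value of the given field."""
    return Counter(doc[field] for doc in docs if field in doc)


def apply_filter(docs, field, value):
    """Drill down: keep only documents whose field equals the chosen value."""
    return [doc for doc in docs if doc.get(field) == value]


genres = facet_counts(results, "genre")          # counts per genre
crime = apply_filter(results, "genre", "Crime")  # user picks the "Crime" filter
formats = facet_counts(crime, "format")          # counts within the narrowed set
print(genres, formats)
```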
An instance of SOLR runs with one or more cores. A core is a collection of configuration and indexes, each with its own schema. Typically, a single instance is specific to a particular application. Since different types of content can have very different structures and information – for example, consider the difference between a product, an article and a user – an application often has multiple cores within a SOLR instance.
I’m going to provide instructions for how to set up SOLR on a Mac; for other operating systems, consult the documentation – or alternatively, you can download Blaze, an appliance with SOLR pre-installed.
The easiest way to install SOLR on a Mac is to use Homebrew:
brew update
brew install solr
This will install the software in a directory such as /usr/local/Cellar/solr/4.5.0, depending on what version of the software you’re using.
To start the server using the provided Java archive (JAR):
cd /usr/local/Cellar/solr/4.5.0/libexec/example
java -jar start.jar
To verify that the installation is successful, try accessing the admin interface in your web browser:
If you see an admin dashboard with the Apache SOLR logo top-left, the server is up and running.
TIP: to stop SOLR – which you’ll need to do whenever you change the configuration, as we’re about to do shortly – simply press CTRL + C.
(Linux instructions: http://www.lullabot.com/blog/article/installing-solr-use-drupal)
Setting Up the Schema
Probably the easiest way to get started with SOLR is to copy the default directory, then customize it.
Copy the solr directory from libexec/example; here, we’re creating a new SOLR core called “movies”:
cd /usr/local/Cellar/solr/4.5.0/libexec/example
cp -R solr movies
We’ll look at the configuration file, movies/collection1/conf/solrconfig.xml, later on. For now, what we’re really interested in is the schema, which defines the fields on the documents we’re indexing, along with how they’re handled.
The file that defines this is
If you open up the one you’ve just copied over you’ll see that it not only contains some useful defaults, but it’s also extensively commented to help you understand how to customize it.
The schema configuration file is responsible for two primary aspects: fields and types. Types are simply data types, and under the hood they map type names, such as those for integers, dates and strings, to the underlying Java classes used in the implementation, for example solr.TextField. The types configuration also defines the behavior of tokenizers, analyzers and filters.
Here are some examples of basic types:
<fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true" />
<fieldType name="long" class="solr.TrieLongField" precisionStep="0" positionIncrementGap="0"/>
<fieldType name="int" class="solr.TrieIntField" precisionStep="0" positionIncrementGap="0"/>
<fieldType name="tdouble" class="solr.TrieDoubleField" precisionStep="8" omitNorms="true" positionIncrementGap="0"/>
The string type warrants a closer look, because there’s a gotcha there. When you use a field as a string, any data gets stored exactly as you enter it. Furthermore, in order for a query to match it, it must be identical. For example, suppose you had an article title as a string, and inserted a document entitled “An Introduction to SOLR”. In any proper search implementation, you’d expect to find the article with a query such as “SOLR introduction”, not to mention “an introduction to Solr”. Exact-match behavior is actually useful in some cases, such as faceted search, but if you don’t want it you can use a combination of tokenizers and filters.
Tokenizers split text into chunks – usually separate words. Filters transform text in some way. To illustrate, let’s look at a sensible default for text:
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
First, you’ll notice that we’re defining behavior at indexing time – in other words, how data is transformed when you add a document – and at query time. In this example, the LowerCaseFilterFactory converts data to lower case both as it’s indexed and when it’s queried, so capitalization becomes irrelevant and we can do a like-for-like comparison. In our example, “introduction” will match “Introduction”, and “SOLR” will match “Solr”.
StopFilterFactory is used to strip out stop words: common words that are excluded either because they’re not relevant to search or for efficiency, such as “a”, “the” and “and”. There’s a good, exhaustive list of stop words here. In the code above the stop words are configured in a separate text file.
The fields section is used to define the available fields, their types and additional information such as whether they have multiple values, if they should be indexed and more.
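To see why this chain makes the earlier “An Introduction to SOLR” example searchable, here's a rough Python imitation of the index-time analyzer: tokenize, drop stop words, then lowercase. It's a sketch of the idea only, not how SOLR implements it. The stop word list is hypothetical (the real one lives in stopwords.txt), the regular expression is a crude stand-in for the standard tokenizer, the query-time synonym filter is left out, and "every query term must appear" is a simplified matching rule chosen for the illustration.

```python
import re

# Hypothetical stop words; the real list is read from stopwords.txt.
STOP_WORDS = {"a", "an", "and", "the", "to"}


def analyze(text):
    """Imitate the analyzer chain: tokenize, drop stop words, lowercase."""
    tokens = re.findall(r"\w+", text)                            # tokenizer
    tokens = [t for t in tokens if t.lower() not in STOP_WORDS]  # stop filter, ignoring case
    return [t.lower() for t in tokens]                           # lowercase filter


def text_match(document, query):
    """Simplified match: every analyzed query term occurs in the document."""
    return set(analyze(query)) <= set(analyze(document))


def string_match(document, query):
    """A string field only matches when the values are identical."""
    return document == query


title = "An Introduction to SOLR"
print(analyze(title))
print(text_match(title, "SOLR introduction"))
print(string_match(title, "SOLR introduction"))
```

Because both the title and the query pass through the same steps, word order, capitalization and stop words no longer get in the way, whereas the string comparison fails on anything but an identical value.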
We’re not going to try modifying or extending the types definitions – that’s outside the scope of this tutorial – but instead what we’re going to look at are the fields definitions.
Broadly speaking, there are two approaches to defining the structure of your documents. The first is to explicitly define all the possible fields. The second is to use dynamic fields, which enable you to add fields on-the-fly providing you adhere to certain naming conventions. For example, given the following dynamic field definition:
<dynamicField name="*_s" type="string" indexed="true" stored="true" />
…you could add, say, a single value property named author_s to the document and it will be stored as a string, without having pre-configured it. (Note: the “author” and “s” parts are entirely separate, so don’t read it as the plural “authors”.) The following defines a multi-valued string field:
<dynamicField name="*_ss" type="string" indexed="true" stored="true" multiValued="true"/>
…so, for example, you could add categories by using
If you attempt to add a document to the search index which contains properties which haven’t been explicitly defined – or, if you’re using dynamic properties and don’t adhere to these conventions – then SOLR will produce an error. To alter this behavior, locate and uncomment (or add) the following line:
<dynamicField name="*" type="ignored" multiValued="true" />
This line indicates that properties which haven’t been previously defined should be silently ignored instead of generating an error. Because they’re ignored, however, note that they won’t be indexed or stored, so they won’t have any impact on any searches you might run.
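The naming-convention lookup can be pictured with a short Python sketch. This is only a model of the behavior described above, not SOLR's code: the explicit field table is a hypothetical subset, tags_ss is a made-up field name, the ignore_unknown flag stands in for the catch-all "*" rule, and trying longer patterns first is an assumption made so that the most specific pattern wins.

```python
from fnmatch import fnmatchcase

# A hypothetical explicit field plus the two dynamic field patterns
# discussed above, each mapped to (type, multi-valued?).
EXPLICIT_FIELDS = {"id": ("string", False)}
DYNAMIC_FIELDS = {"*_s": ("string", False), "*_ss": ("string", True)}


def resolve_field(name, ignore_unknown=False):
    """Return (type, multi_valued) for a field name, or raise if unknown."""
    if name in EXPLICIT_FIELDS:
        return EXPLICIT_FIELDS[name]
    # Try longer (more specific) patterns first.
    for pattern in sorted(DYNAMIC_FIELDS, key=len, reverse=True):
        if fnmatchcase(name, pattern):
            return DYNAMIC_FIELDS[pattern]
    if ignore_unknown:
        # Stands in for the catch-all "*" rule with the "ignored" type.
        return ("ignored", True)
    raise ValueError(f"unknown field: {name}")


print(resolve_field("author_s"))
print(resolve_field("tags_ss"))
print(resolve_field("budget", ignore_unknown=True))
```

Calling resolve_field("budget") without the flag raises an error, which mirrors the default behavior when a document contains an undefined property.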
For the purposes of this tutorial, we’re going to explicitly define the fields we want.
<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
<field name="title" type="text_general" indexed="true" stored="true"/>
<field name="synopsis" type="text_general" indexed="true" stored="true" omitNorms="true"/>
<field name="rating" type="string" indexed="true" stored="true" />
<field name="cast" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="year" type="int" indexed="true" stored="true" />
<field name="runtime" type="int" indexed="true" stored="true" />
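As a quick illustration of what these definitions imply, here's the rough shape of a document that would fit them, written as a Python dictionary with a small sanity check. The movie itself is entirely made up, and check_document is a hypothetical helper that only mirrors three of the rules (id is required, only cast is multi-valued, year and runtime are integers); SOLR performs its own validation when you index.

```python
import json

# A made-up movie, used purely to show the shape of a document.
movie = {
    "id": "example-1",
    "title": "An Example Movie",
    "synopsis": "A placeholder synopsis for illustration.",
    "rating": "PG",
    "cast": ["First Actor", "Second Actor"],  # multiValued, so a list
    "year": 2013,
    "runtime": 95,
}

MULTI_VALUED = {"cast"}
INT_FIELDS = {"year", "runtime"}


def check_document(doc):
    """Return a list of problems found when comparing doc with the fields."""
    problems = []
    if "id" not in doc:
        problems.append("missing required field: id")
    for name, value in doc.items():
        if isinstance(value, list) and name not in MULTI_VALUED:
            problems.append(f"{name} is not multi-valued")
        if name in INT_FIELDS and not isinstance(value, int):
            problems.append(f"{name} should be an integer")
    return problems


print(check_document(movie))
print(json.dumps(movie, indent=2))
```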
The following line is required by SOLR:
<field name="_version_" type="long" indexed="true" stored="true"/>
The following field isn’t used; however, because there are a number of references to it in solrconfig.xml, it’s a good idea to leave it in for now:
<field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>
We also need to specify which field is the unique identifier – think primary key in SQL terminology – as follows:
Now we need to tell SOLR which configuration to use. Stop the server if it’s currently running (CTRL + C), and this time run it with the solr.solr.home property pointing at the new directory:
cd /usr/local/Cellar/solr/4.5.0/libexec/example
java -jar start.jar -Dsolr.solr.home=movies
That’s it for the first part, where we’ve started to look at SOLR and Solarium. We’ve got SOLR installed, and a schema set up. In the next part we’ll set up our application along with Solarium and index some data.