Regular visitors to SitePoint may have noticed that for a while, our search functionality hasn’t quite been up to scratch. Good news everyone! Those days are over.
The previous system was put together in a time when rolling your own search was one of a very limited number of options for a site like SitePoint (indeed, for many organizations with slightly different use cases, that’s still true). I don’t want to diminish the efforts of my predecessor developers here at SitePoint — they did a great job of stitching various SitePoint properties together using Apache Solr. Unfortunately, the rest of our codebase has marched on while the search functionality has stagnated.
Time for an update.
There is one undisputed king of search: Google. Rather than building our own search engine (implementing algorithms to rank content based on keywords, indexing all that content, of which there's quite a bit, and serving up results), we've partnered with Google to serve search results via Google Site Search.
Google Site Search is a facility allowing you to create your own Google-powered and Google-hosted search engine (replete with refinements and promotions) and pull down the results to your own site as an XML document. You can then style the results or manipulate them in all sorts of fun ways (we pull the results down primarily to style them, but you could do some very neat presentational mojo if you felt so inclined).
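As a rough illustration of pulling the XML down and restyling it, here is a minimal Python sketch. The sample document and its element names (GSP, RES, R, U, T, S) are assumptions about the general shape of the results XML, and `parse_results` and `render_list` are hypothetical helper names, not our production code.

```python
import html
import xml.etree.ElementTree as ET

# Assumed sample response. The element names are illustrative only;
# check the XML your own engine returns before relying on them.
SAMPLE_XML = """<GSP VER="3.2">
  <Q>css layout</Q>
  <RES SN="1" EN="2">
    <R N="1">
      <U>http://www.example.com/articles/css-layout</U>
      <T>CSS Layout Basics</T>
      <S>An introduction to layout with CSS.</S>
    </R>
    <R N="2">
      <U>http://www.example.com/blogs/floats</U>
      <T>Floats &amp; Clears</T>
      <S>Notes on clearing floats.</S>
    </R>
  </RES>
</GSP>"""


def parse_results(xml_text):
    """Return one dict per <R> element with its URL, title and snippet."""
    root = ET.fromstring(xml_text)
    return [
        {
            "url": r.findtext("U", default=""),
            "title": r.findtext("T", default=""),
            "snippet": r.findtext("S", default=""),
        }
        for r in root.iter("R")
    ]


def render_list(results):
    """Render results as a simple HTML list, escaping every value."""
    items = [
        '<li><a href="{url}">{title}</a><p>{snippet}</p></li>'.format(
            url=html.escape(r["url"], quote=True),
            title=html.escape(r["title"]),
            snippet=html.escape(r["snippet"]),
        )
        for r in results
    ]
    return "<ul>\n" + "\n".join(items) + "\n</ul>"
```

Once the results are plain dictionaries, the markup is entirely yours, which is the main reason to pull the XML down rather than embed Google's own results page.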
If you don’t want to style the results in any special way, you can also use Google Custom Search Engine (CSE), the engine that powers Google Site Search. However, Site Search adds some restrictions (and lifts others) compared to CSE. The differences between the two products are fairly nebulous, and I’ve yet to find a blow-by-blow comparison of the two. The key points as I’ve uncovered them are:
Google Site Search
- For businesses who want more control over how search results are produced and displayed
- Has a price tag attached (US$250 to US$2000+ per year, depending on your needs)
- Doesn’t have to have Google branding or ads displayed
- On-demand indexing
- No option to host your own Context file (see below)
Google Custom Search Engine
- For small businesses, not-for-profits and enthusiasts who want good search at low-to-zero cost
- Completely free (within reasonable limits)
- Must have Google branding and ads
- Indexing at Google’s whim
- You can host your own Context file (see below)
There were two big gotchas that had me chasing my tail while I was working on SitePoint Search and they’re closely related.
Firstly, Google’s documentation for CSE and Site Search is patchy. It’s seriously rough. There are some features and restrictions that are, as far as I can tell, alluded to only on one line in the depths of an article on something else in the documentation. Other bits are seemingly only explained on the Custom Search blog. I’d even go so far as to say that some of the documentation on these products absolutely sucks. Google, please, clean up the documentation!
As a result of missing one line in the documentation, I didn’t realise that Google doesn’t allow Site Search customers to host their own Context files. These are XML files that specify what the search engine is to search, what to label sub-sections of that content, and how to weight some of those sections above others (among other useful options).
We wanted to weight, for example, our articles a little bit above our blogs on a typical search. If we had the ability to host our own Context file, this would be trivially simple. Alas!
The alternative was for us to generate a Context file and have a cron job post that up to the control panel. Kludgy, but testing showed it to be effective. Unfortunately we had some other weird special cases where we wanted to promote some results in special ways that conflicted with Google’s standard model, and this whole plan fell apart.
Instead, we’ve opted to do some (very light) parsing on search results once they hit our own servers to emphasise some results over others (it’s a very slight effect, and only for some arbitrary search terms, but it’s there).
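To give a feel for what "very light" post-processing can look like, here is a minimal Python sketch. It is not our actual code: the `PROMOTIONS` table, the `Result` type and the `nudge` function are hypothetical, and the rule used here (a matching result moves up by at most one place) is just one assumed way of keeping the effect slight.

```python
from dataclasses import dataclass


@dataclass
class Result:
    url: str
    title: str


# Hypothetical rules: search term -> URL fragment to favour slightly.
PROMOTIONS = {"css": "/articles/"}


def nudge(query, results):
    """Return a copy of results where favoured items move up one place at most."""
    fragment = PROMOTIONS.get(query.strip().lower())
    if fragment is None:
        return list(results)  # no rule for this term: keep Google's order
    out = list(results)
    for i in range(1, len(out)):
        # Swap only when a favoured result sits directly below an unfavoured one.
        if fragment in out[i].url and fragment not in out[i - 1].url:
            out[i - 1], out[i] = out[i], out[i - 1]
    return out
```

Because each favoured result climbs a single position, Google's ranking still does almost all of the work; the rules only break near-ties in the direction you prefer.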
I also discovered, mostly by trial and error, the more: and less: keywords. I can’t find much documentation on these anywhere (there’s one reference to ‘more:’ in the Custom Search Engine docs, and none to ‘less:’). I wrote a bit about it on my own blog, but the gist of it is this: if you have refinement labels on specific parts of your site, you can return results only for selected refinements using ‘more:<refinementName>’. That’s how our select boxes for articles, blogs, products, etc. work. It’s effectively applying a BOOST +1.0 or BOOST -1.0 BackgroundLabel to a given subset of results, as near as I can determine.
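A select box like ours only needs to append the operator to whatever the user typed. The Python sketch below shows the idea using the documented more: form; the `build_query` helper, the `q` parameter name and the `articles` label are assumptions for illustration rather than a description of Google's API.

```python
from urllib.parse import urlencode


def build_query(terms, refinement=None):
    """Build a URL-encoded query string, optionally scoped to one refinement label."""
    q = terms.strip()
    if refinement:
        # Append the operator as a separate token after the user's terms.
        q = f"{q} more:{refinement}"
    return urlencode({"q": q})
```

Calling `build_query("css layout", "articles")` produces a query whose decoded `q` value is `css layout more:articles`; leaving the refinement out sends the user's terms unchanged.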
Overall, I’d highly recommend Google’s search engine offerings to anyone needing to implement search for their own site, with the caveat that the documentation is going to baffle you in places, so you might have to ask a few questions to get things working the way you want. Google’s search is right darn complex, and I still haven’t quite got to grips with it myself. It is, however, extremely powerful, lightning-fast, and insanely cheap for what you get.
Just to finish things off, we’ve also added OpenSearch links to most of sitepoint.com; if your browser supports OpenSearch (like Firefox and IE, for example!) you should be able to add SitePoint as a plug-in search engine, and quickly and easily search through our content with just a few clicks. (Thanks to Louis from Publishing for the suggestion on that one!)
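For anyone wanting to do the same on their own site, an OpenSearch description is just a small XML document that the browser discovers through a link element in the page head. Here is a minimal Python sketch that generates one. The short name, description and example.com URL template are placeholders, not SitePoint's real values, and `build_description` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

OPENSEARCH_NS = "http://a9.com/-/spec/opensearch/1.1/"


def build_description(short_name, description, template):
    """Return an OpenSearch 1.1 description document as a string."""
    # Serialise with OpenSearch as the default namespace (no prefix).
    ET.register_namespace("", OPENSEARCH_NS)
    root = ET.Element(f"{{{OPENSEARCH_NS}}}OpenSearchDescription")
    ET.SubElement(root, f"{{{OPENSEARCH_NS}}}ShortName").text = short_name
    ET.SubElement(root, f"{{{OPENSEARCH_NS}}}Description").text = description
    # {searchTerms} in the template is replaced by the browser at search time.
    ET.SubElement(
        root,
        f"{{{OPENSEARCH_NS}}}Url",
        {"type": "text/html", "template": template},
    )
    return ET.tostring(root, encoding="unicode")


# Placeholder values for illustration only.
EXAMPLE = build_description(
    "Example Search",
    "Search example.com",
    "https://www.example.com/search?q={searchTerms}",
)
```

The generated file is then advertised from each page with a link element whose rel is "search" and whose type is "application/opensearchdescription+xml".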
I hope you enjoy SitePoint’s new search functionality. Our own testing has shown that it’s now much easier to find useful and relevant information on sitepoint.com, and I’d love to hear feedback from the community. We’ll be revisiting search again in the future and adding some new features — if you have any suggestions, I’d love to hear those too!