By Dummy_Author

SitePoint Search Now Powered by Google. Should You Use Them Too?

Regular visitors to SitePoint may have noticed that for a while, our search functionality hasn’t quite been up to scratch. Good news everyone! Those days are over.

The previous system was put together in a time when rolling your own search was one of a very limited number of options for a site like SitePoint (indeed, for many organizations with slightly different use cases, that’s still true). I don’t want to diminish the efforts of my predecessor developers here at SitePoint — they did a great job of stitching various SitePoint properties together using Apache Solr. Unfortunately, the rest of our codebase has marched on while the search functionality has stagnated.

Time for an update.

There is one undisputed king of search: Google. Rather than building our own search engine (which would mean implementing algorithms to rank content by keyword, indexing all that content, of which there's quite a bit, and serving up results), we've partnered with Google to serve search results via Google Site Search.

Google Site Search is a facility allowing you to create your own Google-powered and Google-hosted search engine (replete with refinements and promotions) and pull down the results to your own site as an XML document. You can then style the results or manipulate them in all sorts of fun ways (we pull the results down primarily to style them, but you could do some very neat presentational mojo if you felt so inclined).
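To make the "pull down the results as XML" step concrete, here is a minimal Python sketch of extracting titles, URLs and snippets from a results document. The sample document and its element names (GSP, RES, R, U, T, S) are assumptions based on Google's XML results format and should be checked against the reference; parse_results is a hypothetical helper, not SitePoint's actual code.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample response; element names are assumed, not taken from the article.
SAMPLE_XML = """<GSP VER="3.2">
  <Q>css layout</Q>
  <RES SN="1" EN="2">
    <M>2</M>
    <R N="1">
      <U>https://www.example.com/articles/css-layout</U>
      <T>CSS Layout Basics</T>
      <S>An introduction to layout.</S>
    </R>
    <R N="2">
      <U>https://www.example.com/blogs/layout-tips</U>
      <T>Layout Tips</T>
      <S>Five quick tips.</S>
    </R>
  </RES>
</GSP>"""


def parse_results(xml_text):
    """Return one dict per <R> element found in the results document."""
    root = ET.fromstring(xml_text)
    results = []
    for r in root.iter("R"):
        results.append({
            "rank": int(r.get("N", "0")),
            "url": r.findtext("U", default=""),
            "title": r.findtext("T", default=""),
            "snippet": r.findtext("S", default=""),
        })
    return results


results = parse_results(SAMPLE_XML)
for item in results:
    print(item["rank"], item["title"], item["url"])
```

Once the results are plain dictionaries like this, styling them is just a matter of passing them to whatever templating the site already uses.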

If you don’t want to style the results in any special way, you can also use Google Custom Search Engine, the engine that powers Google Site Search. However, Site Search has some restrictions (and loses some others) compared to CSE. The differences between the two products are fairly nebulous — I’ve yet to find a blow-by-blow comparison between the two. The key points as I’ve uncovered them are:

Google Site Search

  • For businesses who want more control over how search results are produced and displayed
  • Has a price tag attached (US$250 — US$2000+ per year, depending on your needs)
  • Doesn’t have to have Google branding or ads displayed
  • On-demand indexing
  • No option to host your own Context file (see below)

Google Custom Search Engine

  • For small businesses, not-for-profits and enthusiasts who want good search at low-to-zero cost
  • Completely free (within reasonable limits)
  • Must have Google branding and ads
  • Indexing at Google’s whim
  • You can host your own Context file (see below)

There were two big gotchas that had me chasing my tail while I was working on SitePoint Search and they’re closely related.

Firstly, Google’s documentation for CSE and Site Search is patchy. It’s seriously rough. There are some features and restrictions that are, as far as I can tell, alluded to only on one line in the depths of an article on something else in the documentation. Other bits are seemingly only explained on the Custom Search blog. I’d even go so far as to say that some of the documentation on these products absolutely sucks. Google, please, clean up the documentation!

As a result of missing one line in the documentation, I didn’t realise that Google doesn’t allow Site Search customers to host their own Context files. These are XML files that specify what the search engine is to search, what to label sub-sections of that content, and how to weight some of those sections above others (among other useful options).

We wanted to weight, for example, our articles a little bit above our blogs on a typical search. If we had the ability to host our own Context file, this would be trivially simple. Alas!
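For readers who haven't seen one, the following Python sketch generates a small Context-style document that boosts one label above another. The element and attribute names used here (CustomSearchEngine, Context, BackgroundLabels, Label with mode and weight) are written from memory of the CSE context format and are assumptions to verify against Google's documentation; the label names and weights are purely illustrative.

```python
import xml.etree.ElementTree as ET


def build_context(title, label_weights):
    """Build a Context-style XML string with one BOOST label per entry.

    label_weights maps a label name to a weight, e.g. {"articles": 0.8}.
    The schema used here is an assumption, not a verified specification.
    """
    cse = ET.Element("CustomSearchEngine")
    ET.SubElement(cse, "Title").text = title
    context = ET.SubElement(cse, "Context")
    background = ET.SubElement(context, "BackgroundLabels")
    for name, weight in label_weights.items():
        ET.SubElement(background, "Label",
                      name=name, mode="BOOST", weight=str(weight))
    return ET.tostring(cse, encoding="unicode")


# Weight articles a little above blogs (illustrative numbers only).
context_xml = build_context("Example Search", {"articles": 0.8, "blogs": 0.5})
print(context_xml)
```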

The alternative was for us to generate a Context file and have a cron job post that up to the control panel. Kludgy, but testing showed it to be effective. Unfortunately we had some other weird special cases where we wanted to promote some results in special ways that conflicted with Google’s standard model, and this whole plan fell apart.

Instead, we’ve opted to do some (very light) parsing on search results once they hit our own servers to emphasise some results over others (it’s a very slight effect, and only for some arbitrary search terms, but it’s there).
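As an illustration of this kind of post-processing (not the actual SitePoint code), here is a small Python sketch that promotes results whose URL contains a given fragment, but only for specific search terms. The function name emphasise, the promotions mapping and the URLs are all hypothetical, and moving matches to the front is a blunter effect than the slight one described above.

```python
def emphasise(results, query, promotions):
    """Move promoted results to the front for selected search terms.

    results: list of dicts with a "url" key, in the order Google returned them.
    promotions: maps a lowercase search term to a URL fragment to promote.
    """
    wanted = promotions.get(query.strip().lower())
    if wanted is None:
        # No special case for this term: leave Google's ordering untouched.
        return list(results)
    # sorted() is stable, so each group keeps its original relative order.
    # The key is False (sorts first) when the URL contains the fragment.
    return sorted(results, key=lambda r: wanted not in r["url"])


sample = [
    {"url": "https://www.example.com/blogs/a"},
    {"url": "https://www.example.com/articles/b"},
    {"url": "https://www.example.com/blogs/c"},
]
reordered = emphasise(sample, " CSS ", {"css": "/articles/"})
print([r["url"] for r in reordered])
```

Relying on a stable sort means everything that isn't promoted stays in exactly the order Google chose.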

I also discovered, mostly by trial and error, the more: and less: keywords. I can’t find much documentation on these anywhere (there’s one reference to ‘more:’ in the Custom Search Engine docs, and none to ‘less:’). I wrote a bit about it on my own blog, but the gist of it is this: if you have refinement labels on specific parts of your site, you can return results only for selected refinements using ‘more:<refinementName>’, while ‘less:<refinementName>’ appears to do the opposite and push that refinement’s results down. That’s how our select boxes for articles, blogs, products, etc. work. As near as I can determine, it’s effectively applying a BOOST +1.0 or BOOST -1.0 BackgroundLabel to a given subset of results.
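A select box like that can be wired up by appending the operator to the user's query before it is sent off. The Python sketch below assumes that more: favours a refinement label and less: is its counterpart, as described above; build_query, the label names and the search URL are hypothetical.

```python
from urllib.parse import urlencode


def build_query(terms, refinement=None, operator="more"):
    """Append a more:/less: refinement operator to the search terms."""
    if operator not in ("more", "less"):
        raise ValueError("operator must be 'more' or 'less'")
    query = terms.strip()
    if refinement:
        query = f"{query} {operator}:{refinement}"
    return query


def build_search_url(base_url, terms, refinement=None, operator="more"):
    """Build a search URL; base_url is a placeholder, not Google's endpoint."""
    return f"{base_url}?{urlencode({'q': build_query(terms, refinement, operator)})}"


# "articles" stands in for whatever refinement label the select box submits.
url = build_search_url("https://www.example.com/search", "css layout", "articles")
print(url)
```

Keeping the operator in the query string rather than in a separate parameter matches the way the keywords are described here: they are typed as part of the search itself.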

Overall, I’d highly recommend Google’s search engine offerings to anyone needing to implement search for their own site, with the caveat that the documentation is going to baffle you in places, so you might have to ask a few questions to get things working the way you want. Google’s search is extremely powerful and right darn complex; I still haven’t quite got to grips with it myself. It is, however, lightning-fast and insanely cheap for what you get.

Just to finish things off, we’ve also added OpenSearch links to most of the site; if your browser supports OpenSearch (like Firefox and IE, for example!) you should be able to add SitePoint as a plug-in search engine, and quickly and easily search through our content with just a few clicks. (Thanks to Louis from Publishing for the suggestion on that one!)
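If you want to do the same on your own site, an OpenSearch description document is a short XML file that tells the browser where to send searches. The Python sketch below builds one using the OpenSearch 1.1 namespace; the site name, description and URL template are placeholders rather than SitePoint's real values.

```python
from xml.sax.saxutils import escape, quoteattr


def opensearch_description(short_name, description, url_template):
    """Return an OpenSearch 1.1 description document as a string.

    url_template must contain the {searchTerms} placeholder, which the
    browser replaces with whatever the user typed.
    """
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<OpenSearchDescription xmlns="http://a9.com/-/spec/opensearch/1.1/">\n'
        f'  <ShortName>{escape(short_name)}</ShortName>\n'
        f'  <Description>{escape(description)}</Description>\n'
        f'  <Url type="text/html" template={quoteattr(url_template)}/>\n'
        '</OpenSearchDescription>\n'
    )


doc = opensearch_description(
    "Example",
    "Search Example & friends",
    "https://www.example.com/search?q={searchTerms}",
)
print(doc)
```

Browsers discover the file through a link element in each page's head with rel="search" and type="application/opensearchdescription+xml", pointing at wherever the document is served from.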

I hope you enjoy SitePoint’s new search functionality. Our own testing has shown that it’s now much easier to find useful and relevant information on SitePoint, and I’d love to hear feedback from the community. We’ll be revisiting search again in the future and adding some new features. If you have any suggestions, I’d love to hear those too!

  • astrotim

    I agree about the patchiness. I have tried Google CSE on a client ecommerce site and noticed that it is hit and miss returning results. There are pages that I can confirm are indexed using the “site: ” search in the regular Google search engine but they don’t show up in CSE. This has prevented me from implementing CSE on the client site. Another problem is waiting for the googlebot to come past and index new content, which is a problem when there are new products that you want site visitors to be able to find via a search on your own site once they are available for sale.

    I’m sure SitePoint enjoys a fast rate of indexing, but I am interested to know if these problems affected your CSE and, if so, how you worked around them.

    • Andy White

      It’s known that the Google Search and CSE results are different – I don’t think there’s a lot that can be done about that. I suspect Google don’t want their ‘actual’ search algorithm being made vulnerable to analysis and gaming, so CSE uses a different ranking system. That said, all the content I’ve told CSE to index has been indexed – in tests, I can reasonably expect to find pretty much anything provided I know the right search terms (with the current exception being the site-wide ‘About Us’ and ‘Contact Us’ type pages, but I’ll get to those).

      If the results aren’t showing up on CSE – have you provided a sitemap.xml file? That can dramatically boost coverage of your search results!

      As for waiting on the Googlebot – that’s just the way this particular cookie crumbles, I fear. As I say in the article, the paid account can get you close to on-demand indexing, especially with a good sitemap.

      Have you gone with an alternative search engine on this client site?

  • astrotim

    That makes sense about why Google would use an alternate search engine for CSE. The Google algorithm surely is more valuable now than the Coca Cola recipe and the Colonel’s secret herbs and spices put together.

    My client’s site is running a very limited and outdated CMS to which I have limited access, so the sitemap.xml is not in the root directory and the existing search engine leaves much to be desired, thus me looking to Google CSE.

    I am close to getting them to move to a full featured ecommerce platform with a good search engine, so these problems will hopefully soon be a fading memory.

  • Qasim Zeeshan

    Nice article. I have two problems.

    1. I am trying to upload a context file but if its size is more than 500KB, it is not getting uploaded.

    2. Could you point me to the references you used to write the cron job that automatically updates the Context file?


  • qasimzee


    I am also using Google Site Search but I am unable to understand how to find the exact number of results on the first page.

    What technique did the SitePoint guys use to find the exact number of pages?

