Article: Turning a Crawled Website into a Search Engine with PHP

An excerpt from the SitePoint article, by @swader

In the previous part of this tutorial, we used Diffbot to set up a crawljob which would eventually harvest SitePoint’s content into a data collection, fully searchable by Diffbot’s Search API. We also demonstrated those searching capabilities by applying some common filters and listing the results.

In this part, we’ll build a GUI simple enough for the average Joe to use, giving us a relatively pretty, functional, and lightweight but detailed SitePoint search engine. What’s more, we won’t be using a framework, just a total of three libraries to build the entire application.

You can see the demo application here.

This tutorial is completely standalone, and as such if you choose to follow along, you can start with a fresh Homestead Improved instance. Note that in order to actually fully use what we build, you need a Diffbot account with Crawljob and Search API functionality.


Moving on, I’ll assume you’re using a Vagrant machine. If not, find out why you should, then come back.

On a fresh Homestead Improved VM, the bootstrapping procedure is as follows:

composer global require beelab/bowerphp:dev-master
mkdir sp_search
cd sp_search
mkdir public cache template template/twig app
composer require swader/diffbot-php-client
composer require twig/twig
composer require symfony/var-dumper --dev

In order, this:

  • installs BowerPHP globally, so we can use it on the entire VM.

  • creates the project’s root folder and several subfolders.

  • installs the Diffbot PHP client, which we’ll use to make all calls to the API and to iterate through the results.

  • installs the Twig templating engine, so we’re not echoing out HTML in PHP like peasants :)

  • installs VarDumper in dev mode, so we can easily debug while developing.

To bootstrap the “front end” part of our app, we do the following:

cd public
mkdir assets assets/{css,js,img}
bowerphp install bootstrap
bowerphp install normalize.css
touch assets/css/main.css assets/js/main.js index.php ../token.php

I also used iconifier to generate some icons, and grabbed a big SitePoint logo image to use as the site’s background, but that’s all entirely optional.

The above commands make some folders and blank files and install Bootstrap. They also create the front controller (index.php) of our little search app. We can set up this file like so:

<?php

use SitePoint\Helpers\SearchHelper;
use Swader\Diffbot\Diffbot;

require_once '../vendor/autoload.php';
require_once '../token.php';

// Set up Twig to load templates from template/twig
$loader = new Twig_Loader_Filesystem(__DIR__ . '/../template/twig');
$twig = new Twig_Environment($loader
   , array('cache' => false, 'debug' => true)
);
$vars = [];

// Get query params from request
parse_str($_SERVER['QUERY_STRING'], $queryParams);

// Check if the search form was submitted
if (isset($queryParams['search'])) {

    $diffbot = new Diffbot(DIFFBOT_TOKEN);

    // Building the search string
    $string = '';

    // Basics: prepare the search call, but don't execute it yet
    $search = $diffbot->search($string);

    // Pagination
    // ...
}

echo $twig->render('home.twig', $vars);

Essentially, we set up Twig, grab the $_GET contents from the query string, and initialize a Diffbot search call (but never execute it). Finally, we make the template file template/twig/home.twig; for now it only needs some placeholder content, such as the word “Hello”.
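
To make the query-parameter step a bit more concrete, here is a minimal sketch of what parse_str does with a submitted form. The query string is made up for illustration; only the q and search parameter names match the form fields used later in this tutorial.

```php
<?php
// Made-up query string, roughly what the browser sends after a form submit
$queryString = 'q=diffbot+php&search=Search';

// parse_str() URL-decodes the string into an associative array
parse_str($queryString, $queryParams);

// The submit button's name tells us whether the form was submitted
$submitted = isset($queryParams['search']);

// Fall back to an empty string when the main field is missing
$q = isset($queryParams['q']) ? trim($queryParams['q']) : '';

var_dump($submitted, $q);
```

Reading from $_SERVER['QUERY_STRING'] this way gives us the same data as $_GET, just in a variable we control.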


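If you try to run this “app” now, you should see “Hello”. Note that with 'cache' => false in the front controller, Twig does not write compiled templates to disk; if you want cached versions to appear in the cache folder, set the cache option to that folder’s path instead. Be sure to set up the token.php file first – it needs the contents: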

<?php
define('DIFFBOT_TOKEN', 'my_token');

Then, we add this file to the project’s .gitignore file. Feel free to use this one and update it as needed. This is so we don’t accidentally commit our Diffbot token to GitHub – a stolen token can become very expensive.

Bootstrapping done, let’s get to the meat of things now.

Front end

The idea (at this point) is to have one main search field, like Google, accepting almost raw Search API queries, and three plain old text fields into which users can enter comma-separated values:

  • “Author(s)” will support authors. Entering several will do an “OR” search – as in, articles written either by author 1, or author 2, or author 3, etc…

  • “Keywords (any)” will search for any of the given keywords in any of the Diffbot-extracted fields. This includes body, title, meta, even author, etc.

  • “Keywords (all)” searches for keywords, too, but every one of them must appear somewhere in the Diffbot-extracted fields.
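
As a rough sketch of how those three fields could be turned into one search string, the snippet below splits each comma-separated input and joins the values with OR or AND. The helper names splitCsv and buildGroup are hypothetical, “Jane Doe” is a placeholder author, and the author:"..." and parenthesized boolean syntax is an assumption about the Search API query format that should be checked against Diffbot’s documentation. It assumes PHP 7.1 or newer.

```php
<?php
// Sketch only: helper names and the exact query syntax are assumptions.

/** Split "a, b, c" into trimmed, non-empty values. */
function splitCsv(string $input): array
{
    $values = array_map('trim', explode(',', $input));
    return array_values(array_filter($values, 'strlen'));
}

/** Join values into one parenthesized group: (x OR y) or (x AND y). */
function buildGroup(array $values, string $operator, ?string $field = null): string
{
    if (count($values) === 0) {
        return '';
    }
    $parts = array_map(function (string $value) use ($field): string {
        // Drop embedded quotes so the value cannot break out of its phrase
        $quoted = '"' . str_replace('"', '', $value) . '"';
        return $field === null ? $quoted : $field . ':' . $quoted;
    }, $values);

    return '(' . implode(' ' . $operator . ' ', $parts) . ')';
}

$groups = [
    buildGroup(splitCsv('Bruno Skvorc, Jane Doe'), 'OR', 'author'), // Author(s)
    buildGroup(splitCsv('sitepoint, diffbot'), 'OR'),               // Keywords (any)
    buildGroup(splitCsv('php, search'), 'AND'),                     // Keywords (all)
];

// Skip empty groups, then require every remaining group to match
$query = implode(' AND ', array_filter($groups));

echo $query, PHP_EOL;
```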

Let’s update our home.twig file, inspired by HTML5 Boilerplate:

<!doctype html>
<html lang="en">
<head>
    <meta charset="utf-8">
    <title>SitePoint Search</title>
    <link rel="apple-touch-icon" href="/apple-touch-icon.png">
    <link rel="stylesheet" href="/bower_components/normalize.css/normalize.css">
    <link rel="stylesheet"
          href="/bower_components/bootstrap/dist/css/bootstrap.min.css">
    <link rel="stylesheet" href="/assets/css/main.css">
</head>
<body>
<header>
    <h1>SitePoint search</h1>
</header>
<div class="content">
    <div class="search-form">
        <form id="main-form" class="submit-once">
            <div class="main-search form-group">
                <div class="input-group">
                    <input class="form-control" type="text" name="q" id="q"
                           placeholder="Full search query"/>
                    <span class="input-group-btn">
                        {# Help button; hook it up to the examples modal #}
                        <button class="btn btn-default" type="button">?</button>
                    </span>
                </div>
                <a href="#" class="small detailed-search">&gt;&gt; Toggle Detailed</a>
            </div>
            {# name attributes below mirror the ids; adjust if your PHP expects other keys #}
            <div class="detailed-search-group" style="display: none;">
                <div class="form-group">
                    <label for="authorinput">Author(s): </label><input
                            class="form-control" type="text"
                            id="authorinput" name="authorinput"
                            placeholder="Bruno Skvorc"/>
                </div>
                <div class="form-group">
                    <label for="kanyinput">Keywords (any): </label><input
                            class="form-control" type="text"
                            id="kanyinput" name="kanyinput"
                            placeholder="sitepoint, diffbot, whatever"/>
                </div>
                <div class="form-group">
                    <label for="kallinput">Keywords (all): </label><input
                            class="form-control" type="text"
                            id="kallinput" name="kallinput"
                            placeholder="sitepoint, diffbot, whatever"/>
                </div>
                <a href="#" class="small detailed-search">&gt;&gt; Toggle Detailed</a>
            </div>
            <div class="form-group">
                <input id="submit" class="btn btn-default" type="submit"
                       value="Search" name="search"/>
            </div>
        </form>
        {% include 'results.twig' %}
    </div>
</div>
<script src="/bower_components/jquery/dist/jquery.min.js"></script>
<script src="/bower_components/bootstrap/dist/js/bootstrap.min.js"></script>
<script src="/assets/js/main.js"></script>
{% include 'google-analytics.twig' %}
<footer>
    <p><a href="#">What's this all about?</a></p>
    <p>Built by @bitfalls for SitePoint. Hosted on DigitalOcean.</p>
</footer>
{% include 'modal-examples.twig' %}
</body>
</html>

Note that I also extracted some tedious bits of HTML into sub-templates that get included: the Google Analytics snippet, the modal with search query examples, and most importantly, the results template which we’ll use to output results later. Only the results template is required, so make the file template/twig/results.twig, even if it’s empty or just has the contents “Test”. The other includes can be removed from the home.twig template altogether, or you can grab them from the GitHub repo.

Let’s now add to the whole thing a little bit of CSS flexbox magic, background imagery, and basic jQuery-isms to make the elements get along nicely. For example, we use a form class to prevent double submits, and we also use localStorage to remember if the user prefers detailed or regular searching:

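// main.js
$(document).ready(function () {
    // Ignore repeat submits of a form that was already sent
    $('form.submit-once').submit(function (e) {
        if ($(this).hasClass('form-submitted')) { e.preventDefault(); return; }
        $(this).addClass('form-submitted');
    });
    var dsg = $('.detailed-search-group');
    var ms = $('.main-search');
    if (localStorage.getItem('detailed-on') == "true") { dsg.show(); ms.hide(); }
    else { dsg.hide(); ms.show(); }
    $(".detailed-search").click(function (e) {
        ms.toggle(); dsg.toggle();
        localStorage.setItem('detailed-on', dsg.is(':visible'));
    });
});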

/* main.css */
body {
    display: flex;
    min-height: 100vh;
    flex-direction: column;
    font-family: arial,sans-serif;
}
div.content {
    display: flex;
    flex: 1;
    align-items: center;
    justify-content: center;
}
div.content.what {
    max-width: 500px;
    margin: auto;
}
div.hidden {
    display: none;
}
.search-form {
    width: 80%;
}
.results {
    max-width: 600px;
    font-size: small;
}
footer {
    padding: 1.5rem;
    background: #404040;
    color: #999;
    font-size: .85em;
    text-align: center;
    z-index: 1;
}
header {
    text-align: center;
}
/* Optional full-page background image */
img.bg {
    /* Set rules to fill background */
    min-height: 100%;
    min-width: 1024px;
    /* Set up proportionate scaling */
    width: 100%;
    height: auto;
    /* Set up positioning */
    position: fixed;
    top: -60px;
    left: 0;
    z-index: -1000;
    opacity: 0.05;
    filter: alpha(opacity=5);
}
@media screen and (max-width: 1024px) { /* Specific to this particular image */
    img.bg {
        left: 50%;
        margin-left: -512px;   /* 50% */
    }
}

With that, we have our basic interface (showing the “Test” text from a mocked results.twig).

There is one main search field, similar to Google, which accepts any keyword or phrase constructed in a Search API friendly way. Think of it like direct access to the Search API. See the examples modal for what it’s about.

By clicking on “Toggle Detailed”, however, the situation changes and we have our individual search fields with which we can get more precise results. Let’s wire these fields up now.

Continue reading this article on SitePoint!
