Article: Crawling and Searching Entire Domains with Diffbot

An excerpt from http://www.sitepoint.com/crawling-searching-entire-domains-diffbot/, by @swader

In this tutorial, I’ll show you how to build a custom SitePoint search engine that far outdoes anything WordPress could ever put out. We’ll be using Diffbot as a service to extract structured data from SitePoint automatically, and the matching Diffbot PHP API client (swader/diffbot-php-client) to do both the crawling and the searching.

I’ll also be using my trusty Homestead Improved environment for a clean project, so I can experiment in a VM that’s dedicated to this project and this project alone.

What’s what?

To make a SitePoint search engine, we need to do the following:

  1. Build a Crawljob which will index and process the entire SitePoint.com domain and keep itself up to date with newly published content.
  2. Build a GUI for submitting search queries to the saved set produced by this crawljob. Searching is done via the Search API. We’ll do this in a followup post.

A Diffbot Crawljob does the following:

  1. It spiders a site for URLs to process. This does not mean processing them: it means looking for links on all the pages it can find, starting from the domain you originally passed in as a seed. For the difference between crawling and processing, see Diffbot’s documentation.
  2. It processes the pages found on the spidered URLs with the designated API engine. For example, using the Product API, it processes all the products it finds on Amazon.com and saves them into a structured database of items on offer. The sketch after this list shows how this split maps onto the client we’ll be using.
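To make that distinction concrete before we configure anything, here’s a minimal sketch that uses only calls appearing later in this article: the crawljob handles the spidering, while a separate API instance (the Article API, in our case) is attached to it to handle the processing.

include 'vendor/autoload.php';

use Swader\Diffbot\Diffbot;

$diffbot = new Diffbot('my_token');

// Spidering: the crawljob follows links, starting from the seed URL(s).
$crawler = $diffbot->crawl('sp_search');
$crawler->setSeeds(['http://sitepoint.com']);

// Processing: every matched page is handed to the attached API engine,
// which extracts structured data (articles, in our case).
$processor = $diffbot->createArticleAPI('crawl');
$crawler->setApi($processor);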

Creating a Crawljob

Jobs can be created through Diffbot’s GUI, but I find creating them via the crawl API is a more customizable experience. In an empty folder, let’s first install the client library.

composer require swader/diffbot-php-client

I now need a job.php file into which I’ll just dump the job creation procedure, as per the README:

<?php
include 'vendor/autoload.php';

use Swader\Diffbot\Diffbot;

// Instantiate the client with your Diffbot token
$diffbot = new Diffbot('my_token');

The Diffbot instance is used to create access points to API types offered by Diffbot. In our case, a “Crawl” type is needed. Let’s name it “sp_search”.

$job = $diffbot->crawl('sp_search');

This will create a new crawljob when the call() method is called. Next, we’ll need to configure the job. First, we need to give it the seed URL(s) on which to start the spidering process:

$job
    ->setSeeds(['http://sitepoint.com'])

Then, we tell it to notify us when it’s done crawling, so we know when a crawling round is complete and can expect up-to-date information to be in the dataset.

$job
    ->setSeeds(['http://sitepoint.com'])
    ->notify('bruno.skvorc@sitepoint.com')

A site can have hundreds of thousands of links to spider, and hundreds of thousands of pages to process. The max limits are a cost-control mechanism, and in this case, I want the most detailed set possible, so I’ll set both values to one million URLs.

$job
    ->setSeeds(['http://sitepoint.com'])
    ->notify('bruno.skvorc@sitepoint.com')
    ->setMaxToCrawl(1000000)
    ->setMaxToProcess(1000000)

We also want this job to refresh every 24 hours, because we know SitePoint publishes several new posts every single day. It’s important to note that repeating means “from the time the last round has finished”: if it takes a job 24 hours to finish, the new crawling round will actually start 48 hours from the start of the previous round. We’ll set max rounds to 0 to indicate we want this to repeat indefinitely.

$job
    ->setSeeds(['http://sitepoint.com'])
    ->notify('bruno.skvorc@sitepoint.com')
    ->setMaxToCrawl(1000000)
    ->setMaxToProcess(1000000)
    ->setRepeat(1)
    ->setMaxRounds(0)

Next, there’s the page processing pattern. When Diffbot processes pages during a crawl, only the pages that are processed (not merely crawled) are actually charged and counted towards your limit. It is, therefore, in our interest to be as specific as possible with our crawljob’s definition, so as to avoid processing pages that aren’t articles, like author bios, ads, or even category listings. Looking for <section class="article_body"> should do the trick: every post has it. And of course, we want the job to only process pages it hasn’t encountered before in each new round; there’s no need to extract the same data over and over again, as that would just stack up expenses.

$job
    ->setSeeds(['http://sitepoint.com'])
    ->notify('bruno.skvorc@sitepoint.com')
    ->setMaxToCrawl(1000000)
    ->setMaxToProcess(1000000)
    ->setRepeat(1)
    ->setMaxRounds(0)
    ->setPageProcessPatterns(['<section class="article_body">'])
    ->setOnlyProcessIfNew(1)

Before finishing up with the crawljob configuration, there’s just one more important parameter we need to add: the crawl pattern. When we pass a seed URL to the Crawl API, the crawljob will traverse all of its subdomains as well. So if we pass in http://sitepoint.com, Crawlbot will also look through http://community.sitepoint.com and the now outdated http://reference.sitepoint.com. This is something we want to avoid, as it would slow our crawling process dramatically and harvest content we don’t need (we don’t want the forums indexed right now). To set this up, we use the setUrlCrawlPatterns method, indicating that crawled links must start with sitepoint.com.

$job
    ->setSeeds(['http://sitepoint.com'])
    ->notify('bruno.skvorc@sitepoint.com')
    ->setMaxToCrawl(1000000)
    ->setMaxToProcess(1000000)
    ->setRepeat(1)
    ->setMaxRounds(0)
    ->setPageProcessPatterns(['<section class="article_body">'])
    ->setOnlyProcessIfNew(1)
    ->setUrlCrawlPatterns(['^http://www.sitepoint.com', '^http://sitepoint.com'])

Now we need to tell the job which API to use for processing. We could use the default – Analyze API – which would make Diffbot auto-determine the structure of the data we’re trying to obtain, but I prefer specificity and want it to know outright that it should only produce articles.

$api = $diffbot->createArticleAPI('crawl')->setMeta(true)->setDiscussion(false);
$job->setApi($api);

Note that with the individual APIs (like Product, Article, Discussion, etc…) you can process individual resources even with the free demo token from Diffbot.com, which lets you test out your links and see what data they’ll return before diving into bulk processing via Crawlbot. For information on how to do this, see the README file.
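As a rough illustration of that single-resource workflow (the article URL below is a placeholder, and the exact return shape and entity getters are assumptions on my part; the client’s README is the authoritative reference), processing one page with the demo token might look like this:

include 'vendor/autoload.php';

use Swader\Diffbot\Diffbot;

// 'demo' is the free demo token mentioned above; swap in your own token if needed.
$diffbot = new Diffbot('demo');

// Placeholder URL, used purely for illustration.
$url = 'http://www.sitepoint.com/some-article-url';

// Process just this one page with the Article API; no crawljob involved.
$articleApi = $diffbot->createArticleAPI($url)->setMeta(true)->setDiscussion(false);

// Assumption: call() returns an iterable of Article entities exposing getters
// such as getTitle() and getAuthor(); check the README for the exact interface.
foreach ($articleApi->call() as $article) {
    echo $article->getTitle(), "\n";
    echo $article->getAuthor(), "\n";
}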

The job is now configured, and we can call() Diffbot with instructions on how to create it:

$job->call();

The full code for creating this job is:

<?php
include 'vendor/autoload.php';

use Swader\Diffbot\Diffbot;

$diffbot = new Diffbot('my_token');
$job = $diffbot->crawl('sp_search');
 
$job
    ->setSeeds(['http://sitepoint.com'])
    ->notify('bruno.skvorc@sitepoint.com')
    ->setMaxToCrawl(1000000)
    ->setMaxToProcess(1000000)
    ->setRepeat(1)
    ->setMaxRounds(0)
    ->setPageProcessPatterns(['<section class="article_body">'])
    ->setOnlyProcessIfNew(1)
    ->setApi($diffbot->createArticleAPI('crawl')->setMeta(true)->setDiscussion(false))
    ->setUrlCrawlPatterns(['^http://www.sitepoint.com', '^http://sitepoint.com']);
 
$job->call();

Calling this script via the command line (php job.php) or opening it in the browser creates the job. It can then be seen in the Crawlbot dev screen.


It’ll take a while to finish (days, actually – SitePoint is a huge place), but all subsequent rounds will be faster because we told the job to only process pages it hasn’t encountered before.
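While a round is running, you can keep an eye on the job from the Crawlbot dev screen, or from code. As a rough sketch only (whether an unconfigured call by job name fetches its state, and which getters the returned entity exposes, are my assumptions, not something confirmed here; the client’s README documents the actual interface), it might look like this:

include 'vendor/autoload.php';

use Swader\Diffbot\Diffbot;

$diffbot = new Diffbot('my_token');

// Assumption: calling the Crawl API with just an existing job's name fetches
// that job's current state rather than creating a new one (mirroring the
// underlying Crawlbot REST API).
$job = $diffbot->crawl('sp_search')->call();

// Assumption: the returned job entity exposes status-related getters
// (something like getJobStatus()); see the client's README for specifics.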

Continue reading this article on SitePoint!
