Web Scraping in Node.js

Colin Ihrig

Web scrapers are pieces of software which programmatically visit web pages and extract data from them. Web scraping is a somewhat controversial topic due to issues of content duplication, and most site owners prefer that their data be accessed through publicly available APIs. Unfortunately, many sites provide lackluster APIs, or none at all, which forces many developers to turn to web scraping. This article will teach you how to implement your own web scraper in Node.js.

The first step to web scraping is downloading source code from remote servers. In "Making HTTP Requests in Node.js," readers learned how to download pages using the request module. The following example provides a quick refresher on making GET requests in Node.js.

var request = require("request");

// Issue a GET request and print the raw HTML source of the response
request({
  uri: "http://www.sitepoint.com",
}, function(error, response, body) {
  if (error) {
    return console.error(error);
  }

  console.log(body);
});

The second, and more difficult, step to web scraping is extracting data from the downloaded source code. On the client side, this would be a trivial task using the selectors API or a library like jQuery. Unfortunately, these solutions rely on the assumption that a DOM is available for querying, and Node.js does not provide a DOM. Or does it?
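For comparison, here is roughly what that extraction step looks like in the browser, where a parsed DOM already exists. This is a minimal sketch using the standard selectors API; the h1 > a selector and page structure are assumptions made purely for illustration.

// In the browser, the page is already parsed into a DOM, so
// extraction is straightforward with the selectors API
var titles = document.querySelectorAll("h1 > a");

Array.prototype.forEach.call(titles, function(link) {
  console.log(link.textContent + " -> " + link.href);
});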

The Cheerio Module

While Node.js does not provide a built-in DOM, there are several modules which can construct a DOM from a string of HTML source code. Two popular DOM modules are cheerio and jsdom. This article focuses on cheerio, which can be installed using the following command.

npm install cheerio
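For comparison, jsdom constructs a more complete, standards-oriented DOM that is queried with the regular browser APIs. The following is a minimal sketch, assuming a recent version of jsdom and its constructor-based API.

var JSDOM = require("jsdom").JSDOM;

// jsdom parses the string into a full window/document pair
var dom = new JSDOM("<ul><li>foo</li><li>bar</li></ul>");
var list = dom.window.document.querySelector("ul");

console.log(list.innerHTML);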

The cheerio module implements a subset of jQuery, meaning that many developers will be able to pick it up quickly. In fact, cheerio is so similar to jQuery that you can easily find yourself trying to use jQuery functions that aren’t implemented in cheerio.
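As a quick illustration of that boundary, traversal and text extraction behave just as they do in jQuery, while browser-only features such as animations are simply absent. The fadeIn() check below is one arbitrary example of a missing method.

var cheerio = require("cheerio");
var $ = cheerio.load("<p>Hello</p>");

// Familiar jQuery-style selection and extraction work as expected
console.log($("p").text()); // "Hello"

// Browser-only methods are absent, since there is no rendering
// engine behind cheerio's DOM
console.log(typeof $("p").fadeIn); // "undefined"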

The following example shows how cheerio is used to parse HTML strings. The first line imports cheerio into the program. The html variable holds the HTML fragment to be parsed. The fragment is then parsed using cheerio.load(), and the result is assigned to the $ variable. The dollar sign was chosen because it is traditionally used in jQuery. Next, the <ul> element is selected using CSS-style selectors. Finally, the list's inner HTML is printed using the html() method.

var cheerio = require("cheerio");
var html = "<ul><li>foo</li><li>bar</li></ul>";

// Parse the HTML fragment and select the <ul> element
var $ = cheerio.load(html);
var list = $("ul");

// Prints the list's inner HTML: <li>foo</li><li>bar</li>
console.log(list.html());

Limitations

cheerio is under active development and getting better all the time. However, it still has a number of limitations. The most frustrating aspect of cheerio is the HTML parser. HTML parsing is a hard problem, and there are a lot of pages in the wild that contain bad HTML. While cheerio won't crash on these pages, you might find yourself unable to select elements. This can make it difficult to determine whether a bug lies in your selector or in the page itself. A simple defensive habit, shown in the sketch below, is to check that a selector actually matched something before using the result.
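The following is a minimal sketch of that check. The malformed markup and the .article selector are made up for illustration; the point is that a cheerio selection exposes a length property, just like jQuery, so an unexpected zero is often the first sign that the parser and your selector disagree about a page's structure.

var cheerio = require("cheerio");

// A made-up page with sloppy markup: an unquoted attribute value
// and an unclosed <p> tag
var $ = cheerio.load("<div class=article><p>Hello</div>");
var paragraphs = $(".article p");

// Selections have a length property, just like jQuery
if (paragraphs.length === 0) {
  console.error("Selector matched nothing - inspect the parsed HTML");
} else {
  console.log(paragraphs.first().text());
}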

Scraping JSPro

The following example combines request and cheerio to build a complete web scraper. The example scraper extracts the titles and URLs of all of the articles on the JSPro homepage. The first two lines import the required modules. The request() call then downloads the homepage's source code, which is passed to cheerio for parsing.

var request = require("request");
var cheerio = require("cheerio");

request({
  uri: "http://www.sitepoint.com",
}, function(error, response, body) {
  if (error) {
    return console.error(error);
  }

  var $ = cheerio.load(body);

  // Each article title is a link inside an <h1 class="entry-title">
  $(".entry-title > a").each(function() {
    var link = $(this);
    var text = link.text();
    var href = link.attr("href");

    console.log(text + " -> " + href);
  });
});

If you view the JSPro source code, you'll notice that every article title is a link contained in an <h1> element of class entry-title. The selector .entry-title > a matches all of the article links. The each() function is then used to loop over all of the articles. Finally, the article title and URL are taken from the link's text and href attribute, respectively.
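For reference, the markup being targeted looks roughly like the fragment below. The exact structure and URL are assumptions based on the class names above, not a copy of the live page.

<h1 class="entry-title">
  <a href="http://www.sitepoint.com/some-article/">Some Article Title</a>
</h1>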

Conclusion

This article has shown you how to create a simple web scraping program in Node.js. Please note that this is not the only way to scrape a web page. There are other techniques, such as employing a headless browser, which are more powerful, but might compromise simplicity and/or speed. Look out for an upcoming article focusing on the PhantomJS headless browser.