Web Scraping in Node.js
Web scrapers are programs that programmatically visit web pages and extract data from them. Web scraping is a somewhat controversial topic because of the content duplication issues it raises, and most website owners would rather have their data accessed through a publicly available API. Unfortunately, many sites provide lackluster APIs, or none at all, which forces developers to turn to web scraping. This article will teach you how to implement your own web scraper in Node.js.
The first step to web scraping is downloading source code from remote servers. In "Making HTTP Requests in Node.js," readers learned how to download pages using the request module. The following example provides a quick refresher on making GET requests in Node.js.
var request = require("request");
request({
  uri: "http://www.sitepoint.com",
}, function(error, response, body) {
  console.log(body);
});
The second, and more difficult, step to web scraping is extracting data from the downloaded source code. On the client side, this would be a trivial task using the Selectors API or a library like jQuery. Unfortunately, these solutions rest on the assumption that a DOM is available for querying, and Node.js provides no DOM. Or does it?
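For comparison, here is a minimal sketch of how the same kind of extraction might look in the browser, where a DOM is already available. The markup and selector are hypothetical and exist only to illustrate the contrast.

// In a browser, the document already exposes a DOM to query.
// Hypothetical markup: <ul id="articles"><li><a href="/foo">Foo</a></li></ul>
var links = document.querySelectorAll("#articles a");

Array.prototype.forEach.call(links, function(link) {
  console.log(link.textContent + " -> " + link.getAttribute("href"));
});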
The Cheerio Module
While Node.js does not provide a built-in DOM, there are several modules which can construct a DOM from a string of HTML source code. Two popular DOM modules are cheerio and jsdom. This article focuses on cheerio, which can be installed using the following command.
npm install cheerio
The cheerio module implements a subset of jQuery, meaning that many developers will be able to pick it up quickly. In fact, cheerio is so similar to jQuery that you can easily find yourself trying to use jQuery functions that aren't implemented in cheerio.
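As a rough illustration, the following sketch shows a jQuery-style call that cheerio supports alongside a browser-only method that it does not. The HTML fragment is made up for the example, and the check assumes (as is the case at the time of writing) that animation helpers such as fadeIn() are not implemented in cheerio.

var cheerio = require("cheerio");
var $ = cheerio.load("<ul><li>foo</li></ul>");

// Traversal and text extraction work much as they do in jQuery.
console.log($("li").text());        // "foo"

// Browser-dependent methods such as animations are not implemented.
console.log(typeof $("li").fadeIn); // "undefined" -- calling it would throw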
The following example shows how cheerio is used to parse HTML strings. The first line imports cheerio into the program. The html variable holds the HTML fragment to be parsed. On line 3, the HTML is parsed using cheerio. The result is assigned to the $ variable. The dollar sign was chosen because it is traditionally used in jQuery. Line 4 selects the <ul> element using CSS-style selectors. Finally, the list's inner HTML is printed using the html() method.
var cheerio = require("cheerio");
var html = "<ul><li>foo</li><li>bar</li></ul>";
var $ = cheerio.load(html);
var list = $("ul");
console.log(list.html());
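As a small extension of this example, a selection can also be traversed much as it would be in jQuery. The following sketch reuses the same HTML fragment and prints the text of each list item.

var cheerio = require("cheerio");
var html = "<ul><li>foo</li><li>bar</li></ul>";
var $ = cheerio.load(html);

// Iterate over each <li> element and print its text content.
$("li").each(function() {
  console.log($(this).text());
});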
Limitations
cheerio is under active development and getting better all the time. However, it still has a number of limitations, and the most frustrating of these is its HTML parser. HTML parsing is a hard problem, and there are a lot of pages in the wild that contain bad HTML. While cheerio won't crash on these pages, you might find yourself unable to select elements. This can make it difficult to determine whether a bug lies in your selector or in the page itself.
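One way to narrow the problem down is to print what cheerio actually parsed and test your selector against that output rather than against the original page. The following is a minimal sketch of that idea; the malformed fragment is invented for the example.

var cheerio = require("cheerio");

// A deliberately sloppy fragment with an unclosed <p> element.
var badHtml = "<div><p>unclosed paragraph<div>another</div></div>";
var $ = cheerio.load(badHtml);

// $.html() serializes the document as cheerio understands it. If a selector
// fails, comparing this output with the original source shows whether the
// parser restructured the markup.
console.log($.html());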
Scraping JSPro
The following example combines request and cheerio to build a complete web scraper. The example scraper extracts the titles and URLs of all of the articles on the JSPro homepage. The first two lines import the required modules into the example. Lines 3 through 5 download the source code of the JSPro homepage. The source is then passed to cheerio for parsing.
var request = require("request");
var cheerio = require("cheerio");
request({
  uri: "http://www.sitepoint.com",
}, function(error, response, body) {
  var $ = cheerio.load(body);
  $(".entry-title > a").each(function() {
    var link = $(this);
    var text = link.text();
    var href = link.attr("href");
    console.log(text + " -> " + href);
  });
});
If you view the JSPro source code, you'll notice that every article title is a link contained in an <h1> element of class entry-title. The selector on line 7 selects all of the article links. The each() function is then used to loop over all of the articles. Finally, the article title and URL are taken from the link's text and href attribute, respectively.
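The example above does not check for request failures. A more defensive version might verify the error argument and the HTTP status code before parsing, and skip links that lack an href attribute. The following is a minimal sketch of that idea, assuming the same page structure as the example above.

var request = require("request");
var cheerio = require("cheerio");

request({
  uri: "http://www.sitepoint.com",
}, function(error, response, body) {
  // Bail out if the request failed or the server did not return 200 OK.
  if (error || response.statusCode !== 200) {
    return console.error("Request failed:", error || response.statusCode);
  }

  var $ = cheerio.load(body);
  var articles = [];

  $(".entry-title > a").each(function() {
    var link = $(this);
    var href = link.attr("href");

    // Skip anchors without an href attribute.
    if (href) {
      articles.push({ title: link.text(), url: href });
    }
  });

  console.log(articles);
});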
Conclusion
This article has shown you how to create a simple web scraping program in Node.js. Please note that this is not the only way to scrape a web page. There are other techniques, such as employing a headless browser, which are more powerful but may sacrifice simplicity or speed. Look out for an upcoming article focusing on the PhantomJS headless browser.