Web Scraping in Node.js

Key Takeaways

Web scraping in Node.js involves two steps: downloading a page's source code from a remote server and extracting data from it. Modules like request and cheerio handle these steps.
The cheerio module, which implements a subset of jQuery, can construct a DOM from an HTML string and parse it, although it might struggle with poorly structured HTML.
A complete web scraper can be built by combining request and cheerio to extract specific elements from a webpage, but handling dynamic content, avoiding blocks, and dealing with sites that require login or use CAPTCHA can be more complex and may require additional tools or strategies.

Web scrapers are programs that visit web pages and extract data from them. Web scraping is somewhat controversial because of concerns about content duplication, and most website owners would rather have their data accessed through publicly available APIs. Unfortunately, many sites provide lackluster APIs, or none at all, which pushes many developers toward web scraping. This article will teach you how to implement your own web scraper in Node.js.

The first step in web scraping is downloading source code from remote servers. In “Making HTTP Requests in Node.js,” readers learned how to download pages using the request module. The following example provides a quick refresher on making GET requests in Node.js.



var request = require("request");
request({
  uri: "http://www.sitepoint.com",
}, function(error, response, body) {
  console.log(body);
});
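The callback above prints the body without checking whether the request succeeded. The following is a minimal sketch of a more defensive callback body. The name handleResponse is hypothetical, and treating every non-200 status as a failure is an assumption that may be too strict for some sites.

```javascript
// Hypothetical helper with the same (error, response, body) arguments
// as the callback passed to request() above.
function handleResponse(error, response, body) {
  if (error) {
    // Network-level failure: DNS lookup, refused connection, timeout, etc.
    return { ok: false, reason: "request failed: " + error.message };
  }
  if (response.statusCode !== 200) {
    // The server answered, but not with the page that was asked for.
    return { ok: false, reason: "unexpected status " + response.statusCode };
  }
  return { ok: true, body: body };
}

// Calling the helper directly with stand-in arguments (no network involved).
console.log(handleResponse(null, { statusCode: 404 }, ""));
console.log(handleResponse(null, { statusCode: 200 }, "<html></html>"));
```

Inside a real callback, you would call handleResponse(error, response, body) first and only continue to the parsing step when the result's ok field is true.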

The second, and more difficult, step to web scraping is extracting data from the downloaded source code. On the client side, this would be a trivial task using the selectors API, or a library like jQuery. Unfortunately, these solutions rely on the assumption that a DOM is available for querying. Sadly, Node.js does not provide a DOM. Or does it?

The Cheerio Module

While Node.js does not provide a built-in DOM, there are several modules that can construct a DOM from a string of HTML source code. Two popular DOM modules are cheerio and jsdom. This article focuses on cheerio, which can be installed using the following command.

npm install cheerio

The cheerio module implements a subset of jQuery, meaning that many developers will be able to pick it up quickly. In fact, cheerio is so similar to jQuery that you can easily find yourself trying to use jQuery functions that aren’t implemented in cheerio.

The following example shows how cheerio is used to parse HTML strings. The first line imports cheerio into the program. The html variable holds the HTML fragment to be parsed. On line 3, the HTML is parsed using cheerio, and the result is assigned to the $ variable. The dollar sign was chosen because it is traditionally used in jQuery. Line 4 selects the <ul> element using a CSS-style selector. Finally, the list’s inner HTML is printed using the html() method.



var cheerio = require("cheerio");
var html = "<ul><li>foo</li><li>bar</li></ul>";
var $ = cheerio.load(html);
var list = $("ul");
console.log(list.html());

Limitations

cheerio is under active development, and getting better all the time. However, it still has a number of limitations. The most frustrating aspect of cheerio is the HTML parser. HTML parsing is a hard problem, and there are a lot of pages in the wild that contain bad HTML. While cheerio won’t crash on these pages, you might find yourself unable to select elements. This can make it difficult to determine whether a bug lies in your selector or in the page itself.

Scraping JSPro

The following example combines request and cheerio to build a complete web scraper. The example scraper extracts the titles and URLs of all of the articles on the JSPro homepage. The first two lines import the required modules into the example. Lines 3 through 5 download the source code of the JSPro homepage. The source is then passed to cheerio for parsing.



var request = require("request");
var cheerio = require("cheerio");
request({
  uri: "http://www.sitepoint.com",
}, function(error, response, body) {
  var $ = cheerio.load(body);
  $(".entry-title > a").each(function() {
    var link = $(this);
    var text = link.text();
    var href = link.attr("href");
    console.log(text + " -> " + href);
  });
});

If you view the JSPro source code, you’ll notice that every article title is a link contained in an <h1> element of class entry-title. The selector on line 7 selects all of the article links. The each() function is then used to loop over all of the articles. Finally, the article title and URL are taken from the link’s text and href attribute, respectively.
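The href values read from a page are not guaranteed to be absolute URLs. As a sketch, assuming relative links should be resolved against the address that was requested, the built-in URL class can normalize them before they are printed or stored. The name toAbsolute is hypothetical, and the sample paths are made up for illustration.

```javascript
// Resolve an href taken from a page against the URL the page came from.
// Uses the WHATWG URL class, which is a global in current Node.js versions.
function toAbsolute(href, pageUrl) {
  try {
    return new URL(href, pageUrl).href;
  } catch (err) {
    // new URL() throws a TypeError when the input cannot be parsed.
    return null;
  }
}

var pageUrl = "http://www.sitepoint.com";
console.log(toAbsolute("/javascript/", pageUrl));         // http://www.sitepoint.com/javascript/
console.log(toAbsolute("http://example.com/a", pageUrl)); // http://example.com/a
```

In the scraper above, this would be applied to the value returned by link.attr("href").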

Conclusion

This article has shown you how to create a simple web scraping program in Node.js. Please note that this is not the only way to scrape a web page. There are other techniques, such as employing a headless browser, which are more powerful, but might compromise simplicity and/or speed. Look out for an upcoming article focusing on the PhantomJS headless browser.

Frequently Asked Questions (FAQs) on Web Scraping in Node.js

How can I handle dynamic content while web scraping in Node.js?

Handling dynamic content while web scraping in Node.js can be a bit tricky as the content is loaded asynchronously. You can use libraries like Puppeteer, which is a Node.js library that provides a high-level API to control Chrome or Chromium over the DevTools Protocol. Puppeteer runs headless by default but can be configured to run full (non-headless) Chrome or Chromium. This allows you to scrape dynamic content by simulating user interactions.

How can I avoid getting blocked while web scraping?

A website may block your IP address if it detects unusual traffic. To reduce that risk, you can rotate IP addresses, add delays between requests, or use a scraping API that handles these issues for you.
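As an illustration of the delay technique, the sketch below visits a list of URLs one at a time and pauses between requests. Here fetchPage is a placeholder for whatever download function you use, scrapePolitely is a hypothetical name, and the delay you pass in is up to you; no particular value is being recommended.

```javascript
// Resolve after the given number of milliseconds.
function sleep(ms) {
  return new Promise(function (resolve) {
    setTimeout(resolve, ms);
  });
}

// Visit URLs sequentially, waiting delayMs between consecutive requests.
// fetchPage is a caller-supplied function that returns a promise.
async function scrapePolitely(urls, fetchPage, delayMs) {
  var results = [];
  for (var i = 0; i < urls.length; i++) {
    if (i > 0) {
      await sleep(delayMs); // no pause is needed before the first request
    }
    results.push(await fetchPage(urls[i]));
  }
  return results;
}
```

Because the requests run strictly one after another, the total run time grows with the number of URLs. That is the trade-off for generating less traffic.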

How can I scrape data from a website that requires login?

To scrape data from a website that requires login, you can use Puppeteer. Puppeteer can simulate the login process by filling in the login form and submitting it. After logging in, you can navigate to the desired page and scrape the data.

How can I save the scraped data in a database?

After scraping the data, you can use a database client for your database of choice. For example, if you’re using MongoDB, you can use the MongoDB Node.js client to connect to your database and save the data.

How can I scrape data from a website with pagination?

To scrape data from a website with pagination, you can use a loop to navigate through the pages. In each iteration, you can scrape the data from the current page and then click the ‘next’ button to navigate to the next page.
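A minimal sketch of that loop follows, assuming each page can be reduced to a list of items plus the URL of the next page. The function fetchPage is hypothetical: you would write it on top of your own downloader and parser. The maxPages argument is an added safety limit, not something the technique requires.

```javascript
// fetchPage(url) is assumed to resolve to an object of the form
// { items: [...], nextUrl: "..." }, with nextUrl set to null on the last page.
async function scrapeAllPages(startUrl, fetchPage, maxPages) {
  var items = [];
  var url = startUrl;
  var visited = 0;
  while (url && visited < maxPages) {
    var page = await fetchPage(url);
    items = items.concat(page.items);
    url = page.nextUrl; // a null or undefined value ends the loop
    visited++;
  }
  return items;
}
```

With a headless browser, fetchPage would instead click the ‘next’ button and report whether another page exists. The shape of the loop stays the same.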

How can I scrape data from a website with infinite scrolling?

To scrape data from a website with infinite scrolling, you can use Puppeteer to simulate the scroll down action. You can use a loop to keep scrolling down until no more new data is loaded.

How can I handle errors while web scraping?

Error handling is crucial in web scraping. You can use try-catch blocks to handle errors. In the catch block, you can log the error message, which will help you debug the issue.
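A try-catch block only sees errors thrown inside it, so with promise-based code the call has to be awaited within the block. The sketch below shows one way to do that. Both safeScrape and scrapeFn are hypothetical names, and returning null on failure is an assumption about what the caller wants.

```javascript
// Run a promise-returning scrape function and log failures instead of crashing.
// scrapeFn is a placeholder for your own scraping function.
async function safeScrape(url, scrapeFn) {
  try {
    // The await matters: without it, a rejected promise escapes the try block.
    return await scrapeFn(url);
  } catch (err) {
    console.error("Failed to scrape " + url + ": " + err.message);
    return null;
  }
}
```

Callers can then skip null results and carry on with the remaining URLs.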

How can I scrape data from a website that uses AJAX?

To scrape data from a website that uses AJAX, you can use Puppeteer. Puppeteer can wait for AJAX calls to finish and then scrape the data.

How can I speed up web scraping in Node.js?

To speed up web scraping, you can use techniques like parallelism where you open multiple pages in different tabs and scrape data from them simultaneously. However, be careful not to overload the website with too many requests as it may lead to your IP getting blocked.
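One way to get parallelism without flooding a site is to cap the number of requests in flight. The following sketch is a generic concurrency limiter, not tied to any particular library. The worker function stands in for your own page-scraping code, and scrapeInParallel and limit are hypothetical names.

```javascript
// Run worker(url) for every URL, with at most `limit` calls in flight at once.
// Results are returned in the same order as the input URLs.
async function scrapeInParallel(urls, worker, limit) {
  var results = new Array(urls.length);
  var next = 0;

  // Each runner repeatedly claims the next unprocessed URL until none are left.
  async function runner() {
    while (next < urls.length) {
      var index = next++; // no lock needed: JavaScript runs one callback at a time
      results[index] = await worker(urls[index]);
    }
  }

  var runners = [];
  for (var i = 0; i < Math.min(limit, urls.length); i++) {
    runners.push(runner());
  }
  await Promise.all(runners);
  return results;
}
```

If any worker call rejects, the returned promise rejects as well, so in practice the worker should handle its own errors.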

How can I scrape data from a website that uses CAPTCHA?

Scraping data from a website that uses CAPTCHA can be challenging. You can use services like 2Captcha that provide an API to solve the CAPTCHA. However, keep in mind that this may not be legal or ethical in some cases. Always respect the website’s terms of service.