Web Scraping for Beginners
With the eCommerce boom, I have become a fan of price comparison apps in recent years. Each purchase I make online (or even offline) is the result of a thorough investigation across sites offering the product.
Some of the apps I use include RedLaser, ShopSavvy and BuyHatke, which have been doing great work in increasing transparency and saving the time of consumers.
Have you ever wondered how these apps get that important data? In most cases, the process employed by the apps is web scraping.
Web Scraping Defined
Web scraping is the process of extracting data from the web. With the right tools, anything that’s visible to you can be extracted. In this post, we’ll focus on writing programs that automate this process and help you gather huge amounts of data in a relatively short time. Apart from the example I’ve already given, scraping has a lot of uses like SEO tracking, job tracking, news analysis, and (my favorite) sentiment analysis on social media!
A note of caution
Before you go on a web scraping adventure, make sure you’re aware of the legal issues involved. Many websites specifically prohibit scraping in their terms of service. For example, to quote Medium, “Crawling the Services is allowed if done in accordance with the provisions of our robots.txt file, but scraping the Services is prohibited.” Scraping sites that do not allow it might actually get you blacklisted from them! Just like any other tool, web scraping can be misused, for example to copy the content of other sites. Scraping has led to many lawsuits too.
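Since the quote above refers to a robots.txt file, here is a minimal sketch of checking such rules before requesting a page, using Python 3's standard urllib.robotparser module. The robots.txt content, the user agent name my-scraper and the example.com URLs are all made up for illustration, and passing this check says nothing about a site's terms of service.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(ROBOTS_TXT, "my-scraper", "http://example.com/blog/"))
print(is_allowed(ROBOTS_TXT, "my-scraper", "http://example.com/private/data"))
```

For a live site, you would point the parser at the site's own robots.txt with set_url() and read() instead of parsing an inline string.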
Setting Up the Code
Now that you know that we must tread carefully, let’s get into scraping. Scraping can be done in any programming language, and we covered it for Node some time back. In this post, we’re going to use Python for the simplicity of the language and the availability of packages that make the process easy.
What’s the underlying process?
When you’re accessing a site on the Internet, you’re essentially downloading HTML code, which is analyzed and displayed by your web browser. This HTML code contains all the information that’s visible to you. Therefore, the required information (like the price) can be obtained by analyzing this HTML code. You can use regular expressions to search for your needle in the haystack, or use a library to parse the HTML and get the required data.
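To make the regular expression route concrete, here is a small sketch in Python 3 that pulls a price out of a fragment of HTML. The markup, the price class name and the extract_price helper are assumptions made up for this example; real pages will be structured differently, and a pattern like this breaks as soon as the markup changes.

```python
import re

# Hypothetical markup; a real product page will look different.
HTML = '<div class="product"><span class="price">$19.99</span></div>'

# Matches a dollar amount inside <span class="price">...</span>.
PRICE_PATTERN = re.compile(
    r'<span class="price">\s*\$([0-9]+(?:\.[0-9]{2})?)\s*</span>'
)


def extract_price(html):
    """Return the first price found in html as a float, or None."""
    match = PRICE_PATTERN.search(html)
    return float(match.group(1)) if match else None


print(extract_price(HTML))
```

That fragility is the main reason to prefer an HTML parser for anything beyond a quick one-off.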
In Python, we’re going to use a module called Beautiful Soup to analyze this HTML data. You can install the module through an installer like pip by running the following command:
pip install beautifulsoup4
Alternatively, you can build it from the source. The installation steps are listed on the module’s documentation page.
After getting that installed, we’ll broadly follow these steps:
- send a request to the URL
- receive the response
- analyze the response to find the required data.
For demonstration purposes, we’ll use my blog, Transcendental Tech Talk.
The first two steps are fairly simple, and can be accomplished as follows:
# Python 2 import; on Python 3, use: from urllib.request import urlopen
from urllib import urlopen

# Sending the HTTP request
webpage = urlopen('http://my_website.com/').read()
Next, we need to provide the response to BeautifulSoup:
from bs4 import BeautifulSoup

# Making the soup! yummy ;)
soup = BeautifulSoup(webpage, "html5lib")
Notice that we used html5lib as our parser. It is a separate package (pip install html5lib). You may install a different parser for BeautifulSoup as mentioned in their documentation.
Parsing the HTML
Now that we’ve supplied the HTML to BeautifulSoup, let’s check out a few commands. To check that we have the correct HTML markup, let’s verify the title of the page (on the Python interpreter):
>>> soup.title
<title>Transcendental Tech Talk</title>
>>> soup.title.text
u'Transcendental Tech Talk'
>>>
Next, we move on to extracting specific elements from the page. Let’s say I want to extract the list of titles of posts on my blog. To do so, I would need to analyze the HTML structure, which I accomplish through the Chrome Inspector (Right click on an item and select “Inspect Element”). Similar tools are available in other browsers too.
Using the Chrome Inspector to check the HTML structure of a page
As you can observe, all titles are housed in h3 tags with two classes: post-title and entry-title. Searching for all h3 elements with the class post-title should get me the list of titles on the page. We use the find_all function of BeautifulSoup, with the class_ argument to specify our class:
>>> titles = soup.find_all('h3', class_='post-title')  # Getting all titles
>>> titles[0].text  # find_all returns a list-like ResultSet, so index into it
u'\nKolkata #BergerXP IndiBlogger meet, Marketing Insights, and some Blogging Tips\n'
>>>
The same result could be achieved by searching for items with the class post-title, without specifying the tag:
>>> titles = soup.find_all(class_='post-title')  # Getting all items with class post-title
>>> titles[0].text
u'\nKolkata #BergerXP IndiBlogger meet, Marketing Insights, and some Blogging Tips\n'
>>>
In case you’re interested in the links to further explore the items, you can run the following:
>>> for title in titles:
...     # Each title is in the form of <h3 ...><a href=...>Post Title</a></h3>
...     print(title.find("a").get("href"))
...
http://dada.theblogbowl.in/2015/09/kolkata-bergerxp-indiblogger-meet.html
http://dada.theblogbowl.in/2015/09/i-got-published.html
http://dada.theblogbowl.in/2014/12/how-to-use-requestput-or-requestdelete.html
http://dada.theblogbowl.in/2014/12/zico-isl-and-atk.html
...
>>>
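If you’re curious what a parser does for you here, the same title-and-link extraction can be sketched with html.parser from the Python 3 standard library instead of BeautifulSoup. This is a swapped-in technique for illustration, not how BeautifulSoup works internally; the TitleLinkParser class and the sample markup (with example.com URLs) are made up for this sketch.

```python
from html.parser import HTMLParser


class TitleLinkParser(HTMLParser):
    """Collect (text, href) for each <a> inside an <h3> with class post-title."""

    def __init__(self):
        super().__init__()
        self.in_title = False  # True while inside a matching <h3>
        self.href = None       # href of the <a> currently open, if any
        self.text = []         # text fragments of the current link
        self.links = []        # collected (text, href) pairs

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "h3" and "post-title" in classes:
            self.in_title = True
        elif tag == "a" and self.in_title:
            self.href = attrs.get("href")
            self.text = []

    def handle_data(self, data):
        if self.href is not None:
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.href is not None:
            self.links.append(("".join(self.text).strip(), self.href))
            self.href = None
        elif tag == "h3":
            self.in_title = False


# Hypothetical markup modeled on the structure described above.
SAMPLE = """
<h3 class="post-title entry-title">
<a href="http://example.com/first.html">First post</a>
</h3>
<h3 class="other"><a href="http://example.com/skip.html">Skip me</a></h3>
"""

parser = TitleLinkParser()
parser.feed(SAMPLE)
print(parser.links)
```

The bookkeeping needed for even this simple case shows why a library with find_all is the more comfortable choice.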
There are many built-in methods in BeautifulSoup to navigate through the HTML, some of them illustrated below:
>>> titles[0].contents
[u'\n', <a href="http://dada.theblogbowl.in/2015/09/kolkata-bergerxp-indiblogger-meet.html">Kolkata #BergerXP IndiBlogger meet, Marketing Insights, and some Blogging Tips</a>, u'\n']
>>>
Do note that you may also use the children attribute, which gives you the same items as contents but as an iterator rather than a list. To move up the tree instead, use parent:
>>> titles[0].parent
<div class="post hentry uncustomized-post-template">\n<a name="6501973351448547458"></a>\n<h3 class="post-title entry-title">\n<a href="http://dada.theblogbowl.in/2015/09/kolkata-bergerxp-indiblogger-meet.html">Kolkata #BergerXP IndiBlogger ...
>>>
You can use regular expressions to search for the CSS class too, as explained in the documentation.
Emulate Logins Using Mechanize
What we’ve done up to now is essentially download a page and analyze its contents. However, a web developer may have blocked requests from non-browser clients, or a part of a website might only be accessible after a login. How should we go about the process then?
In the first case, we need to emulate a browser when we’re sending a request to a page. Every HTTP request has a number of associated headers that include information about things like the visitor’s browser, operating system and preferred language. We can manipulate these headers and make it look like a browser is sending the request.
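As a rough illustration of the header idea on its own, the sketch below builds a request with browser-like headers using Python 3's standard urllib.request, without sending it. The header values, the build_request helper and the example.com URL are placeholders chosen for this example.

```python
from urllib.request import Request

# Illustrative header values; copy real ones from your own browser if needed.
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0",
    "Accept-Language": "en-US,en;q=0.8",
}


def build_request(url):
    """Return a Request for url that carries browser-like headers."""
    return Request(url, headers=BROWSER_HEADERS)


request = build_request("http://example.com/")
# Request normalizes header names, so "User-Agent" is stored as "User-agent".
print(request.get_header("User-agent"))
```

Passing the resulting object to urllib.request.urlopen() would send the request with these headers attached.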
In the second case, we need to log in to the website and maintain the session using cookies in order to access restricted areas. Let’s see how to do this while also emulating a browser.
We’ll use the module cookielib for managing our session using cookies. Further, we’ll use mechanize, which can be installed through an installer like pip.
import mechanize
import cookielib
from bs4 import BeautifulSoup

# Cookie jar to hold the session cookies
cj = cookielib.LWPCookieJar()
browser = mechanize.Browser()
browser.set_cookiejar(cj)
browser.set_handle_robots(False)
browser.set_handle_redirect(True)

# Solving issue #1: emulating a browser by adding HTTP headers
browser.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]

# Open the login page
browser.open("http://theblogbowl.in/login/")

# Select the login form (1st form on the page)
browser.select_form(nr=0)
# Alternate syntax: browser.select_form(name="form_name")

# The first <input> tag of the form is a CSRF token
# Setting the 2nd and 3rd tags to email and password
browser.form.set_value("email@example.com", nr=1)
browser.form.set_value("password", nr=2)

# Logging in
response = browser.submit()

# Opening a new page after login
soup = BeautifulSoup(browser.open('http://theblogbowl.in/notifications/').read(), "html5lib")
Structure of the notifications page
# Print notifications
print(soup.find(class_="search_results").text)
Results of logging in to the notification page
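The session handling above comes down to a cookie jar that stores whatever cookies the server sets and sends them back on later requests. As a rough standard-library analogue in Python 3 (where cookielib is named http.cookiejar), the sketch below wires a cookie jar into an opener and prepares a login POST without sending it. The make_session and login_request helpers, the form field names email and password, and the example.com URL are assumptions for illustration; a real form protected by a CSRF token would also need that token fetched from the login page first.

```python
from http.cookiejar import CookieJar
from urllib.parse import urlencode
from urllib.request import HTTPCookieProcessor, Request, build_opener


def make_session(user_agent):
    """Return (opener, jar); the opener stores and resends cookies via jar."""
    jar = CookieJar()
    opener = build_opener(HTTPCookieProcessor(jar))
    opener.addheaders = [("User-agent", user_agent)]
    return opener, jar


def login_request(url, email, password):
    """Build (but do not send) a form-encoded POST request."""
    body = urlencode({"email": email, "password": password}).encode("utf-8")
    return Request(url, data=body)


opener, jar = make_session("Mozilla/5.0")
request = login_request("http://example.com/login/", "user@example.com", "secret")
# opener.open(request) would send the POST; cookies set by the response land
# in jar and are sent automatically with later opener.open(...) calls.
print(request.get_method(), len(jar))
```

mechanize adds form discovery and filling on top of this, which is why the post uses it rather than hand-building the POST body.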
As many developers will tell you, anything you can view online can be scraped. With this post, you know that something behind a login can also easily be extracted. In cases where your IP gets blocked, you may mask your IP address (or use a different one). To make it look like a human is accessing the data, you may maintain a time lag between your requests too.
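The time lag between requests can be sketched as a small helper that sleeps for a random interval before every call except the first. The polite_fetch name, its parameters and the delay values are assumptions for this example; fetch stands in for whatever function you use to download a page.

```python
import random
import time


def polite_fetch(urls, fetch, min_delay=1.0, max_delay=3.0):
    """Call fetch(url) for each URL, pausing a random interval between calls."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # Randomized pause so requests do not arrive at a fixed rhythm.
            time.sleep(random.uniform(min_delay, max_delay))
        results.append(fetch(url))
    return results


# Demonstration with a stand-in fetch function and very short delays.
pages = polite_fetch(
    ["http://example.com/1", "http://example.com/2"],
    fetch=lambda url: "contents of " + url,
    min_delay=0.01,
    max_delay=0.02,
)
print(pages)
```

Delays of a second or more are kinder to the server than the tiny values used in the demonstration.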
With the increasing need for data, web scraping (for both good and bad reasons) is only going to increase in the future. It is thus advisable that you understand the process either to use it effectively or save yourself from it!