Web Scraping for Beginners

By Shaumik Daityari

With the eCommerce boom, I have become a fan of price comparison apps in recent years. Each purchase I make online (or even offline) is the result of a thorough investigation across sites offering the product.

Some of the apps I use include RedLaser, ShopSavvy and BuyHatke, which have been doing great work in increasing transparency and saving the time of consumers.

Have you ever wondered how these apps get that important data? In most cases, the process employed by the apps is web scraping.

Web Scraping Defined

Web scraping is the process of extracting data from the web. With the right tools, anything that’s visible to you can be extracted. In this post, we’ll focus on writing programs that automate this process and help you gather huge amounts of data in a relatively short time. Apart from the price comparison example I’ve already given, scraping has many other uses, like SEO tracking, job tracking, news analysis, and (my favorite) sentiment analysis on social media!

A note of caution

Before you go on a web scraping adventure, make sure you’re aware of the legal issues involved. Many websites specifically prohibit scraping in their terms of service. For example, to quote Medium, “Crawling the Services is allowed if done in accordance with the provisions of our robots.txt file, but scraping the Services is prohibited.” Scraping sites that do not allow scraping might actually get you blacklisted from them! Just like any other tool, web scraping can be misused, for example to copy the content of other sites. Scraping has led to many lawsuits too.
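
One practical way to respect a site's robots.txt is to check it before requesting a page. The following is a minimal Python 3 sketch using the standard library's urllib.robotparser. The robots.txt content, the example.com URLs and the "my-scraper" user agent name are all made up for illustration, and passing this check says nothing about what a site's terms of service allow.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. A real one would be downloaded from the
# site itself, e.g. with RobotFileParser.set_url(...) followed by .read().
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

def build_parser(robots_txt):
    """Parse robots.txt text without touching the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

def is_allowed(parser, url, user_agent="my-scraper"):
    """Return True if the parsed rules permit this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)

parser = build_parser(ROBOTS_TXT)
print(is_allowed(parser, "http://example.com/posts/1"))       # allowed path
print(is_allowed(parser, "http://example.com/private/data"))  # disallowed path
```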

Setting Up the Code

Now that you know that we must tread carefully, let’s get into scraping. Scraping can be done in any programming language, and we covered it for Node some time back. In this post, we’re going to use Python for the simplicity of the language and the availability of packages that make the process easy.

What’s the underlying process?

When you’re accessing a site on the Internet, you’re essentially downloading HTML code, which is analyzed and displayed by your web browser. This HTML code contains all the information that’s visible to you. Therefore, the required information (like the price) can be obtained by analyzing this HTML code. You can use regular expressions to search for your needle in the haystack, or use a library to parse the HTML and get the required data.
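
To make the regular expression route concrete, here is a small Python 3 sketch. The HTML fragment and the "price" class name are invented for the example; a real page will have its own markup, and a pattern like this breaks as soon as that markup changes, which is a good reason to prefer a parser.

```python
import re

# Hypothetical HTML fragment standing in for a downloaded product page.
html = '<div class="product"><h2>Blue Kettle</h2><span class="price">$24.99</span></div>'

# Capture the text inside the first <span class="price"> element.
PRICE_PATTERN = re.compile(r'<span class="price">\s*([^<]+?)\s*</span>')

def extract_price(page):
    """Return the price text, or None if the pattern does not match."""
    match = PRICE_PATTERN.search(page)
    return match.group(1) if match else None

print(extract_price(html))  # prints $24.99
```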

In Python, we’re going to use a module called Beautiful Soup to analyze this HTML data. You can install the module through an installer like pip by running the following command:

pip install beautifulsoup4

Alternatively, you can build it from the source. The installation steps are listed on the module’s documentation page.

Once that’s installed, we’ll broadly follow these steps:

  • send a request to URL
  • receive the response
  • analyze the response to find required data.

For demonstration purposes, we’ll use my blog http://dada.theblogbowl.in/.

The first two steps are fairly simple, and can be accomplished as follows:

from urllib import urlopen

#Sending the http request
webpage = urlopen('http://my_website.com/').read()
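
The snippet above is written for Python 2. In Python 3, urlopen lives in urllib.request and read() returns bytes, so a rough equivalent might look like the sketch below. The fetch_html name, the opener parameter and the UTF-8 default are assumptions made for this example rather than anything the original code defines.

```python
from urllib.request import urlopen

def fetch_html(url, opener=urlopen, encoding="utf-8"):
    """Download a page and return its HTML as text.

    `opener` can be swapped out so the function can be tried without
    network access; by default it is urllib.request.urlopen.
    """
    response = opener(url)
    try:
        # Undecodable bytes are replaced rather than raising an error.
        return response.read().decode(encoding, errors="replace")
    finally:
        response.close()

# Usage (requires network access):
# webpage = fetch_html("http://dada.theblogbowl.in/")
```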

Next, we need to provide the response to BeautifulSoup:

from bs4 import BeautifulSoup
#making the soup! yummy ;)
soup = BeautifulSoup(webpage, "html5lib")

Notice that we used html5lib as our parser. It’s a separate package, which you can install with pip install html5lib. You may use a different parser with BeautifulSoup, as mentioned in their documentation.

Parsing the HTML

Now that we’ve supplied the HTML to BeautifulSoup, let’s check out a few commands. To check that we have the correct HTML markup, let’s verify the title of the page (on the Python interpreter):

>>> soup.title
<title>Transcendental  Tech Talk</title>
>>> soup.title.text
u'Transcendental  Tech Talk'
>>>

Next, we move on to extracting specific elements from the page. Let’s say I want to extract the list of titles of posts on my blog. To do so, I would need to analyze the HTML structure, which I accomplish through the Chrome Inspector (Right click on an item and select “Inspect Element”). Similar tools are available in other browsers too.

Using the Chrome Inspector to check the HTML structure of a page

As you can observe, all titles are housed under the h3 tags, with two classes — post-title and entry-title. Searching for all h3 elements with the class post-title should get me the list of titles on the page. We use the find_all function of BeautifulSoup and use the class_ argument to specify our class:

>>> titles = soup.find_all('h3', class_ = 'post-title') #Getting all titles
>>> titles[0].text
u'\nKolkata #BergerXP IndiBlogger meet, Marketing Insights, and some Blogging Tips\n'
>>>

The same result could be achieved by searching for items with the class post-title:

>>> titles = soup.find_all(class_ = 'post-title') #Getting all items with class post-title
>>> titles[0].text
u'\nKolkata #BergerXP IndiBlogger meet, Marketing Insights, and some Blogging Tips\n'
>>>

In case you’re interested in the links to further explore the items, you can run the following:

>>> for title in titles:
...     # Each title is in the form of <h3 ...><a href=...>Post Title</a></h3>
...     print title.find("a").get("href")
...
http://dada.theblogbowl.in/2015/09/kolkata-bergerxp-indiblogger-meet.html
http://dada.theblogbowl.in/2015/09/i-got-published.html
http://dada.theblogbowl.in/2014/12/how-to-use-requestput-or-requestdelete.html
http://dada.theblogbowl.in/2014/12/zico-isl-and-atk.html
...
>>>
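
If you're curious what this kind of lookup involves, or Beautiful Soup isn't available, the same idea can be approximated with Python 3's built-in html.parser. This is a different technique from the find_all call above, not how Beautiful Soup works internally. The PostLinkParser class and the sample markup are made up for illustration, modelled on the h3 and post-title structure described earlier.

```python
from html.parser import HTMLParser

class PostLinkParser(HTMLParser):
    """Collect href values of <a> tags nested inside <h3 class="... post-title ...">."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "h3" and "post-title" in (attrs.get("class") or "").split():
            self.in_title = True
        elif tag == "a" and self.in_title and attrs.get("href"):
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "h3":
            self.in_title = False

# Hypothetical markup; only links inside post-title headings should be kept.
sample = """
<h3 class="post-title entry-title"><a href="http://example.com/first.html">First</a></h3>
<p><a href="http://example.com/ignored.html">Not a title</a></p>
<h3 class="post-title entry-title"><a href="http://example.com/second.html">Second</a></h3>
"""

parser = PostLinkParser()
parser.feed(sample)
print(parser.links)
```

It takes noticeably more code than the Beautiful Soup version, which is a fair argument for using the library.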

There are many built-in methods in BeautifulSoup to navigate through the HTML, some of them illustrated below:

>>> titles[0].contents
[u'\n', <a href="http://dada.theblogbowl.in/2015/09/kolkata-bergerxp-indiblogger-meet.html">Kolkata #BergerXP IndiBlogger meet, Marketing Insights, and some Blogging Tips</a>, u'\n']
>>>

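Do note that you may also use the children attribute, but it gives you an iterator rather than a list. To move in the other direction, the parent attribute returns the enclosing element: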

>>> titles[0].parent
<div class="post hentry uncustomized-post-template">\n<a name="6501973351448547458"></a>\n<h3 class="post-title entry-title">\n<a href="http://dada.theblogbowl.in/2015/09/kolkata-bergerxp-indiblogger-meet.html">Kolkata #BergerXP IndiBlogger ...
>>>

You can use regular expressions to search for the CSS class too, as explained in the documentation.

Emulate Logins Using Mechanize

What we’ve done up to now is essentially download a page and analyze its contents. However, a web developer may have blocked requests through non-browsers, or a part of a website might only be accessible after a login. How should we go about the process then?

In the first case, we need to emulate a browser when we’re sending a request to a page. Every HTTP request has a number of associated headers that include information about things like the visitor’s browser, operating system and screen size. We can manipulate that and make it look like a browser is sending the request.
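
As a rough illustration of the header idea, here is a Python 3 sketch using only the standard library. The browser_like_request helper and the example.com URL are assumptions for this example; the User-Agent string is the same one used in the mechanize code later in this post.

```python
from urllib.request import Request

# A browser-like User-Agent string (any realistic value would do).
USER_AGENT = ("Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) "
              "Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1")

def browser_like_request(url, user_agent=USER_AGENT):
    """Build a request whose User-Agent header mimics a browser."""
    return Request(url, headers={"User-Agent": user_agent})

request = browser_like_request("http://example.com/")
# urllib stores header names capitalised, so the key is "User-agent".
print(request.get_header("User-agent"))

# To actually send it (requires network access):
# from urllib.request import urlopen
# html = urlopen(request).read()
```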

In the second case, we need to log in to the website and maintain the session using cookies in order to access restricted areas. Let’s see how to do this while also emulating a browser.

We’ll use the module cookielib for managing our session using cookies. Further, we’ll use mechanize, which can be installed through an installer like pip.
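
A note for Python 3 users: cookielib was renamed to http.cookiejar there. As a sketch of the cookie handling alone, without mechanize, the standard library can build an opener that remembers cookies between requests. The make_session_opener helper and the short "Mozilla/5.0" default are hypothetical names and values chosen for this example.

```python
from http.cookiejar import CookieJar
from urllib.request import HTTPCookieProcessor, build_opener

def make_session_opener(user_agent="Mozilla/5.0"):
    """Return (opener, cookie_jar); the opener stores and resends cookies."""
    cookie_jar = CookieJar()
    opener = build_opener(HTTPCookieProcessor(cookie_jar))
    # Replace the default Python-urllib User-agent header.
    opener.addheaders = [("User-agent", user_agent)]
    return opener, cookie_jar

opener, cookie_jar = make_session_opener()
print(len(cookie_jar))  # the jar stays empty until a response sets a cookie

# Usage (requires network access):
# opener.open("http://example.com/login/")   # cookies set here are stored
# opener.open("http://example.com/account/") # ...and sent back here
```

This only covers the session part. Filling in and submitting a login form is the work mechanize does for us in the code that follows.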

We’ll log in through this page on The Blog Bowl, and then access our notifications page. The code is explained inline through the comments:

import mechanize
import cookielib

from urllib import urlopen
from bs4 import BeautifulSoup

# Cookie Jar
cj = cookielib.LWPCookieJar()

browser = mechanize.Browser()
browser.set_cookiejar(cj)
browser.set_handle_robots(False)
browser.set_handle_redirect(True)

# Solving issue #1 by emulating a browser by adding HTTP headers
browser.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]

# Open Login Page
browser.open("http://theblogbowl.in/login/")

# Select Login form (1st form of the page)
browser.select_form(nr = 0)
# Alternate syntax - browser.select_form(name = "form_name")

# The first <input> tag of the form is a CSRF token
# Setting the 2nd and 3rd tags to email and password
browser.form.set_value("email@example.com", nr=1)
browser.form.set_value("password", nr=2)

# Logging in
response = browser.submit()

# Opening new page after login
soup = BeautifulSoup(browser.open('http://theblogbowl.in/notifications/').read(), "html5lib")

Structure of the notifications page

# Print notifications
print soup.find(class_ = "search_results").text

Results of logging in to the notification page

Final Words

As many developers will tell you, anything you can view online can be scraped. With this post, you know that content behind a login can also be extracted easily. If your IP gets blocked, you may mask your IP address (or use a different one). To make it look like a human is accessing the data, you may also maintain a time lag between your requests.
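
The time lag can be as simple as sleeping for a slightly random interval between requests. Below is a minimal Python 3 sketch. The fetch_all name, its parameters and the two-to-three second default pause are assumptions for illustration, and fetch is whatever download function you already have.

```python
import random
import time

def fetch_all(urls, fetch, sleep=time.sleep, base=2.0, jitter=1.0):
    """Call fetch(url) for each URL, pausing between consecutive requests.

    Each pause lasts `base` seconds plus a random extra of up to `jitter`
    seconds, so requests do not arrive at perfectly regular intervals.
    """
    results = []
    for index, url in enumerate(urls):
        if index:  # no pause before the first request
            sleep(base + jitter * random.random())
        results.append(fetch(url))
    return results

# Usage (requires network access and a fetch function of your own):
# pages = fetch_all(["http://example.com/a", "http://example.com/b"], fetch=my_fetch)
```

Besides looking less like a bot, spacing out requests also keeps the load on the target server low.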

With the increasing need for data, web scraping (for both good and bad reasons) is only going to increase in the future. It is thus advisable that you understand the process either to use it effectively or save yourself from it!

  • chronicler_Isiah

    The whole web scraping thing is in my opinion bad, unless a site gives express permission that scraping content is ok.

    Despite the article’s attempt at covering itself against scraping sites illegally, most turds who scrape website content probably never read a site’s browsing policy, follow the instruction of a robot file, or if they do, don’t give a damn and scrape anyway.

    Plus, I think you are being irresponsible in actively sharing code for both scraping at a basic level, and, if I understand it correctly, demonstrating methods to get around an honest site’s attempt to make their content more secure.

    Let’s face it, most scrapers are scraping for all the wrong reasons and webmasters spend a lot of time and effort to quite rightly protect their content from such dubious activities.

    I’m frankly disappointed that this content has appeared here. My opinion of course, but I hope one shared by other webmasters/content owners too.

    Surely the responsible approach would not be instructing the ‘how to’ but the ‘how to defend against’. Maybe that will form the basis of another article?

    • Hi,

      Thanks for the detailed comment.

      Frankly, I don’t think it is irresponsible to share code that scrapes a website. If someone is smart enough to figure out how to change this code to scrape someone else’s website, they would also be smart enough to Google about these tools and build this code themselves – the code that I have shared has very simple logic, actually.

      I have already written earlier about how to tweak .htaccess in Apache to prevent scrapers – https://www.sitepoint.com/using-htaccess-prevent-web-scraping/

      • chronicler_Isiah

        You are totally missing the point my friend. If sites were willing to share their expensive to produce content – then you would have no need to try and circumvent log in areas to get to it.

        I stand by my point about the total editorial irresponsibility of allowing this kind of article on this site.

        • One use case for using logins to scrape information is by financial apps such as these – https://play.google.com/store/apps/details?id=com.msf.abm.mobile&hl=en

          Agreed there are apps that read your SMS to provide your financial analysis, but a comprehensive solution such as MyUniverse does it by using your login information to scrape the data.

        • Abdul Azeem Khan

          Another case in point: what if I am a legit user of a website and have valid login creds? There is data that I am entitled to get anyway, but I want to automate it through scraping because the company that owns the website doesn’t have the resources to make an API. Wouldn’t scraping in that case be justified? So I don’t think the author is missing the point. It’s like a knife: you can use it to spread butter or hurt someone. It’s just a tool.

          • chronicler_Isiah

            I don’t see that argument at all.

            You are not entitled to scrape data just because you have a log in. Content is always owned by the site owner – not the person logging in.

          • Abdul Azeem Khan

            chronicler_Isiah, I suppose you are right when you say the data belongs to the site owner. However, if the owner does not care if someone scrapes their data, then I wouldn’t consider scraping irresponsible.

          • chronicler_Isiah

            Yet it still counts as theft of copyright material.