Using .htaccess to Prevent Web Scraping

Shaumik Daityari

Web scraping, also known as content scraping, data scraping, web harvesting, or web data extraction, is a way of extracting data from websites, typically using a program that sends a number of HTTP requests, emulates human behaviour, reads the responses, and extracts the required data from them. Modern GUI-based web scrapers like Kimono enable you to perform this task without any programming knowledge.

If others are scraping content from one of your websites, there are many ways of detecting the scrapers; Google Webmaster Tools and Feedburner are two tools that can help.

In this article, we will discuss a few ways to make the lives of these scrapers difficult, using .htaccess files in Apache.

An .htaccess (hypertext access) file is a plain text configuration file for web servers that overrides the global server settings for the directory in which it is placed and for its subdirectories. Such files can be used in inventive ways to prevent web scraping.

Before we discuss the specific methods, let me clear up one small fact: If something is publicly available, it can be scraped. The steps that we discuss here can only make things more difficult, not impossible. However, what would you do if someone is smart enough to bypass all your filters? We have a solution for that too.

Getting Started with .htaccess

Because Apache has to look for and read .htaccess files on every request when they are enabled, support for them is generally turned off by default. The process for enabling it differs between Ubuntu, OS X and Windows. Apache interprets your .htaccess files only after you enable them; otherwise they are simply ignored.
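
As a rough sketch, enabling .htaccess processing usually comes down to an AllowOverride directive in the main server configuration (not in .htaccess itself). The directory path below is an assumption, so substitute your own document root, and reload Apache after changing the server configuration.

```apache
# Main server config, e.g. httpd.conf or apache2.conf (the path is a placeholder)
<Directory "/var/www/html">
    # Let .htaccess files in this tree override server settings.
    # "All" is the broadest choice; a narrower list such as
    # "FileInfo AuthConfig Limit" may be enough for the rules in this article.
    AllowOverride All
</Directory>
```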

Next, in most of our use cases, we will be using the RewriteEngine of Apache, which is a part of the mod_rewrite module. If necessary, you could check out a detailed guide on how to set up mod_rewrite for Apache or a general guide on .htaccess.

Once you have completed both steps, you are ready to proceed with the solutions discussed here for dealing with content scrapers. If either step is incomplete, Apache will either ignore your .htaccess files (when overrides are disabled) or return an error (when mod_rewrite is not loaded).
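
One way to avoid the error case is to wrap rewrite directives in an IfModule block, so that they are skipped rather than rejected when mod_rewrite is not loaded. This is a sketch of the wrapper only; the trade-off is that the rules inside then fail silently.

```apache
<IfModule mod_rewrite.c>
    RewriteEngine On
    # Rewrite rules from the sections below go here
</IfModule>
```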

Prevent Hotlinking

If someone scrapes your content, all your inline HTML remains the same. This means that the links to the images that were part of your content (and most probably hosted on your domain) remain the same. If the scraper wishes to put the content on a different website, the image would still link back to the original source. This is called hotlinking. Hotlinking costs you bandwidth because every time someone opens the scraper’s site, your image is downloaded.

You can prevent hotlinking by adding the following lines to your .htaccess file.

RewriteEngine on
RewriteCond %{HTTP_REFERER} !^$

# domains that can link to your content (images here)
RewriteCond %{HTTP_REFERER} !^http(s)?://(www\.)?mysite\.com [NC]

# show no image when hotlinked
RewriteRule \.(jpg|png|gif)$ - [NC,F,L]

# Or show an alternate image
# RewriteRule \.(jpg|png|gif)$ http://mysite.com/forbidden_image.jpg [NC,R,L]

Some notes about the code:

  • Switching on RewriteEngine gives us the ability to rewrite or redirect the user’s request.
  • RewriteCond specifies which requests the following RewriteRule applies to. %{HTTP_REFERER} is the variable that contains the URL of the page from which the request was made. The first condition (!^$) lets through requests that carry no referrer at all, such as direct visits.
  • Then we match it with our own domain mysite.com. We add (www\.) to ensure requests from both mysite.com and www.mysite.com are allowed. Similarly, our code covers http and https.
  • Next, we check if a jpg, png, or gif file was requested, and either show an error or redirect the request to an alternate image.
  • NC ignores the case, F returns a 403 Forbidden error, R redirects the request, and L stops the processing of further rules.
  • Note that you should apply only one of the rules above (either the 403 error or the alternate image). This is because the RewriteCond lines apply only to the first RewriteRule that follows them, and as soon as L is encountered, Apache does not apply any other rules. In the code example above, the alternate image method is commented out.
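
If you pick the alternate image method, keep in mind that the alternate image is itself a .jpg file, so it needs to be excluded from the rule or the redirect would loop. The following is a minimal sketch of that variant under some assumptions: partner-site.example is a hypothetical second domain allowed to embed your images, and forbidden_image.jpg sits in the site root.

```apache
RewriteEngine On

# Let requests with an empty Referer through (direct visits, some proxies)
RewriteCond %{HTTP_REFERER} !^$
# Own domain, with or without www, over http or https
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?mysite\.com [NC]
# Hypothetical second domain that is allowed to embed the images
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?partner-site\.example [NC]
# Do not rewrite the alternate image itself, or the redirect would loop
RewriteCond %{REQUEST_URI} !/forbidden_image\.jpg$ [NC]
# Everything else that asks for an image gets the alternate image
RewriteRule \.(jpe?g|png|gif)$ http://mysite.com/forbidden_image.jpg [NC,R=302,L]
```

All four conditions must hold for the rule to fire, because consecutive RewriteCond lines are combined with a logical AND by default.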

How Can Web Scrapers Bypass This?

One way for a web scraper to bypass such a hurdle is to download images as it encounters them in the HTML code. The scraper can find the image URLs with a regular expression, download the images, and change the links to point to its own copies when it stores the content.

Allow or Block Requests From Specific IP Addresses

If you happen to determine the origin of the requests of the web scraper (usually, it’s an unnaturally high number of requests from the same IP address), you can block requests from that IP address.

Order Allow,Deny
Allow from all
Deny from xxx.xxx.xxx.xxx

In the code above (and in other examples in this article) you would replace xxx.xxx.xxx.xxx with the IP address you want to block. If you are really paranoid about security, you could deny requests from all IP addresses and selectively allow from a whitelist of IP addresses:

Order Deny,Allow
Deny from all
# IP address whitelist
Allow from xx.xxx.xx.xx
Allow from xx.xxx.xx.xx
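
The Order, Allow and Deny directives used here are the Apache 2.2 syntax; on Apache 2.4 they are only available through mod_access_compat, and Require directives are the recommended replacement. As a rough sketch, a blocklist in that style could look like this (the addresses are placeholders from the documentation ranges, not values from this article):

```apache
# Apache 2.4: allow everyone except the listed addresses
<RequireAll>
    Require all granted
    # Placeholder single address
    Require not ip 203.0.113.7
    # Placeholder range: a CIDR block covers a whole subnet
    Require not ip 198.51.100.0/24
</RequireAll>
```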

One use case for this technique (not related to web scraping) is blocking access to WordPress’s wp-admin directory. In such a case, you would allow requests from only your IP address, greatly reducing the possibility of someone hacking your site via wp-admin.
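
A minimal sketch of that wp-admin whitelist, assuming Apache 2.4 syntax and a file placed at wp-admin/.htaccess. The IP address is a placeholder, and the admin-ajax.php exception is an assumption that applies only if your theme or plugins call it from the public side of the site.

```apache
# wp-admin/.htaccess (Apache 2.4 syntax)
# Only the listed address may reach the admin area (placeholder IP)
Require ip 203.0.113.7

# Some sites call admin-ajax.php from public pages; keep it reachable
<Files "admin-ajax.php">
    Require all granted
</Files>
```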

How Can Web Scrapers Bypass This?

If a web scraper has access to proxies, it could distribute its requests over the list of IP addresses to avoid abnormal activity from one IP address.

To explain: Let’s say someone is scraping your site from IP address 1.1.1.1. So you block 1.1.1.1 using .htaccess. Now, if the scraper has access to a proxy server 2.2.2.2, it routes its request through 2.2.2.2, so it appears to your server that the request is coming from 2.2.2.2. So, in spite of blocking 1.1.1.1, the scraper is still able to access the resource.

Thus, if the scraper has access to thousands of these proxies, it can become undetectable if it sends requests in low numbers from each proxy.

Redirect Requests From an IP Address

You can not only block requests from an IP address, but also redirect them to a different page:

# The redirect target must not be covered by this rule itself, or the request will loop
RewriteCond %{REMOTE_ADDR} ^xxx\.xxx\.xxx\.
RewriteRule .* http://mysite.com [R,L]

If you redirect them to a static site, chances are the scraper will figure this out. However, you can go one step further and do something a bit more innovative. For that, you need to understand how your content is scraped.

Web scraping is a systematic procedure. It involves studying URL patterns and sending requests to all possible pages on a website. If you are a WordPress user, for instance, the default URL pattern is http://mysite.com/?p=[page_no], and a scraper only has to increment page_no from 1 to a large number.

What you could do is create a page especially for redirection that redirects the request to one out of a number of predefined pages:

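RewriteCond %{REMOTE_ADDR} ^xxx\.xxx\.xxx\.
# Skip the redirection page itself so that it can be served
RewriteCond %{REQUEST_URI} !^/redirection_page
RewriteRule .* http://mysite.com/redirection_page [R,L]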

In the above code, “redirection_page” would be the page used to do one of the subsequent predefined redirects. Therefore, when a web scraping program is running, it would be redirected to a number of pages and it would be difficult to detect that you have identified the scraper.
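
If you would rather not write a separate redirection page, a variation on this idea is to let .htaccess pick the target. The sketch below is not the method described above but a substitute for it: it uses the current second of the server clock (%{TIME_SEC}) to choose between two hypothetical decoy pages, decoy-a.html and decoy-b.html, and the IP prefix is a placeholder.

```apache
RewriteEngine On

# Scraper's address range (placeholder), excluding the decoy pages themselves
RewriteCond %{REMOTE_ADDR} ^203\.0\.113\.
RewriteCond %{REQUEST_URI} !^/decoy-
# Seconds ending in 0-4: send to the first decoy
RewriteCond %{TIME_SEC} [0-4]$
RewriteRule .* http://mysite.com/decoy-a.html [R=302,L]

# Otherwise: send to the second decoy
RewriteCond %{REMOTE_ADDR} ^203\.0\.113\.
RewriteCond %{REQUEST_URI} !^/decoy-
RewriteRule .* http://mysite.com/decoy-b.html [R=302,L]
```

The conditions are repeated because each set of RewriteCond lines applies only to the single RewriteRule that follows it.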

Alternatively, “redirection_page” can redirect to a second page, “redirection_page_1”, which would then redirect back to “redirection_page”. This would lead to a redirect loop, and a request would get bounced back and forth between the two pages indefinitely, or until the scraper’s HTTP client gives up.
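
The loop itself can also be expressed directly in .htaccess. This is a sketch only, assuming the file sits in the site root and using a placeholder IP prefix; most HTTP clients stop following redirects after a fixed limit, so the loop costs the scraper time rather than trapping it forever.

```apache
RewriteEngine On

# First hop: redirection_page -> redirection_page_1 (scraper's range only)
RewriteCond %{REMOTE_ADDR} ^203\.0\.113\.
RewriteRule ^redirection_page$ http://mysite.com/redirection_page_1 [R=302,L]

# Second hop: redirection_page_1 -> redirection_page, closing the loop
RewriteCond %{REMOTE_ADDR} ^203\.0\.113\.
RewriteRule ^redirection_page_1$ http://mysite.com/redirection_page [R=302,L]
```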

How Can Web Scrapers Bypass This?

A web scraper could check whether a request was redirected. A redirect returns a 301 or 302 HTTP status code, whereas a normal response returns the 200 status code. On seeing a redirect, the scraper can refuse to follow it and retry the request later or through a different IP address.

Matt Cutts to the Rescue

Matt Cutts is the head of the web spam team at Google. Part of his job is to be on the constant lookout for scraping sites. If he doesn’t like your website, he can make it vanish from Google’s search results. The recent Panda and Penguin updates to Google’s search algorithm have affected a huge number of sites, including a number of scraper sites.

A webmaster can report scraper sites to Google using this form, providing the source of the content. If you produce original content, you will almost certainly be on the radar of web scrapers. Yet if they republish your content and you report them, Google can omit them from its search results.

  • Mr B

    Is there a way to deny a whole block? I want to deny every IP from 188.143.232.xxx
    Order Deny
    Deny from xxx.xxx.xxx

  • LouisLazaris

    Looks like we did the same Google search. :)

    • http://dada.theblogbowl.in/ Shaumik Daityari

      Yes, Louis! Since you are in the mod list, your comment was published without moderation, but mine was on hold since it contained a link! And I had commented maybe a few minutes before you did (but it wasn’t visible to you or anyone else!) :)

      • LouisLazaris

        Haha, yeah, I figured that’s probably what happened.

  • http://petermeadit.com/ Peter Mead

    Great round up of .htaccess. Lately I seem to be blocking semalt in .htaccess, then they spring up again with different host names on different domains. I would not mind so much but they really push your bounce rate up when they crawl because they are not registering as a crawler but rather as a visitor.

    Anyone else had this?

  • http://petermeadit.com/ Peter Mead

    Thanks for the info, Shaumik.