  1. #1 bradical1379 (SitePoint Evangelist)

    Best way to prevent content scraping?

    I have a relatively large site with a lot of information that I want to better protect from being scraped by bots and content grabbers.

    How have others handled this?

    My guess is that a simple IP address table that checks the last visit time and the time difference between now and then, with some logic, would be the easiest way to start preventing it, but I wanted to hear some opinions on how others have fought the good fight.
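
    Something like this rough, untested sketch is what I'm picturing (assuming a Node/Express setup in TypeScript; the limits and the middleware name are just placeholders I made up):

    Code:
    import express, { Request, Response, NextFunction } from "express";

    // In-memory "IP table": recent request timestamps per address.
    // (A real setup would persist this somewhere like Redis.)
    const hits = new Map<string, number[]>();

    const WINDOW_MS = 60_000; // look at the last minute
    const MAX_HITS = 30;      // more than this per minute starts to look like a bot

    function throttleByIp(req: Request, res: Response, next: NextFunction) {
      const ip = req.ip ?? "unknown";
      const now = Date.now();
      const recent = (hits.get(ip) ?? []).filter(t => now - t < WINDOW_MS);
      recent.push(now);
      hits.set(ip, recent);

      if (recent.length > MAX_HITS) {
        res.status(429).send("Too many requests");
        return;
      }
      next();
    }

    const app = express();
    app.use(throttleByIp);
    app.get("/articles/:id", (req, res) => res.send("article body here"));
    app.listen(3000);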

    Thanks.

  2. #2
    I solve practical problems. bronze trophy
    Michael Morris's Avatar
    Join Date
    Jan 2008
    Location
    Knoxville TN
    Posts
    2,026
    Mentioned
    64 Post(s)
    Tagged
    0 Thread(s)
    Don't publish it.

    It's really that simple.

    First, search engines have to scrape your site for content to be able to index it, so if you find a way to thwart that, you won't be listed on any search engines and no one will find your site to care about it.

    But if you are that worried about what you have being stolen, don't publish it. 'Cause once it's out there, it's gonna stay out there, if not in Google's cache then somewhere else.

  3. #3 bradical1379 (SitePoint Evangelist)

    Quote Originally Posted by Michael Morris
    Don't publish it. It's really that simple.
    It has to be published; it's the backbone of the site. I'm not worried about Google or other search engines crawling the site; I'm worried about someone else grabbing my information and starting their own site like mine.

  4. #4 Michael Morris

    Stopping it at the code level isn't possible. What you can do, and what I do with my own published works, is search for long phrases from the work and see what comes up. That will turn up the copycats sooner or later. Incidentally, Google does this themselves and deranks sites that share large amounts of text with another site, so even if they do copy you, they won't beat you in the search rankings. You'll also have legal recourse, such as a DMCA takedown notice.
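
    For example, a quick throwaway script like this (just an illustration; the input file name is made up) pulls the longest sentences out of an article so you can paste them into Google as exact-phrase searches:

    Code:
    import { readFileSync } from "fs";

    // Pull distinctive long sentences out of an article to use as
    // exact-phrase ("...") search queries.
    const text = readFileSync("my-article.txt", "utf8");

    const sentences = text
      .split(/(?<=[.!?])\s+/)                     // crude sentence split
      .map(s => s.trim())
      .filter(s => s.split(/\s+/).length >= 12);  // long enough to be distinctive

    const queries = sentences
      .sort((a, b) => b.length - a.length)
      .slice(0, 5)
      .map(s => `"${s}"`);

    console.log(queries.join("\n"));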

    And no matter how narrow your writing, someone, somewhere will plagiarize it eventually. It's just the nature of the beast. If that isn't acceptable, you'll need to keep it to yourself.

  5. #5 SitePoint Guru, Poland

    There's no foolproof way to protect your content if you publish it. However, there are ways to make scraping more difficult, so that only the determined will copy your content, and you may delay any potential plagiarism since it will take more effort than just running a free website copier or a ready-made bot. Some protections I can think of:

    1. Don't publish your content as plain HTML. Instead, make it available via JavaScript that loads the content on demand from your server with Ajax. You can apply a simple encryption or scrambling mechanism on top of that (there's a sketch of this after the list).

    2. Publish content in some other, less common format, for example SVG or canvas. As far as I know, scribd.com uses many such tricks combined with heavy scripting and scrambling; try opening one of their ebooks in the browser and see for yourself.

    3. Put your content on the server in encrypted form and require your users to access it through an applet built on a browser plugin like Flash or Java, or even your own custom plugin that you provide for download.

    4. Require a CAPTCHA for access.

    5. Require registration for access. Paid registration is even better.

    All of these methods have drawbacks in terms of the additional work required to implement and support such a system, plus there are accessibility issues. And of course, no protection will deter the most determined.
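
    To illustrate option 1, here's a rough sketch (the endpoint path, element id, and base64 "scrambling" are just examples I picked, and base64 is obfuscation, not real encryption). The server returns the article body encoded instead of rendering it as HTML, and a small script fetches and decodes it on demand:

    Code:
    // Server side (Express), roughly:
    // app.get("/api/article/:id", (req, res) => {
    //   const raw = readArticleFromDb(req.params.id);            // hypothetical helper
    //   res.json({ body: Buffer.from(raw).toString("base64") }); // trivial scrambling
    // });

    // Client side: fetch the scrambled body and decode it into the page.
    async function loadArticle(id: string): Promise<void> {
      const res = await fetch(`/api/article/${id}`);
      const { body } = (await res.json()) as { body: string };
      const holder = document.getElementById("article");
      if (holder) {
        holder.textContent = atob(body); // undo the scrambling
      }
    }

    loadArticle("42");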

  6. #6 SitePoint Member

    There are several ways to deter content scraping. One option is to require a CAPTCHA when accessing the site. Another is to use an anti-scraping service such as fireblade.com, scrapesentry.com, or clareitysecurity.com.

  7. #7 Michael Morris

    What's the point of this, though? If you "secure" the content in this manner, sure, you stop bots from copying it, but humans, if motivated, can still thwart you. More importantly, such methods will stop spiders from indexing the site, so if you are looking to cash in on ad revenue for views on the content, you're shooting yourself in the foot, because those techniques are extremely anti-SEO.

