I have a relatively large site with a lot of information that I want to better protect from being scraped by bots and content grabbers.
How have others handled this?
My guess is that a simple IP address table that records each visitor's last visit time, plus some logic on the time elapsed since then, would be the easiest way to start, but I wanted to hear how others have fought the good fight.
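For illustration, here is a minimal sketch of that kind of IP table in Python. The class name `VisitTable` and the `max_hits` and `window_seconds` parameters are made up for this example, and the table lives in memory in a single process, so treat it as a starting point rather than a finished defence.

```python
import time
from collections import defaultdict, deque


class VisitTable:
    """Tracks recent request times per IP address and flags rapid-fire clients."""

    def __init__(self, max_hits=30, window_seconds=60.0):
        # Hypothetical defaults: at most 30 requests per IP in any 60-second window.
        self.max_hits = max_hits
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # ip -> timestamps of recent allowed requests

    def allow(self, ip, now=None):
        """Record a request from ip; return False if it exceeds the limit."""
        now = time.time() if now is None else now
        hits = self._hits[ip]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_hits:
            return False
        hits.append(now)
        return True
```

A request handler would call `allow(client_ip)` and refuse or slow down the response when it returns False. Many users can share one IP address behind a proxy, and search engine crawlers you want to keep would need to be exempted, so the thresholds need some care.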
First, search engines have to scrape your site to be able to index it, so if you find a way to thwart scraping altogether, you won’t be listed on any search engine and no one will find your site in the first place.
But if you are that worried about your content being stolen, don’t publish it. Once it’s out there, it’s going to stay out there, if not in Google’s cache then somewhere else.
It has to be published - it’s the backbone of the site. I am not worried about Google or other search engines crawling the site. I am worried about someone else grabbing my information and starting their own site like mine.
Stopping it at the code level isn’t possible. What you can do, and what I do with my own published works, is search for long phrases from the work and see what hits. That will turn up the copycats sooner or later. Incidentally, Google does this itself and ranks down sites that share large amounts of text with another site, so even if they do copy you, they are unlikely to beat you in the search rankings. You’ll also have legal recourse, such as a DMCA takedown notice.
And no matter how narrow your writing, someone, somewhere will plagiarize it eventually. It’s just the nature of the beast. If that isn’t acceptable you’ll need to keep it to yourself.
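The phrase-search idea can be partly automated. The Python sketch below is only an illustration: the function names and the eight-word phrase length are my own choices. It picks a few long phrases to paste into a search engine in quotes and, once you have the text of a suspect page, estimates how much of your wording it shares.

```python
import re


def words(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def shingles(text, size=8):
    """Return the set of all runs of `size` consecutive words in text."""
    w = words(text)
    return {" ".join(w[i:i + size]) for i in range(len(w) - size + 1)}


def copied_fraction(original, suspect, size=8):
    """Fraction of the original's word runs that also appear in suspect."""
    mine = shingles(original, size)
    if not mine:
        return 0.0
    return len(mine & shingles(suspect, size)) / len(mine)


def search_phrases(text, size=8, count=3):
    """Pick up to `count` evenly spaced phrases, quoted for exact-match search."""
    w = words(text)
    last_start = len(w) - size
    if last_start < 0:
        return []  # text is shorter than one phrase
    step = max(1, last_start // max(1, count - 1))
    starts = list(range(0, last_start + 1, step))[:count]
    return ['"' + " ".join(w[s:s + size]) + '"' for s in starts]
```

A `copied_fraction` close to 1.0 means nearly every eight-word run from your text shows up in the other page. Note that the phrases are normalised (lowercased, punctuation dropped), so they may need adjusting before they match as exact quotes.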
There’s no foolproof way to protect your content once you publish it. There are, however, ways to make scraping more difficult, so that only the determined will copy your content. That may at least delay plagiarism, since it takes more effort than running a free website copier or a ready-made bot. Some protections I can think of:
Publish content in a less common format, for example SVG or canvas. As far as I know, scribd.com combines many such tricks with heavy scripting and scrambling - try opening one of their ebooks in the browser and see for yourself.
Put your content on the server in encrypted form and require your users to access it through your own applet, using a browser plugin like Flash or Java, or even a custom plugin that you provide for download.
Require CAPTCHA for access.
Require registration for access. Paid registration is even better.
All of these methods have drawbacks: they take extra work to implement and support, and they create accessibility issues. And of course, no protection will deter the most determined.
What’s the point of this though? If you “secure” the content in this manner, sure, you stop bots from copying the content. Humans, if motivated, can still thwart you. More importantly, such methods will stop spiders from indexing the site, so if you are looking to cash in on ad revenue from views of the content, you’re shooting yourself in the foot, because those techniques are extremely anti-SEO.