Ban bots from downloading

I have a PDF that I want to offer as a free download on my site, with no user account required. To prevent bots from abusing the download, how can I set up human verification? I was thinking of a CAPTCHA, but CAPTCHAs are for forms. Is it absolutely necessary to collect user data before allowing downloads? What would you suggest for human verification on downloads?

You could use captcha without gathering any other information. I am looking after a web site where we use captcha to verify a user (any user, not a logged in user) prior to displaying an email address, in theory so that bots can’t gather the email addresses. You could use it to verify the user, and on verification, trigger a PHP script that serves the PDF file to the user.

Note that I’m not saying that captcha is the best way to do this, or an effective way to do this (though it seems OK so far), only picking up on your note that it’s not suitable for you because there’s no form and no user logon in your situation. My involvement in the site I am looking after was to cobble something together quickly when recaptcha v1 stopped working.

By triggering a download script after verification, do you mean something like this?

if ($resp->isSuccess()) {
    require_once "download.php";
} else {
    $errors = $resp->getErrorCodes();
}

I was looking at reCAPTCHA v3. How do I know what the best threshold to set is?

And how does v2 Invisible know when to pop up the puzzle and when not to?
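
On the v3 threshold question: v3 returns a score between 0.0 and 1.0 with each verification (higher means more likely a legitimate interaction), and Google's documentation suggests 0.5 as a starting point that you then adjust after watching real traffic. Below is a minimal sketch of the decision step only, assuming you have already decoded the JSON from the siteverify call into an array. The function name `passesRecaptchaV3` and the action name `download` are made up for illustration.

```php
<?php
// Sketch only: decide whether a decoded reCAPTCHA v3 siteverify response passes.
// passesRecaptchaV3() and the 'download' action name are hypothetical;
// the 0.5 default is a starting point to tune, not a recommended final value.
function passesRecaptchaV3(array $response, string $expectedAction, float $threshold = 0.5): bool
{
    // 'success' is false when the token is missing, expired, or invalid.
    if (empty($response['success'])) {
        return false;
    }
    // The action should match the one the page asked for when it requested the token.
    if (($response['action'] ?? '') !== $expectedAction) {
        return false;
    }
    // Treat a missing score as 0.0 so it never passes by accident.
    return (float) ($response['score'] ?? 0.0) >= $threshold;
}
```

If too many real visitors are rejected, lower the threshold; if junk gets through, raise it.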

Screw CAPTCHA because that uses Google which is evil!

Just create a simple logic problem that only a human could answer…

What is three plus eight?

How many US states are there?

What is the thirteenth letter of the alphabet?

That kind of thing. Your download script then sends the appropriate headers and data, as long as the answer is correct.
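
A rough sketch of the answer check for that idea, assuming a small hard-coded question bank. The constant `QUESTIONS` and the function `isCorrectAnswer` are names invented for this example, and the accepted answers are compared after trimming and lowercasing.

```php
<?php
// Hypothetical question bank: each question maps to its accepted answers (lowercase).
const QUESTIONS = [
    'What is three plus eight?' => ['11', 'eleven'],
    'What is the thirteenth letter of the English alphabet?' => ['m'],
];

// Returns true only if the question is known and the trimmed, lowercased
// answer matches one of its accepted answers exactly.
function isCorrectAnswer(string $question, string $given): bool
{
    $accepted = QUESTIONS[$question] ?? [];
    return in_array(strtolower(trim($given)), $accepted, true);
}
```

In a real page you would probably keep the question you asked in `$_SESSION` rather than trusting one sent back by the browser, and only call `readfile()` on the PDF when the check returns true.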

These questions are US-centric, so this wouldn’t be right for international use (I know they were only examples).

I thought the same about the number of states (I’m never quite sure…) though I’d imagine the alphabet one is a bit more widely-known. As long as you specify which alphabet, of course. But of course, with states you have to take into account whether the question-setter is differentiating between states and commonwealths, or just counting them as “all the same”.

I actually asked my girlfriend and her mother how many there were (part of their larger question about how many senators there are). I gave them a hint it was 2x per state and they guessed 104. So even questions like that you have to be careful about; what’s obvious to you may not be to others.

This was ~ 3 weeks ago.

Any American who doesn’t know how many US states there are, or how many US Senators there are is not welcome on my website… :roll_eyes:

Ah, so your site is intended only for use by Americans. The OP might not have that same limitation.


How about rather than linking directly to the file you have a download script which checks the user agent and only serves the file to non-bots?

<?php
// Prevent bots from downloading the file. Make an exception for 'Cubot', which is a phone manufacturer.
$ua = $_SERVER['HTTP_USER_AGENT'] ?? '';
if (preg_match('/bot|crawl|slurp|spider|mediapartners/i', $ua) && stripos($ua, 'cubot') === false) {
    http_response_code(403);
    exit;
}

// Double quotes around the filename; single quotes can end up in the saved file name.
header('Content-Type: application/pdf');
header('Content-Disposition: attachment; filename="file.pdf"');
readfile('file.pdf');

Not foolproof but you wouldn’t need to use any human verification.


Is that regex enough for all bots and spiders? For example, does it catch Googlebot too?

Should do. Google’s UA includes “googlebot” so that should be caught. TBH I grabbed that regex from Stack Overflow a little while ago but it’s working OK for me.
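
If you want to check a particular user agent against that pattern before relying on it, something like the following would do. This is just the same regex wrapped in a function; the name `isLikelyBot` is made up, and the Cubot string in the usage note is an illustrative example rather than a real device's exact user agent.

```php
<?php
// Sketch: the same pattern as the download script, wrapped so it can be tested.
// isLikelyBot() is a hypothetical helper name.
function isLikelyBot(string $userAgent): bool
{
    // Case-insensitive match on common crawler keywords.
    $matchesBotWord = preg_match('/bot|crawl|slurp|spider|mediapartners/i', $userAgent) === 1;
    // 'Cubot' phones contain "bot" in their name but are not crawlers.
    $isCubotPhone = stripos($userAgent, 'cubot') !== false;
    return $matchesBotWord && !$isCubotPhone;
}
```

For example, `isLikelyBot('Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)')` matches on "bot", while a string such as `'Mozilla/5.0 (Linux; Android 9; CUBOT_X19)'` is let through. Anything that sends a normal browser user agent will get through too, which is why this is not foolproof.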

It may also be worth using the robots.txt file to block crawlers from indexing your downloads.


I would suggest that you create an onclick handler that forces the user to type in something, maybe your domain. This way you can block bots and get a custom solution.


I’m now wondering what kind of bots you’re looking to block? If it’s just search engines, you could probably just have a form which POSTs to a script with the download, and search engines won’t follow that.
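
A small sketch of what the script side of that could look like, assuming the download button is a plain form with `method="post"`. The helper name `isPostRequest` is hypothetical.

```php
<?php
// Sketch: only treat the request as a download attempt if it arrived via POST.
// isPostRequest() is a hypothetical helper; pass it $_SERVER in a real script.
function isPostRequest(array $server): bool
{
    return ($server['REQUEST_METHOD'] ?? '') === 'POST';
}

// Possible use at the top of the download script:
// if (!isPostRequest($_SERVER)) {
//     http_response_code(405);   // Method Not Allowed
//     header('Allow: POST');
//     exit;
// }
```

This only keeps out crawlers that follow links. A bot written to target your site can send a POST just as easily as a browser can.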


All squares are rectangles.
Not all rectangles are squares.
All spiders are bots.
Not all bots are spiders.

Spiders are simple to stave off - flagging the link to the download as nofollow and a decent robots.txt will keep them from indexing or crawling your link.

Bots are a far broader category. The truth is that nothing will completely stop bots - or at least, bot assisted people - from accessing the file. (The number of spam threads even this forum gets should indicate that…)

A CAPTCHA is a good step; using JavaScript will frustrate simpler bots.

What makes you so concerned about bots, though? What indication do you have that bots intend to abuse your download? HOW do you anticipate bots abusing your download?