How can you determine whether a page visit comes from a real visitor or from a bot/crawler?
The only way I can think of at the moment is comparing $_SERVER['HTTP_USER_AGENT'] against a known list of bots/crawlers compiled from sites like these:
I realize this method would not be exact (agent spoofing or new agents/versions not on the list), but is there a better method of detection, or is this pretty much it?
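Something along these lines is what I had in mind; the patterns in the list are only illustrative, not a complete bot list:

```php
<?php
// Rough sketch of the user-agent comparison described above.
// The patterns below are only illustrative; a real list would be
// compiled from published bot/crawler databases and kept up to date.
function looks_like_bot($userAgent)
{
    $botPatterns = array('googlebot', 'bingbot', 'slurp', 'baiduspider', 'yandexbot', 'crawler', 'spider');

    $userAgent = strtolower($userAgent);
    foreach ($botPatterns as $pattern) {
        if (strpos($userAgent, $pattern) !== false) {
            return true;
        }
    }
    return false;
}

$ua    = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
$isBot = looks_like_bot($ua);
```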
Just how does that distinguish between a spider and someone who has set their user agent to match the one used by a spider? (Both IE and Firefox allow the browser owner to set the user agent to anything at all.)
Just about every stats program I have seen reports whether visitors are real people or bots, so if you only intend to use the information for stats there is no need to make the distinction yourself; the stats programs already perform that analysis.
You would only need to distinguish one from the other for purposes other than stats, and there the best advice is not to attempt the distinction at all: legitimate bots from search engines and the like will penalise any page that tries to treat them differently from real visitors, and that difference is easily detected. In fact, the main reason anyone would set their user agent to match that of a search engine spider is to see how the page looks to that spider (mainly to check whether it looks any different to them).
The JavaScript writes an image tag. The image tag’s source points to a PHP script. The PHP script records the hit. You don’t need to program any logic; it’s inherent in the design. Browsers with JS disabled don’t get counted… but that 1–2% of your traffic isn’t going to change any of the macro-level data you look at web stats for anyway.
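Roughly, the recording end could look like this; track.php, the query parameter and the flat-file log are just placeholders for illustration:

```php
<?php
// track.php — sketch of the image-based counter described above.
// A page would embed it with a small script, e.g.:
//   <script>
//     document.write('<img src="/track.php?page=' +
//       encodeURIComponent(location.pathname) + '" width="1" height="1" alt="">');
//   </script>
// The <img> only exists if the JavaScript ran, so most bots never request it.

$page = isset($_GET['page']) ? $_GET['page'] : 'unknown';

// Record the hit (a real tracker would more likely write to a database).
$line = date('c') . "\t" . $_SERVER['REMOTE_ADDR'] . "\t" . $page . "\n";
file_put_contents(__DIR__ . '/hits.log', $line, FILE_APPEND | LOCK_EX);

// Return a 1x1 transparent GIF so the image tag resolves cleanly.
header('Content-Type: image/gif');
header('Cache-Control: no-cache, no-store, must-revalidate');
echo base64_decode('R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7');
```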
So you would count the hit only if the JavaScript runs and the image request comes back? What would happen with browsers that have JavaScript disabled?
You eliminate 99% of bots/crawlers by using JavaScript and image-based tracking, like Google Analytics or any other third-party tracker. Very few crawlers execute the JavaScript and then fetch the images from the resulting image tags.
You don’t really have a choice in that case. If the only time you can collect a data point is when you’re sending a location header, all you have to tell a real requester from a bot is the user-agent string and the IP address. So filter against known bot IPs and user agents, which browscap.ini is good for.
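For example, if the browscap directive is set in php.ini, PHP’s get_browser() does the lookup against browscap.ini for you; a minimal sketch (the fallback branch is just a placeholder for your own lists):

```php
<?php
// Sketch of classifying the current request via browscap.ini.
// get_browser() needs the browscap directive configured in php.ini;
// otherwise it returns false and emits a warning.
$isBot = false;

if (ini_get('browscap')) {
    $info = get_browser(null, true); // null = use the current HTTP_USER_AGENT
    if ($info !== false && !empty($info['crawler'])) {
        $isBot = true; // browscap recognises this user agent as a crawler
    }
} else {
    // Fallback: compare the user agent and IP against lists you maintain yourself.
}
```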