The Rise of Web Bots and Fall in Human Traffic

A couple of years ago I reported that 51% of all website traffic was non-human. The study, undertaken by Incapsula, has been updated. We have become the minority: bot traffic has reached 61.5%. I say “we”; there’s only a 38.5% chance you’re human.

The report data was gathered from 20,000 customers who use Incapsula’s security services. These are companies who are especially security-conscious or have been on the receiving end of nasty cyber attacks. They’re unlikely to represent the average website but the relative growth in bot traffic should be applicable.

The distribution indicates:

  • 38.5% is biological entities. Mostly humans, a few cats and assorted unclassified creatures.
  • 31.0% is search engine and other indexing bots (a rise of 55%).
  • 5.0% is content scrapers (no change). If you’re reading this anywhere other than SitePoint.com, you’re viewing a lazy copy of the original page. It won’t be as lovely an experience!
  • 4.5% is hacking tools (down 10%). Typically, this is malware, website attacks, etc.
  • 0.5% is spammer traffic (down 75%). That’s bots which post phishing or irritating content to blogs. Any negative comments below will certainly be from non-humans.
  • 20.5% is other impersonators (up 8%). These are bots used for denial of service attacks and marketing intelligence gathering.

The overall conclusion: bot traffic has increased by 21% in 18 months. However, the majority of this growth has come from cuddly good bots who have our best interests at heart (or should that be processor?).

Security Scares

A degree of cynicism is healthy. Incapsula is a security company; a rise in scaremongering has a direct correlation with their bottom line. That said, many companies are particularly lax about security until it’s too late. No system is ever 100% secure but the majority are caught out by basic SQL injections or social engineering. Never underestimate the ingenuity of crackers … or the naivety of your boss.

Why Your Website Visitors Are Falling

The rise of indexing bots is more interesting. We’re approaching a tipping point where the information you want won’t necessarily be obtained from the website where it originated. It’s already happening…

  • If you need company contact details, you enter the name in a search engine and it appears along with a map and directions.
  • If you want product information, you enter its name and can instantly view the specifications, prices and reviews.
  • If you want to find the closest Indian restaurant, it magically appears on a map on your smartphone.

At no point did you visit the official company website. The data is scraped and repackaged for easier consumption on an alternative device such as a smartphone, watch or Google Glass.

This type of activity has been occurring for many years but it’s fairly simplistic: you can search on one or two inter-related factors at most. The real challenge will be non-explicit joined-up data queries, e.g. “find a heating specialist who has worked for my neighbors” or “find all web design agencies in New York with a red logo”. The search engine or app could refine the data to a handful of relevant results rather than thousands of website links. The rise in web bot indexing activity will inevitably intensify.

Of course, a business website will remain essential — but having one which can feed the bots is increasingly important. Direct human traffic to your website may even fall but bot-based sales leads will rise. If you’re not doing so already, it’s time to invest in machine-readable data exposure, e.g.

  • structured data formats from Schema.org
  • item-specific data feeds such as products and services
  • discoverable, URL-based REST APIs
  • RSS and sitemap feeds.
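
For example, here’s a rough sketch of Schema.org microdata added to an ordinary contact page (the business details are invented for illustration):

    <div itemscope itemtype="http://schema.org/LocalBusiness">
      <!-- the name, address and phone number become machine-readable -->
      <h1 itemprop="name">Example Plumbing Ltd</h1>
      <p itemprop="address" itemscope itemtype="http://schema.org/PostalAddress">
        <span itemprop="streetAddress">1 High Street</span>,
        <span itemprop="addressLocality">London</span>
      </p>
      <p>Call <span itemprop="telephone">+44 20 7946 0000</span> or visit
        <a itemprop="url" href="http://www.example.com/">www.example.com</a>.</p>
    </div>

Mark-up like this is what lets a search engine show your contact details and a map directly in its results, without anyone visiting the page itself.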

The bots may be working for us, but they’re rapidly becoming our masters.

  • Craig Buckler

    I doubt there are any figures — it would be difficult to determine whether a system parses and stores the data. Certainly Google indexes some information but I suspect it’s one of the few.

    It’s also possible microdata will become redundant as HTML semantics improve, e.g. the time element is likely to indicate an event without special classes or structures. However, adding microformat mark-up takes relatively little time and the potential benefits certainly outweigh the costs.

    • James Edwards

      I don’t see how that could possibly happen — the time element could mean any of a thousand things, of which events are only one possibility. We could never have enough HTML elements to describe all the semantic meanings of just one of them.

      Adding bits and pieces of microdata is simple, but having a scheme which applies to every page of a site, that’s internally consistent, and meaningfully parseable, is actually quite complicated.

      If microdata is going to catch on, I think we need a sense of what it’s being used for, so that we can tailor the semantics we choose to their practical use-cases. Otherwise, it’s just pie in the sky.

      • Craig Buckler

        Agreed. And yes, the time element could mean anything but, if it were found in relation to an event, there’d be no need to give it a special class or attribute.

        I don’t think it’s necessarily complicated to use microdata/formats but, admittedly, it could get tricky if you don’t consider them at the start.

        • James Edwards

          What do you mean by “in relation to an event”? How is an “event” determined, other than by microdata?

        • http://newsviews.satya-weblog.com/ Satya Prakash

          Yes, if the HTML was written many years ago then applying microdata is complicated. The mark-up can end up unnecessarily nested and the microdata types may not fit the existing structure.

  • Craig Buckler

    That’s the full expanded list. The primary categories are creative works, embedded media, events, medical, organization, person, place, product and review: http://schema.org/docs/schemas.html

    For example, you can have a comedy event and food event, but they’re still types of “event” and encoded the same way.
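
    As a rough sketch (the event details are made up), a comedy event uses exactly the same mark-up structure as any other event; only the itemtype changes:

      <div itemscope itemtype="http://schema.org/ComedyEvent">
        <!-- swap ComedyEvent for FoodEvent and the properties stay the same -->
        <span itemprop="name">Stand-up Night</span> at
        <span itemprop="location">The Example Club</span> on
        <time itemprop="startDate" datetime="2014-02-01T20:00">1 February, 8pm</time>
      </div>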

    • James Edwards

      Yeah, there are a few basic types, but each of those has lots of sub-types. If you’re going to categorize every article, script, game, song, book, magazine, album etc. etc. as “creative work” then you might as well not bother. There’s no point implementing microdata unless you’re going to be as specific as the schema you’re using will allow. And that gets complicated.

      What we’re really discussing here is whether microdata is simple enough that a) it’s not much work to implement, and b) it could eventually be superseded by native HTML5 semantics. And I’m saying that neither of those is the case.

      But let me put it another way — have you ever actually tried to implement microdata, across a whole site with all kinds of different content, which is managed by a CMS, in a way that’s consistent and as precise as the schema allows?

      Because I have, and it was not simple.

      And I still have no idea whether the microdata is useful and compatible with consumers of that data, because (as I said at the start) we don’t really know what consumers are doing with it (if anything).

      • Craig Buckler

        My point was that the hundreds of sub-types boil down to a dozen or so basic categories with similar structures. If you understand one type of event, you understand them all. I agree it won’t be a simple task on a vast CMS-driven site with lots of types — but microformats are fine for the average website.

        Going back to my original thoughts, I do expect microformats to become redundant but not because HTML semantics will improve. The ultimate bot will understand the context of content regardless of mark-up. That will happen but, for now, microformats provide a reasonable machine-readable format which will become increasingly important as systems begin to repackage information in different ways.

  • ichsie