Why Your Website Statistics Reports Are Wrong, Part 1

Marketing departments love website statistics. There’s nothing better than handing a CEO a freshly generated report showing how their website traffic is growing. That’s when the trouble starts.

Many people are under the misconception that web statistics are absolutely irrefutable: the numbers are generated by independent computers and cannot possibly be wrong. So why do statistics from two or more sources rarely match? To understand the problem, we need to examine the two methods used to collate and analyze statistics. Today, we look at server-side methods…

Server-Side Data Collection and Analysis

Every time you visit a website, the web server records information about every file request, e.g. the HTML file, CSS files, JavaScript files, graphic files, Flash movies, PDF documents, MP3 files, and so on. Implementations differ, but most servers record each request on a single line of a log file. The data normally includes:

  • the request type — normally GET or POST
  • the full path name of the requested file
  • the date and time
  • the requester’s IP address, and
  • the requester’s user agent: a string of characters that identifies the requesting software, e.g. a specific OS and browser combination, or a search engine bot.

Understandably, log files can grow to hundreds of megabytes even on relatively quiet websites.
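To make this concrete, here’s an invented request line in Apache’s widely used “combined” log format, with a minimal Python sketch that extracts the fields listed above (the regular expression is simplified, and real formats vary with server configuration):

```python
import re

# One invented line in Apache "combined" log format:
# IP - user [timestamp] "request" status bytes "referrer" "user agent"
LOG_LINE = ('203.0.113.7 - - [14/Jun/2010:10:22:04 +0000] '
            '"GET /articles/stats.html HTTP/1.1" 200 5124 '
            '"http://example.com/" "Mozilla/4.0 (compatible; MSIE 8.0)"')

# Simplified pattern capturing the fields discussed in the article
PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<datetime>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<bytes>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

match = PATTERN.match(LOG_LINE)
if match:
    print(match.group('method'))    # GET
    print(match.group('path'))      # /articles/stats.html
    print(match.group('ip'))        # 203.0.113.7
    print(match.group('agent'))     # Mozilla/4.0 (compatible; MSIE 8.0)
```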

The main benefit of server-based data collection is that it records every file request regardless of the technology used. It’s easy to assess the popularity of downloads or discover performance bottlenecks. Most servers produce log files by default, so you may also be able to access historical information about your traffic growth.
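For example, once requests have been parsed, gauging the popularity of downloads takes only a few lines (the paths below are invented):

```python
from collections import Counter

# Invented request paths pulled from a parsed log file
paths = ['/files/report.pdf', '/index.html', '/files/report.pdf',
         '/music/track.mp3', '/files/report.pdf', '/index.html']

# Count only requests for downloadable file types
downloads = Counter(p for p in paths if p.endswith(('.pdf', '.mp3', '.zip')))
print(downloads.most_common())
# [('/files/report.pdf', 3), ('/music/track.mp3', 1)]
```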

Unfortunately, there are a number of drawbacks:

  • Very little can be determined about the user’s browsing device. User agent strings offer minimal information and can be faked (Opera used to pretend to be IE to ensure sites did not block the browser). You cannot normally assess the user’s screen resolution settings or whether they had JavaScript and Flash enabled.
  • Large organizations often pass all internet requests through a single gateway. User identification becomes difficult when two or more users are sharing the same IP address and user agent string.
  • The server logs cannot record cached files. Caching is essential; the Internet would grind to a halt without it. Your browser caches files so that, when you return to a page, it can display previously downloaded files without requesting them from the server again, and no log entry is made.

    In addition, many ISPs cache popular website files on proxy servers. When you enter a web address, files may be returned from that proxy rather than from the originating website. As your site increases in popularity, you could even see a drop in recorded file accesses as more proxy servers cache your content.
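As an illustration of why caching punches holes in your logs, consider a file served with a Cache-Control: max-age header. This sketch (with invented timestamps) mimics the freshness test a browser or proxy applies before contacting your server again; while the copy is fresh, no request is made and nothing is logged:

```python
from datetime import datetime, timedelta

# Invented scenario: a file was fetched with "Cache-Control: max-age=3600",
# so the cached copy stays fresh for an hour
fetched_at = datetime(2010, 6, 14, 10, 22)
max_age = timedelta(seconds=3600)

def request_reaches_server(now):
    """True only once the cached copy has expired and a fresh request is sent."""
    return now - fetched_at >= max_age

print(request_reaches_server(datetime(2010, 6, 14, 10, 40)))  # False: no log entry
print(request_reaches_server(datetime(2010, 6, 14, 11, 30)))  # True: logged again
```

(Real caches are more sophisticated; a stale copy may be revalidated with a conditional request, which is logged but transfers no file.)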

Applications such as AWStats can analyze server log files to produce meaningful figures, such as the number of unique users or visits. However, these applications must make assumptions when they interpret the data.

For example, an application could define a single “visitor session” as access from the same IP/user agent combination within the same 15-minute period. A user who visits a page, then waits 16 minutes before clicking a link elsewhere, would be recorded as two separate visitor sessions. But an application that assumed a 20-minute period of inactivity would record only one.
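A rough sketch of that grouping logic makes the effect obvious (the timestamps are invented, and real analyzers such as AWStats are more involved):

```python
from datetime import datetime, timedelta

# Invented hits from one IP/user agent combination, in chronological order
hits = [
    datetime(2010, 6, 14, 10, 0),
    datetime(2010, 6, 14, 10, 16),  # 16 minutes after the previous hit
    datetime(2010, 6, 14, 10, 20),
]

def count_sessions(hit_times, timeout_minutes):
    """Start a new session whenever the gap since the last hit exceeds the timeout."""
    timeout = timedelta(minutes=timeout_minutes)
    sessions, last_hit = 0, None
    for hit in hit_times:
        if last_hit is None or hit - last_hit > timeout:
            sessions += 1
        last_hit = hit
    return sessions

print(count_sessions(hits, 15))  # 2: the 16-minute gap splits the visit
print(count_sessions(hits, 20))  # 1: the same gap falls within the window
```

Same log, same visitor, two different “truths”: the choice of timeout alone changes the headline figure.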

If server-side data collection and analysis is flawed, can client-side methods help us? View part 2 now…

  • http://www.starsites.co.za Jacotheron

    My host has AWStats as a default feature, and what I have noticed is that it calculates unique visitors by their IP.
    I do agree that statistics are usually wrong (it is believed that even 80% of all statistics are wrong), but they still give you a relative idea of what is happening on the site (or whatever the statistics measure).
    Great post.

  • http://www.optimalworks.net/ Craig Buckler

    Thanks Jacotheron.

    Statistics have their uses, but I’m always wary when they’re pounced on by marketing departments who do not understand the processes or meaning behind the data. People like glossy reports and believe them – and that could be dangerous to your business.

  • netteran

    Use Google Analytics, and there will be no false data :) or alternatively Piwik, which allows us to control and store all the data on our own server.

  • http://www.pixeline.be pixeline

    That’s an excellent – and welcome – article. Stats are all the rage nowadays, because they’re the supposed metric for ROI when the conversion is not about sales or subscribers.

    Stats are good for showing the evolution of your traffic, spotting trends, and visualising the impact of parallel marketing actions. But they don’t have any “scientific” value. They’re estimates, at best.

  • www.websitedesign.co.uk

    This is a good article as it explains that there are always factors that will affect whatever metric you are trying to measure.

  • Ahmad Alfy

    Well, I use a service called Woopra. It’s a live stats app that gives you immediate feedback and alerts about visitors on your website. It’s pretty awesome, and I think it covers most of the drawbacks you were talking about. It identifies each visitor and gives him an ID, and if he returns even a month later he’s marked as a returning visitor… However long he stays inactive, he’s not new anymore :)

  • Peter

    Good article. That is why we set up Google Analytics for each one of our client sites.

    Actually, I believe it doesn’t matter where your clients get their stats from; for smaller businesses, as long as the figures come from the same place consistently, that’s enough.

    The beauty of Google Analytics is that the weekly results are pushed to the recipient, and they don’t have to actually do anything but open a .pdf. We have found over the years that, despite teaching clients how to read AWStats, for example, they don’t bother unless it is mission critical, so a weekly “executive summary” from Google is ideal for 90% of them.

    Peter

  • jumpingdogdesign

    How about all them spambots that visit? Do they skew results?

  • http://www.optimalworks.net/ Craig Buckler

    Hi Ahmad,

    Woopra looks good, but it uses a client-side script to collect usage data (like Google Analytics). That will solve some of the caching issues you experience with server-side collation, but there are still several problems you need to be aware of. My next article will explain more…

  • http://www.optimalworks.net/ Craig Buckler

    @jumpingdogdesign
    That’s a great point.

    Spambots will register a hit in the server-side logs. If they set a ‘normal’ user agent, such as IE8 on Windows, then it’s not possible to identify them (unless they always come from a known IP address, which you can filter out).

    So yes, spambots will skew results somewhat. However, I’d hope your user-to-spambot ratio is high enough to overcome the noise.
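    As a rough sketch of the kind of filter I mean (the addresses and agent strings here are invented):

    ```python
    # Invented blocklists: known spambot addresses and telltale agent substrings
    BLOCKED_IPS = {'198.51.100.23', '198.51.100.24'}
    BOT_AGENT_HINTS = ('bot', 'crawler', 'spider')

    def is_probable_bot(ip, user_agent):
        """Flag hits from known addresses or self-identifying bot agents."""
        agent = user_agent.lower()
        return ip in BLOCKED_IPS or any(hint in agent for hint in BOT_AGENT_HINTS)

    print(is_probable_bot('198.51.100.23', 'Mozilla/4.0 (compatible; MSIE 8.0)'))  # True
    print(is_probable_bot('203.0.113.7', 'Googlebot/2.1'))                         # True
    print(is_probable_bot('203.0.113.7', 'Mozilla/4.0 (compatible; MSIE 8.0)'))    # False
    ```

    The spambots that fake a normal browser will, as I said, slip straight through.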

  • http://www.scriptsdesk.com jakab

    One of the good things is that we can run more than one analytics package. If we do that, it is very useful to compare the figures. Thanks for providing such a great article.