Why Your Website Statistics Reports Are Wrong, Part 1
Marketing departments love website statistics. There’s nothing better than handing a CEO a freshly generated report showing how their website traffic is growing. That’s when the trouble starts.
Many people are under the misconception that web statistics are absolutely irrefutable: the numbers are generated by independent computers and cannot possibly be wrong. So why do statistics from two or more sources rarely match? To understand the problem, we need to examine the two methods used to collate and analyze statistics. Today, we look at server-side methods…
Server-Side Data Collection and Analysis
Every time you visit a website, the web server records information about every file request, such as the HTML file, CSS files, JavaScript files, graphic files, Flash movies, PDF documents, MP3 music, and so on. Implementations differ, but most servers record each request on a single line of a log file (a sample entry and a short parsing sketch follow the list below). The data normally includes:
- the request type — normally GET or POST
- the full path name of the requested file
- the date and time
- the requester’s IP address, and
- the requester’s user agent: a string of characters that identifies the requesting software, such as a specific OS and browser combination or a search engine bot.
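To make that list concrete, here is a minimal sketch in Python that parses one log entry. It assumes the widely used “combined” log format (the default on nginx and a common choice on Apache); the sample line, IP address, path and user agent are invented for illustration.

```python
import re

# A hypothetical log entry in the "combined" format; real entries will vary
# with your server configuration.
sample = (
    '203.0.113.7 - - [12/Mar/2010:14:55:36 +0000] '
    '"GET /downloads/report.pdf HTTP/1.1" 200 48230 '
    '"http://www.example.com/reports/" '
    '"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36"'
)

# One named group per field: IP, timestamp, request type, path, status code,
# response size, referrer and user agent.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

match = LOG_PATTERN.match(sample)
if match:
    entry = match.groupdict()
    print(entry['method'], entry['path'], entry['ip'])
    print(entry['agent'])
```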
Understandably, log files can grow to hundreds of megabytes even on relatively quiet websites.
The main benefit of server-based data collection is that it records every file request, regardless of the technology used. It’s easy to assess the popularity of downloads or discover performance bottlenecks. Most servers produce log files by default, so you may be able to access historical information about your traffic growth.
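As a rough illustration of that benefit, the sketch below tallies the most-requested paths in a raw log, assuming the requested path sits in the seventh whitespace-separated field (as it does in the common and combined formats); the access.log file name is a placeholder.

```python
from collections import Counter

counts = Counter()

# Count requests per path. In the common/combined log formats the requested
# path is the seventh whitespace-separated field; adjust for your server.
with open('access.log') as log:
    for line in log:
        fields = line.split()
        if len(fields) > 6:
            counts[fields[6]] += 1

# The ten most frequently requested files, e.g. to gauge PDF download popularity.
for path, total in counts.most_common(10):
    print('{:>8}  {}'.format(total, path))
```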
Unfortunately, there are a number of drawbacks:
- Very little can be determined about the user’s browsing device. User agent strings offer minimal information and can be faked (Opera used to pretend to be IE to ensure sites did not block the browser). You cannot normally assess the user’s screen resolution settings or whether they had JavaScript and Flash enabled.
- Large organizations often pass all internet requests through a single gateway. User identification becomes difficult when two or more users share the same IP address and user agent string (see the sketch after this list).
- The server logs cannot record cached files. Caching is essential; the Internet would grind to a halt without it. Your browser caches files so that, when you return to a page, it can reuse the files it downloaded previously rather than requesting them again, and those repeat views never appear in your server log.
In addition, many ISPs cache popular website files on proxy servers. When you enter a web address, the files may be returned from that proxy rather than the originating website. As your site increases in popularity, you could even see recorded file requests drop as more proxy servers cache your content.
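To illustrate the shared-gateway problem from the list above, here is a tiny sketch that assumes, as many log analyzers do, that each distinct IP/user-agent pair is one visitor; the requests are invented.

```python
# Hypothetical requests from three different people in one office, all routed
# through a single gateway IP and using the same standard-issue browser build.
requests = [
    ('192.0.2.10', 'Mozilla/5.0 (Windows NT 6.1) Firefox/3.6', '/index.html'),
    ('192.0.2.10', 'Mozilla/5.0 (Windows NT 6.1) Firefox/3.6', '/products.html'),
    ('192.0.2.10', 'Mozilla/5.0 (Windows NT 6.1) Firefox/3.6', '/contact.html'),
]

# An analyzer that equates one IP/user-agent pair with one visitor reports
# a single visitor here, however many people were actually browsing.
visitors = {(ip, agent) for ip, agent, path in requests}
print(len(visitors))  # prints 1
```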
Applications such as AWStats can analyze server log files to produce meaningful figures such as the number of unique users or visits. However, these applications must make assumptions when they interpret the data.
For example, an application could define a single “visitor session” as a series of requests from the same IP address and user agent with no more than 15 minutes between them. A user who visits a page then waits 16 minutes before clicking a link elsewhere would be recorded as two separate visitor sessions. But an application that assumed a 20-minute period of inactivity would record only one visitor session.
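The effect of that assumption is easy to demonstrate. This sketch groups hypothetical request timestamps from a single IP/user-agent pair into sessions using a configurable inactivity timeout: a 15-minute timeout yields two sessions, a 20-minute timeout only one.

```python
from datetime import datetime, timedelta

# Hypothetical request times from one IP/user-agent pair; the visitor pauses
# for 16 minutes between the second and third request.
timestamps = [
    datetime(2010, 3, 12, 14, 0),
    datetime(2010, 3, 12, 14, 2),
    datetime(2010, 3, 12, 14, 18),  # 16 minutes after the previous request
    datetime(2010, 3, 12, 14, 20),
]

def count_sessions(times, timeout_minutes):
    """Count visitor sessions: a new session starts whenever the gap between
    consecutive requests exceeds the inactivity timeout."""
    if not times:
        return 0
    timeout = timedelta(minutes=timeout_minutes)
    sessions = 1
    for previous, current in zip(times, times[1:]):
        if current - previous > timeout:
            sessions += 1
    return sessions

print(count_sessions(timestamps, 15))  # 2 sessions
print(count_sessions(timestamps, 20))  # 1 session
```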
If server-side data collection and analysis is flawed, can client-side methods help us? View part 2 now…