Marketing departments love website statistics. There’s nothing better than handing a CEO a freshly generated report showing how their website traffic is growing. That’s when the trouble starts.
Many people are under the misconception that web statistics are absolutely irrefutable: the numbers are generated by independent computers and cannot possibly be wrong. So why do statistics from two or more sources rarely match? To understand the problem, we need to examine the two methods used to collate and analyze statistics. Today, we look at server-side methods…
Server-Side Data Collection and Analysis

Web servers record every file request they handle in a log file. A typical log entry contains:
- the request type — normally GET or POST
- the full path name of the requested file
- the date and time
- the requester’s IP address, and
- the requester’s user agent: a string of characters that identifies the requesting software, e.g. a specific OS and browser combination or a search engine bot.
Understandably, log files can grow to hundreds of megabytes even on relatively quiet websites.
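Each log entry packs those fields into a single line. As a minimal sketch, here is how one line in the common Apache/NGINX "combined" log format might be pulled apart with a regular expression (the sample line and field names are illustrative, not from a real log):

```python
import re

# Combined Log Format (the Apache/NGINX default): IP, identity, user,
# [datetime], "request", status, bytes, "referrer", "user agent"
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<datetime>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" '
    r'(?P<status>\d{3}) \S+ "(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

line = ('203.0.113.7 - - [10/Oct/2024:13:55:36 +0000] '
        '"GET /index.html HTTP/1.1" 200 2326 '
        '"https://example.com/" "Mozilla/5.0 (Windows NT 10.0) Firefox/115.0"')

m = LOG_PATTERN.match(line)
print(m.group('method'))      # GET
print(m.group('path'))        # /index.html
print(m.group('ip'))          # 203.0.113.7
```

Log analyzers apply a pattern like this to every line, which is why processing a multi-hundred-megabyte log takes real time and memory.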
The main benefit of server-based data collection is that it records every file request regardless of the technology used. It’s easy to assess the popularity of downloads or discover performance bottlenecks. Most servers produce log files by default, so it may be possible to access historical information about your traffic growth.
Unfortunately, there are a number of drawbacks:
- Large organizations often pass all internet requests through a single gateway. User identification becomes difficult when two or more users are sharing the same IP address and user agent string.
- The server logs cannot record cached files. Caching is essential, and the Internet would grind to a halt without it. Your browser caches files so, when you return to a page, it can show the previously downloaded copies without contacting the server, and those requests never appear in the log.
In addition, many ISPs cache popular website files on proxy servers. When you enter a web address, you may see files returned from that proxy rather than the originating website. As your site increases in popularity, you could even experience a drop in file access as more proxy servers cache your site.
Applications such as AWStats can analyze server log files to produce meaningful figures such as the number of unique users or visits. However, these applications must make assumptions when they interpret the data.
For example, the application could define a single “visitor session” as access from the same IP/user agent within the same 15-minute period. A user who visits a page then waits 16 minutes before clicking a link elsewhere would be recorded as two individual visitor sessions. But an application which assumed a 20-minute period of inactivity would record only one visitor session.
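The effect of that assumption can be sketched as a simple counting function. This is a hypothetical illustration (the timestamps and the `count_sessions` helper are invented for the example), not how any particular analyzer is implemented:

```python
from datetime import datetime, timedelta

def count_sessions(timestamps, timeout_minutes):
    """Count visitor sessions for one IP/user-agent pair: a new session
    starts whenever the gap between consecutive requests exceeds the
    assumed inactivity timeout."""
    if not timestamps:
        return 0
    timeout = timedelta(minutes=timeout_minutes)
    sessions = 1
    for prev, cur in zip(timestamps, timestamps[1:]):
        if cur - prev > timeout:
            sessions += 1
    return sessions

# Two requests from the same visitor, 16 minutes apart
hits = [datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 16)]
print(count_sessions(hits, 15))  # 2 — the 16-minute gap splits the visit
print(count_sessions(hits, 20))  # 1 — same data, more generous timeout
```

Identical log data, two different "correct" answers: the discrepancy comes entirely from the timeout each tool assumes.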
If server-side data collection and analysis is flawed, can client-side methods help us? View part 2 now…
Craig is a freelance UK web consultant who built his first page for IE2.0 in 1995. Since that time he's been advocating standards, accessibility, and best-practice HTML5 techniques. He's created enterprise specifications, websites and online applications for companies and organisations including the UK Parliament, the European Parliament, the Department of Energy & Climate Change, Microsoft, and more. He's written more than 1,000 articles for SitePoint and you can find him @craigbuckler.