Cache or Check?
Slow-loading pages are the biggest grievance of today’s Internet users. Given that the majority of people are still on a dial-up modem, this affects a lot of users. The "56k" label on your modem stands for "56,000 bits (of data) per second". Since one character takes around 10 bits to transmit, a (static) 5k file should take approximately 1 second to load. Realistically, though, the speed of your connection on an analogue phone line is going to be around 40k (40,000 bits per second), and if you consider that the index page of the Sitepoint Forums is around 80k in size, on a 40k dial-up it’ll take around 20 seconds to load.
What this means is that surfing the net can be extremely frustrating as you wait eternities for a page to slowly appear before your eyes – it can be even worse if the page you’re loading is image-heavy.
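The arithmetic above can be sketched in a few lines of Python. The function name and its parameters are made up for illustration, and the 10-bits-per-character figure is the article's rule of thumb rather than an exact property of any modem.

```python
def download_seconds(file_size_chars: int, line_speed_bps: int,
                     bits_per_char: int = 10) -> float:
    """Estimate how long a file takes to arrive over a dial-up line.

    Uses the rule of thumb that one character costs roughly 10 bits
    on the wire; real-world throughput will vary.
    """
    return file_size_chars * bits_per_char / line_speed_bps


# A 5k static file on a nominal 56k modem: just under a second.
print(round(download_seconds(5_000, 56_000), 2))   # 0.89

# An 80k forum index page on a realistic 40k connection.
print(download_seconds(80_000, 40_000))            # 20.0
```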
Time is of the Essence
One tool that an Internet Service Provider (ISP) can employ to help overcome this problem is the use of a cache (pronounced "cash"). The concept of the cache has been employed in computer science for many years now and is one reason why today’s computers appear to be as quick as they are.
What is a Cache?
Essentially all a cache does is store copies of (or pointers to) previously accessed data. The main implementation in computer architecture is to use a small area of very fast memory (SRAM) to store copies of recently accessed information from your main memory (RAM) or hard drive, which are a lot slower.
For example, open a fairly large local file (say around 500k). Depending on the speed of your system, the first time you access this it could take anywhere between 2 and 10 seconds to open, as the computer looks for the application it needs to open the file, then checks the file to make sure it’s OK, and then finally opens it (and this could take even longer if you have anti-virus software installed).
Now, close the file, then open it again, and it should appear magically before your eyes a lot faster. Why? Because your system’s built-in cache has remembered the file, and can serve much of it from fast memory instead of reading it all from your hard drive again.
The same applies to the Internet. Your ISP inevitably uses some sort of cache, for the simple reason that it improves the speed at which Web pages are delivered to your screen. If you have ever wondered how the Internet works, here is a basic synopsis. The user dials up to a server, which is "plugged-in" to the Net. The user then types in the URL (Uniform Resource Locator, or, the address) of a page they want to view. This sends a request to the server of the ISP. This server then looks up the specified address across the Internet. If it doesn’t find anything, it returns an error (normally an Error 404 Document Not Found). If it finds a matching address, it retrieves from the host server a copy of the document you want, and returns it to your browser, displaying it on your screen.
Admittedly all this happens in a fraction of a second, but assuming your ISP has found the file you’re after, this is when the speed of your dial-up comes into play. Big files equal slow download times, and a Web cache can speed this up by sitting between you and the server that hosts the page.
How Does a Cache Work?
Consider the routine that just occurred — you sent a request for a file, the server then went and hunted for the file, and if it found it, it fetched a copy for you. Now this would seem to be the only way for such an operation to work, but in fact it’s pretty inefficient. Imagine that at any one time there might be thousands of users all requesting the same page. Without a cache, the ISP’s server has to keep going back to the same address and getting the document for each individual request. However, if a Web cache is used, the routine is altered slightly.
A Web cache is basically a large hard drive where copies of documents are stored. So with a Web cache in use, the operation to retrieve a Web page changes. It now works like this: The first time there’s a request for a page, the server of your ISP has a look in the cache for a copy of that page. But as this is the first request for the page, the server won’t find it, so it busies off to the actual address and returns a copy to you, the user – but this time it also saves a copy of the file in its cache. Now, the second time the file is requested, the ISP’s server again looks in the cache, and hey presto – there is the requested file! It then simply sends the copy back to the user.
As you can see, this process is a lot more efficient, as the time taken to fetch a file is dramatically reduced, and the server can go back to finding and delivering other pages, rather than having to go and hunt around on the Net for this frequently-used page.
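The miss-then-hit routine described above can be sketched as follows. This is a minimal illustration, not how any particular ISP's cache is built: the `WebCache` class, the `fake_origin` function and the example URL are all hypothetical, and an in-memory dictionary stands in for the cache's hard drive.

```python
from typing import Callable, Dict, List


class WebCache:
    """Toy cache: look locally first, go to the origin only on a miss."""

    def __init__(self, fetch_from_origin: Callable[[str], bytes]) -> None:
        self._fetch = fetch_from_origin
        self._store: Dict[str, bytes] = {}   # stands in for the cache's disk
        self.hits = 0
        self.misses = 0

    def get(self, url: str) -> bytes:
        if url in self._store:
            # Second and later requests: serve the stored copy.
            self.hits += 1
            return self._store[url]
        # First request: fetch from the real address and keep a copy.
        self.misses += 1
        body = self._fetch(url)
        self._store[url] = body
        return body


origin_requests: List[str] = []


def fake_origin(url: str) -> bytes:
    """Hypothetical origin server that records every request it receives."""
    origin_requests.append(url)
    return f"<html>content of {url}</html>".encode()


cache = WebCache(fake_origin)
first = cache.get("http://example.com/index.html")    # miss, goes to origin
second = cache.get("http://example.com/index.html")   # hit, served locally
print(len(origin_requests), cache.hits, cache.misses)  # 1 1 1
```

However many users ask for the same page, the origin is contacted only once.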
Seems simple, doesn’t it?
Well, the concept is, but in practice it can get tricky. The above example describes what happens for dial-up users. Other stages exist in this process too, though. For example most Windows 9x and NT users will have a cache on their machine, labelled either Cache or the "Microsoft friendly" term, Temporary Internet Files.
To test this, find a site on the Internet which has a static HTML file. Let the page load fully, then close your browser, and kill your Internet connection. Open your browser again, ensuring it is set to "Work Offline" and put the Web address back into your browser. If all is working as it should, then the page should be displayed, even though you have no current connection to the Internet.
We can see then that the routine for viewing a Web page has changed again. The process is now like this: Page request from your computer, check your computer cache, then to ISP, ISP checks its cache, then finally it goes and gets the page from the actual origin. Getting worried? I’ll cover the problems this produces later on.
Not on a modem?
This process is for single users on a dialup connection. Consider then the situation of a large business where there are maybe 1-500 people all wanting to get onto the Internet. Obviously each user doesn’t have a modem in their machine, because that would require a phone line for every user. Instead, the users are all on a Local Area Network (LAN) which has a very high speed connection to the Internet. The requests for a page on the Internet are sent across the network to a proxy server (or router) which then deals out the requests.
For any firm this Internet connection is likely to be one of the biggest chunks of the company’s communications expenditure, as leasing lines for high speed Internet connection sharing can be extremely pricey. The reason for this is bandwidth.
Anyone who pays for Website hosting will be familiar with this concept. Strictly, bandwidth is a limit on the amount of data that a data line is able to transfer at once, but hosts also use the word for the total amount of data you are allowed to transfer in a given period. For each page that’s requested, its file size is deducted from (or added to, depending on how you look at it) your bandwidth allowance. For example, your host might say you have 4GB of bandwidth a month. This means that your 30k Webpage can be requested 133,333 times (or thereabouts). If your page gets requested more frequently than that, you’ll be faced with a bandwidth overage charge.
The same principle applies to the leasing of telephone/data lines for corporate usage, which applies especially to international bandwidth.
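The hosting example works out as below. The function name is invented for illustration, and the sketch assumes decimal units (1k = 1,000 bytes, 1GB = 1,000,000,000 bytes), which is what the article's figure implies.

```python
def max_requests(monthly_allowance_bytes: int, page_size_bytes: int) -> int:
    """How many times a page can be served before the allowance runs out."""
    return monthly_allowance_bytes // page_size_bytes


# 4GB of monthly bandwidth, 30k page (decimal units assumed).
print(max_requests(4_000_000_000, 30_000))  # 133333
```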
Well believe it or not, these proxy servers can also be configured to include the use of a cache. I was lucky enough to speak to one of the Systems Administrators for a large corporate LAN, who has just invested a substantial amount of money in a new cache facility.
Question 1: Why do you use a cache?
- Locally held objects are much faster to fetch than remote objects, therefore the network receives Web content faster.
- Cost. Currently the company is charged for the amount of international bandwidth used, and by using caching this cost can be minimised.
Question 2: What is the capability of the cache (i.e. how big is it)? How is it organised?
We have (currently) 2 Dell 2550 machines, with 1.5GB RAM and 18GB of cache space each. Together they handle around 8 million Web requests daily. We also peer with other caches at Nottingham and the UK national cache (if requested Web pages are in these caches, they are fetched from there instead of directly from sites).
Question 3: How often are the cached pages updated?
The ‘churn’ of the cache is seven days, though the cache checks pages on a regular basis to see if newer versions are available. This is also controlled by the sites themselves, which can set low expiry times on pages. Also, dynamic content is never cached.
We can see, then, that the route for a Web page has changed again. It is no longer request a page and your browser magically returns it, it goes something like this:
Request for a page > PC cache > Proxy Cache > ISP Cache > Destination
…with the possibility of adding in local/regional cache centres, as well as a National cache centre and feasibly an international cache centre (talks are in progress!).
So potentially there are 7 different hard drives that have to be scanned before the page you wanted to view is returned to you. And unless a cache facility is configured and indexed properly, searching a cache can be a very time consuming business. With ISPs being in the state they are at the moment, cheap and speedy is the only way to attract customers. Therefore they must use a cache to deliver Web pages quickly and minimise the bandwidth charges they pay to their telecoms provider.
But sadly this often results in slower Internet access! Why? Because the caches are so large that virtually every requested page is cached, and the server has to trawl through millions of pages checking to see if it has a copy of the page.
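To make the request route concrete, here is a rough sketch of a lookup that walks a chain of caches from nearest to furthest before falling back to the destination. The tier names, the `fetch_through_tiers` function and the dictionaries standing in for each cache are all assumptions for illustration; real caches also apply expiry rules that are left out here.

```python
from typing import Callable, Dict, List, Tuple

Tier = Tuple[str, Dict[str, bytes]]  # (name, stored pages)


def fetch_through_tiers(url: str, tiers: List[Tier],
                        fetch_from_origin: Callable[[str], bytes]
                        ) -> Tuple[bytes, str]:
    """Return the page and the name of whichever tier supplied it."""
    for index, (name, store) in enumerate(tiers):
        if url in store:
            body = store[url]
            # Copy the page into every cache nearer to the user.
            for _, nearer_store in tiers[:index]:
                nearer_store[url] = body
            return body, name
    # Nobody had it: go to the destination and fill every cache.
    body = fetch_from_origin(url)
    for _, store in tiers:
        store[url] = body
    return body, "origin"


tiers: List[Tier] = [("pc", {}), ("proxy", {}), ("isp", {})]
origin = lambda url: b"<html>page</html>"

print(fetch_through_tiers("http://example.com/", tiers, origin)[1])  # origin
print(fetch_through_tiers("http://example.com/", tiers, origin)[1])  # pc
```

Each extra tier is one more place that has to be searched on a miss, which is the cost the paragraph above describes.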
The Other Problem with Cache Facilities
I’m sure this has happened to you at one stage or another: you request a page and it looks old. You haven’t been given the latest version because the cache has done a quick scan, seen the page address and returned its stored copy to you. This is normally easily fixed with a hard refresh (Ctrl+F5 on a Windows/IE machine), assuming the cache is configured accordingly (i.e. to check all the attributes [time, file size and so on] rather than just the name of the file).
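The difference between matching on the name alone and checking the attributes can be sketched like this. The `PageAttributes` class and both function names are hypothetical, and a real cache would obtain the origin's attributes over the network rather than receive them as an argument.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PageAttributes:
    last_modified: str   # modification time as reported by the server
    size_bytes: int


def name_only_check(cached_url: str, requested_url: str) -> bool:
    """Naive cache: same address means 'serve the stored copy'."""
    return cached_url == requested_url


def attribute_check(cached: PageAttributes, origin: PageAttributes) -> bool:
    """Stricter cache: serve the stored copy only if nothing has changed."""
    return cached == origin


cached = PageAttributes("Mon, 01 Jan 2001 00:00:00 GMT", 30_000)
current = PageAttributes("Tue, 02 Jan 2001 09:30:00 GMT", 31_250)

print(name_only_check("http://example.com/", "http://example.com/"))  # True
print(attribute_check(cached, current))                               # False
```

The naive check would hand back the stale page; the attribute check notices the change and forces a fresh fetch.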
How can you tell if your ISP uses a cache?
Well, the simplest way is to check your connection settings. If you are on a Windows machine, have a look at Internet Options > Connection and click Settings. If the boxes under Proxy Server are filled in, then you’re probably using a cache, which is more than likely resulting in slower download speeds. You can fix this by disabling the proxy server, which will force your browser to bypass the proxy cache.
Maximising your computer’s cache
As I said earlier, the way a cache affects you is purely a personal thing (and is going to vary dramatically depending on the speed of your ISP), but you can all but disable caching on your local machine by going to Internet Options > Temporary Internet Files > Settings and dragging the slider down to 1MB. The other thing to do is to empty the contents of the Temporary Internet Files folder every now and then. Remember though, this might make your browsing slower, because the browser then has to retrieve a fresh copy of each page from the server.
The opposite of a static page, then, is a dynamic page, which can be considered to be a page that does something server-side, and which often includes the use of a database. The process of retrieving a dynamic page is different in that when the page is requested, instead of automatically retrieving a copy of it, a script within the page is executed and the results of that are returned. Consider using a search facility – you ask the search engine for something, it runs off, works its magic, and presents you with the related material. If this process were all client-side it would simply present you with one big page that contained all the several billion links it has. Because of this, dynamic pages are not cached (there are ways around this – see this article if you use PHP).
Another type of page that’s not cached is a secure page, for example anything that uses the HTTPS protocol (instead of the HTTP protocol that regular pages use). These pages aren’t cached for obvious reasons: you don’t want sensitive details (credit card information, email addresses, passwords etc.) saved.
When asked about this, the systems admin replied:
"If our users properly configure their machines using our instructions, then https, or secure traffic, goes direct out of the network, bypassing the cache. Even if users chose (and they have the choice) to send https traffic through the cache then it simply cannot be cached since it is all encrypted. The cache simply ‘proxies’ the data."
So you can rest assured that caching does not pose a security threat.
Embrace The Cache?
In my experience, the caching facilities employed by the major ISPs (for example Freeserve, in the UK) should be avoided, simply because the cache centres are so large, that page delivery is significantly slowed down. However, caching locally is a good thing. Having a look at your Internet Options > Temporary Internet Files (in IE) will tell you what’s going on with your PC.
The effects of this will vary from user to user and depend on your type of connection, but generally for fastest browsing the "Check for newer versions of stored pages" option (under Settings) should be set to "Automatically", whereas if you want to be sure you have the most up-to-date page, select "Every visit to the page".
If you use any kind of routing/proxy software then you have ultimate control over the caching. In my house in Southampton we have three computers that access the Internet via a proxy server. The proxy has been configured to only cache certain things, like *.jpg, *.gif and *.swf, as these take a longer time to download, and are often reused across pages. Since the actual page itself is not cached, I can be assured that the page itself is new, that if the page has changed any new images will be downloaded, but any images that aren’t new are simply displayed from the cache.
AOL does something similar in the way it handles images. Every image it displays is automatically compressed and cached, which is why images viewed in the AOL browser can often look blocky compared to another browser. It is possible to turn this option off, but remember to clear your Temporary Internet Files after doing this.
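The selective rule used on the home proxy described above (cache only *.jpg, *.gif and *.swf) could be expressed along these lines. This is only a sketch of the decision, not the configuration syntax of any real proxy product; `should_cache` and `CACHEABLE_EXTENSIONS` are names invented for the example.

```python
import posixpath
from urllib.parse import urlsplit

# File types worth caching: slow to download and often reused across pages.
CACHEABLE_EXTENSIONS = {".jpg", ".gif", ".swf"}


def should_cache(url: str) -> bool:
    """Decide from the URL's file extension whether to keep a copy."""
    path = urlsplit(url).path              # ignores any ?query or #fragment
    extension = posixpath.splitext(path)[1].lower()
    return extension in CACHEABLE_EXTENSIONS


print(should_cache("http://example.com/images/logo.GIF"))  # True
print(should_cache("http://example.com/index.html"))       # False
```

Because the HTML itself is never stored, every page view fetches fresh markup, while unchanged images come straight from the cache.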
Controlling Cache as a Webmaster
As Web developers, it is important to ensure the usability of your Website is maximised. This means that you want your page to be delivered exactly as you planned it, and not rely on your users thinking to refresh in the case of an old design or out-of-date content. There are a couple of tricks you can use to ensure that an up-to-date copy of your page is sent.
The first is quite simple: use the appropriate headers. There are a number of META tags you can use to control caching:
<META HTTP-EQUIV="cache-control" CONTENT="no-cache">
<META HTTP-EQUIV="pragma" CONTENT="no-cache">
<META HTTP-EQUIV="expires" CONTENT="Wed, 26 Feb 1997 08:21:57 GMT">
The first tag is pretty obvious, and the second is its older equivalent, included because HTTP/1.0 browsers and caches understand "pragma" but not "cache-control". To use the last one, simply set a date that is in the past, so that your browser thinks the page has expired, and will automatically grab a new copy (even if it has not changed).
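One caveat worth illustrating: META tags live inside the HTML, so they are generally seen only by the browser that parses the page, whereas a proxy cache typically looks at the HTTP response headers. If you control the server, the same three directives can be sent as real headers. The sketch below uses Python's standard `http.server` module purely as an example; the handler name, page body and port are assumptions, and your own server software will have its own way of setting headers.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# The same directives as the META tags, sent as genuine HTTP headers.
NO_CACHE_HEADERS = [
    ("Cache-Control", "no-cache"),
    ("Pragma", "no-cache"),
    ("Expires", "Wed, 26 Feb 1997 08:21:57 GMT"),  # a date in the past
]


class NoCacheHandler(BaseHTTPRequestHandler):
    """Serves one small page and asks every cache not to reuse it."""

    def do_GET(self) -> None:
        body = b"<html><body>Always fresh</body></html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        for name, value in NO_CACHE_HEADERS:
            self.send_header(name, value)
        self.end_headers()
        self.wfile.write(body)


def run(port: int = 8000) -> None:
    """Start the demo server; call run() yourself to try it locally."""
    HTTPServer(("localhost", port), NoCacheHandler).serve_forever()
```

Calling `run()` and requesting http://localhost:8000/ should show the three headers in the response.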
The second trick is to use dynamic pages, if you can, as these are not cached. Even if there is no database or dynamic content involved, it is possible to stop caching of pages by appending some variable to all your pages, such as a timestamp or a random number. This will fool the browser into thinking that the page is different, and grab a new copy every time, a good example of which can be seen at www.genie.co.uk.
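Appending a changing variable to a link might look like the following. The parameter name `ts` and the function `add_cache_buster` are arbitrary choices for this sketch; any name your pages don't already use would do.

```python
import time
from typing import Optional
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def add_cache_buster(url: str, param: str = "ts",
                     value: Optional[int] = None) -> str:
    """Return the URL with a timestamp parameter added (or refreshed)."""
    parts = urlsplit(url)
    # Keep existing parameters, dropping any stale copy of ours.
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k != param]
    stamp = int(time.time()) if value is None else value
    query.append((param, str(stamp)))
    return urlunsplit(parts._replace(query=urlencode(query)))


print(add_cache_buster("http://example.com/page.html", value=12345))
# http://example.com/page.html?ts=12345
```

Each new timestamp makes the address look like a page the cache has never seen, so a fresh copy is fetched. The price is that nothing is ever served from cache, so it is best reserved for pages that really must be current.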
To Cache or….?
The use of caching facilities in Web page delivery is one of the most debated topics on Web usability today. On the plus side, a Web cache can increase the speed of delivery of Web pages, but on the downside, improperly configured caches can leave the user with old, out-of-date pages that make little sense. As a Webmaster, you must consider your content and user base, and if necessary make the appropriate changes to ensure that your content is always delivered without caching.
For more information on caching, have a look at the CacheNow! project.