By Stephen Pierzchala

Compress Web Output Using mod_gzip and Apache

Web page compression is not a new technology, but it has recently gained higher recognition in the minds of IT administrators and managers because of the rapid ROI it generates. Compression extensions exist for most of the major Web server platforms, but in this article I’ll focus on the Open Source Apache and mod_gzip solution.

GZIP-Encoding Basics

The idea behind GZIP-encoding documents is straightforward: rather than sending the raw file to a Web client, the server sends a compressed version of the data. Depending on the file's size and content, the compressed version typically runs from about 50% down to about 20% of the original file size.
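As a rough illustration of the idea, the sketch below uses Python's standard gzip module (not mod_gzip itself) to compress a small block of HTML and report the resulting ratio. The sample markup and the helper name compression_ratio are invented for this example, and the markup is deliberately repetitive, so real pages will compress less dramatically.

```python
import gzip


def compression_ratio(raw: bytes) -> float:
    """Return compressed size divided by original size."""
    if not raw:
        raise ValueError("raw must not be empty")
    return len(gzip.compress(raw)) / len(raw)


# Hypothetical sample page: repetitive markup compresses very well.
sample_html = (
    b"<html><body>"
    + b"<p class='item'>Hello, compressed world!</p>" * 200
    + b"</body></html>"
)

compressed = gzip.compress(sample_html)

# Decompressing returns the original bytes exactly; nothing is lost.
assert gzip.decompress(compressed) == sample_html

print(f"original:   {len(sample_html)} bytes")
print(f"compressed: {len(compressed)} bytes")
print(f"ratio:      {compression_ratio(sample_html):.1%}")
```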

In Apache, this can be achieved using Content Negotiation, which requires two separate sets of HTML files: one for clients that can handle GZIP-encoding, and one for those that can't. This solution sends gzip-encoded files to clients that understand them, but it cannot compress dynamically generated pages.
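The second set of files that this approach needs could be produced with a short script. The following is a minimal sketch, assuming the compressed copies live next to the originals with a .gz suffix; that naming, the function names, and the *.html pattern are assumptions made for illustration, and the Apache configuration that would select between the two sets is not shown.

```python
import gzip
import shutil
from pathlib import Path


def precompress(source: Path) -> Path:
    """Write a gzip copy next to `source` (index.html -> index.html.gz)."""
    target = source.with_name(source.name + ".gz")
    with source.open("rb") as src, gzip.open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return target


def precompress_tree(root: Path, pattern: str = "*.html") -> list[Path]:
    """Create a .gz twin for every file under `root` matching `pattern`."""
    return [precompress(path) for path in sorted(root.rglob(pattern))]
```

Every time a source file changes, its compressed twin has to be regenerated, and pages that are built per request never exist on disk to be pre-compressed. That is the limitation the paragraph above describes.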

A More Graceful Solution

A more graceful solution is the use of mod_gzip, one of the many additional modules available for Apache. I consider it one of the overlooked gems for designing a high-performance Web server. Using this module, configured file types will be compressed using GZIP-encoding after they’ve been processed by all of Apache’s other modules, and before they’re sent to the client. The compressed data that’s generated reduces the number of bytes transferred to the client, without any loss in the structure or content of the original, uncompressed document.

mod_gzip can be compiled into Apache as either a static or dynamic module -- I've chosen to compile it as a dynamic module in my own server. The advantage of using mod_gzip is that this method doesn't require anything to be done on the client side in order to make it work. As for the server side, all the server or site administrator has to do is:
  • compile the module,
  • edit the appropriate configuration directives that were added to the httpd.conf file,
  • enable the module in the httpd.conf file, and
  • restart the server.

In less than 10 minutes, you can be serving HTML files using GZIP-encoding.

How it Works

When a request is received from a client, Apache determines whether mod_gzip should be invoked by checking whether the client sent an "Accept-Encoding" HTTP request header that lists gzip. If the client sends the header (shown below), mod_gzip will compress the output of all configured file types before they're sent to the client.

Accept-Encoding: gzip

This client header announces to Apache that the client will understand files that have been GZIP-encoded. mod_gzip then processes the outgoing content and includes the following server response headers.

Content-Type: text/html 
Content-Encoding: gzip
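The exchange above can be sketched in a few lines of Python. This is only an illustration of the negotiation, not mod_gzip's actual logic: the function names are hypothetical, the Accept-Encoding parsing is simplified, and the real module also checks its configured file types before compressing.

```python
import gzip


def client_accepts_gzip(accept_encoding: str) -> bool:
    """Simplified check: is 'gzip' listed and not disabled with q=0?"""
    for part in accept_encoding.split(","):
        token, _, params = part.strip().partition(";")
        if token.strip().lower() != "gzip":
            continue
        params = params.replace(" ", "").lower()
        if params.startswith("q="):
            try:
                return float(params[2:]) > 0
            except ValueError:
                return False
        return True
    return False


def build_response(request_headers: dict, body: bytes, content_type: str):
    """Return (headers, payload), compressing only when the client allows it."""
    headers = {"Content-Type": content_type}
    # Header names are case-insensitive, so normalise before looking up.
    lowered = {name.lower(): value for name, value in request_headers.items()}
    if client_accepts_gzip(lowered.get("accept-encoding", "")):
        body = gzip.compress(body)
        headers["Content-Encoding"] = "gzip"
    headers["Content-Length"] = str(len(body))
    return headers, body
```

A client that sends no Accept-Encoding header simply receives the uncompressed body with no Content-Encoding header, which is why nothing needs to change on the client side.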

These server response headers announce that the content returned from the server is GZIP-encoded, but that once the client application has expanded it, the content should be treated as a standard HTML file. This works not only for static HTML files but also for pages that contain dynamic elements, such as those produced by Server-Side Includes (SSI), PHP, and other dynamic page generation methods. You can also use it to compress your Cascading Style Sheets (CSS) and plain text files. My httpd.conf file sets the following configuration for mod_gzip:

mod_gzip_item_exclude         file       \.js$
mod_gzip_item_exclude         mime       ^text/css$

mod_gzip_item_include         file       \.html$
mod_gzip_item_include         file       \.shtml$
mod_gzip_item_include         file       \.php$
mod_gzip_item_include         mime       ^text/html$

mod_gzip_item_include         file       \.txt$
mod_gzip_item_include         mime       ^text/plain$

mod_gzip_item_include         file       \.css$
mod_gzip_item_include         mime       ^text/css$
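Each directive pairs a rule type (file or mime) with a regular expression. The sketch below shows one way such rules could be evaluated, using a hypothetical subset of rules. It assumes that any matching exclude rule vetoes compression and that at least one include rule must match; that precedence is an assumption of this example, not a statement of how mod_gzip resolves overlapping rules. The patterns escape the dot so that it matches a literal period rather than any character.

```python
import re

# Hypothetical subset of rules, written as (action, kind, pattern).
RULES = [
    ("exclude", "file", r"\.js$"),
    ("include", "file", r"\.html$"),
    ("include", "file", r"\.txt$"),
    ("include", "mime", r"^text/html$"),
    ("include", "mime", r"^text/plain$"),
]


def should_compress(filename: str, mime: str, rules=RULES) -> bool:
    """Assumed policy: an exclude match vetoes; otherwise one include must match."""
    values = {"file": filename, "mime": mime}
    matched = [
        action
        for action, kind, pattern in rules
        if re.search(pattern, values[kind])
    ]
    if "exclude" in matched:
        return False
    return "include" in matched
```

With these sample rules, should_compress("index.html", "text/html") is true, while a JavaScript file or a PNG image is left uncompressed.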

I've had limited success compressing other file formats, mainly because Microsoft's Internet Explorer appears to examine the "Content-Type" header before it examines the "Content-Encoding" header. Say you configure your server to GZIP-encode PDF files using the following mod_gzip directives:

mod_gzip_item_include         file       \.pdf$
mod_gzip_item_include         mime       ^application/pdf$

This will work perfectly in both Mozilla and Opera, as these applications decode the GZIP-encoded content before they pass it along to the PDF reader (most people use Adobe Acrobat Reader).

However, Internet Explorer simply passes the GZIP-encoded content directly to the PDF reader. Once this issue is rectified in the MSIE code, you are likely to see a lot more Web servers serving a broader range of GZIP-encoded content.
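The order of operations that makes this work can be sketched as follows. This is a hypothetical helper, not code from any browser: it undoes the Content-Encoding first and only then reports the Content-Type that decides which application receives the data.

```python
import gzip


def prepare_for_handler(headers: dict, body: bytes):
    """Undo Content-Encoding first, then return (content_type, decoded_body)."""
    lowered = {name.lower(): value for name, value in headers.items()}
    if lowered.get("content-encoding", "").strip().lower() == "gzip":
        body = gzip.decompress(body)
    # Only now is the body in the format that Content-Type describes.
    content_type = lowered.get("content-type", "application/octet-stream")
    return content_type, body
```

A client that skips the decoding step hands the PDF reader a gzip stream instead of a PDF document, which matches the behavior described above.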

Bandwidth Savings

As the following two examples show, GZIP-encoded documents can produce substantial savings in bandwidth usage:

File 1: 3122 bytes uncompressed, 1578 bytes compressed
File 2: 56279 bytes uncompressed, 16286 bytes compressed
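To put those figures in percentage terms, the short calculation below works out the share of bytes saved for each pair of sizes quoted above. The function name savings_percent is just a label chosen for this example.

```python
def savings_percent(uncompressed: int, compressed: int) -> float:
    """Percentage of bytes saved by sending the compressed version."""
    if uncompressed <= 0:
        raise ValueError("uncompressed size must be positive")
    return (1 - compressed / uncompressed) * 100


# The two size pairs quoted in the article.
for raw, gz in [(3122, 1578), (56279, 16286)]:
    print(f"{raw} -> {gz} bytes: {savings_percent(raw, gz):.1f}% saved")
```

The smaller file shrinks by roughly 49% and the larger one by roughly 71%, so in these two samples the larger document benefits more.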

As a server administrator, you may be concerned that mod_gzip will place a heavy burden on your systems as they compress files on the fly. I'd like to point out that this does not seem to concern the administrators of Slashdot, one of the busiest Web sites on the Internet, who use mod_gzip in their very high-traffic environment.

The mod_gzip project page is located at SourceForge. Try it out for yourself.