By James Edwards

Reducing HTTP requests with generated data URIs

I’m not really a server-side guy, but I do dabble here and there, and the other day I had this really neat idea I’d like to share. Of course it might be old-hat to you experienced PHP programmers! But then I hope you’ll be interested in my implementation, and who knows — maybe I’m about to make somebody’s day!

The idea is this: you can reduce the number of HTTP requests that a page has to make for its images, by pre-processing the source-code and converting them to data URIs. In fact, as long as the total amount of data involved doesn’t threaten PHP’s memory limit, you can reduce the number to zero!

The data URI scheme is a means of including data in web-pages as though it were an external resource. It can be used for any kind of data, including images, scripts and stylesheets, and is supported in all modern browsers: Gecko browsers like Firefox and Camino; Webkit browsers like Safari, Konqueror and Chrome; Opera, of course; and IE8 in a limited fashion (but not IE7 or earlier).
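For reference, a data URI follows the syntax defined in RFC 2397:

```
data:[<mediatype>][;base64],<data>
```

So an embedded PNG begins something like `data:image/png;base64,iVBORw0KGgo...` — the media type, the base64 flag, then the encoded file data itself.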

As Google soon attested, though, I’m not the first to have had the idea of using them for page-optimization. But the implementations I saw all revolved around re-writing image paths manually, to point them at a script, something like this:


<img src="<?php echo data_uri('images/darwinfish.png'); ?>" alt="Darwin Fish" />
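
The `data_uri()` function in that snippet isn’t shown; a minimal sketch of such a helper might look like this (the function name comes from the example above, but the `$mime` parameter is my own assumption):

```php
<?php
// Hypothetical helper: read a file and return it as a base64 data URI.
// $mime must be supplied by the caller; a fuller version would detect it.
function data_uri($file, $mime = 'image/png')
{
   $contents = file_get_contents($file);
   return "data:$mime;base64," . base64_encode($contents);
}
?>
```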
   

What I’m proposing is a retrospective process that converts all the image paths for you, so you don’t have to do anything special when you’re authoring the page in the first place.

Code is where the heart is

The following example is a complete demo page, with original HTML and CSS, surrounded by PHP.

The page contains five <img> elements and one CSS background-image, yet in supported browsers it makes no additional HTTP requests at all:

<?php 
if($datauri_supported = preg_match("/(Opera|Gecko|MSIE 8)/", $_SERVER['HTTP_USER_AGENT'])) 
{ 
   ob_start(); 
}
?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
<head>
   
   <title>Data URI Generator</title>
   <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
   
   <style type="text/css">
   
      body
      {
         background:url(images/texture.jpeg) #e2e2dc repeat;
         color:#554;
      }   
      
   </style>
   
</head>
<body>
   
   <p>
      <img src="images/dropcap.jpg" alt="dropcap.jpg" />
      <img src="images/firefox.png" alt="firefox.png" />
      <img src='images/specificity.jpg' alt='specificity.jpg' />
      <img src='images/darwinfish.png' alt='darwinfish.png' />
      <img src="images/rolleyes.gif" alt="rolleyes.gif" />
   </p>
   
</body>
</html>
<?php
   
if($datauri_supported)
{
   function create_data_uri($matches)
   {
      $filetype = explode('.', $matches[2]);
      $filetype = strtolower($filetype[count($filetype) - 1]);
      
      if(!preg_match('/^(gif|png|jp[e]?g|bmp)$/i', $filetype))
      {
         return $matches[0];
      }
      
      if(preg_match('/^\//', $matches[2]))
      {
         $matches[2] = $_SERVER['DOCUMENT_ROOT'] . $matches[2];
      }
   
      @$data = base64_encode(file_get_contents($matches[2]));
   
      return $matches[1] . "data:image/$filetype;base64,$data" . $matches[3];
   }
   
   
   $html = ob_get_contents();
   ob_end_clean();
   
   $html = preg_split("/\r?\n|\r/", $html);
   while(count($html) > 0)
   {
      $html[0] = preg_replace_callback("/(src=[\"'])([^\"']+)([\"'])/", 'create_data_uri', $html[0]);
      $html[0] = preg_replace_callback("/(url\(['\"]?)([^\"')]+)(['\"]?\))/", 'create_data_uri', $html[0]);
   
      echo $html[0] . "\r\n";
   
      array_shift($html);
   }
}
   
?>

How this all works

The heart of all this is the ability to build data URIs using base64-encoded image data.

But over and above that, there are a couple of key tricks we need to make this all work. Firstly, using the output buffer to pre-compile the output source, so we have a chance to parse it again before sending it to the browser.

You’ll see how, at the very top of the code, I’ve set a browser condition to decide whether to start the output buffer. That same condition is then used again, surrounding the main code at the bottom, so for browsers that don’t support this technique, all the scripting is bypassed and it just outputs the page as normal. Then what I’ve done is split the HTML by its line-breaks, so we can process, output then delete each line immediately, which avoids having to hold the entire source-code in memory.
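
As a minimal, standalone sketch of that first trick (the strings here are my own stand-ins, not from the demo page), the output-buffer pattern looks like this:

```php
<?php
ob_start();                      // start capturing output instead of sending it
echo "<p>Hello, world</p>";      // this goes into the buffer, not to the browser
$html = ob_get_contents();       // read back everything captured so far
ob_end_clean();                  // discard the buffer without sending it
// now we're free to transform the source before finally echoing it
echo str_replace('Hello', 'Goodbye', $html);  // prints "<p>Goodbye, world</p>"
```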

Secondly, to implement the actual parsing I’ve used the preg_replace_callback function, which identifies HTML and CSS paths with a pair of regular-expressions, and passes them through a process too complex for a simple replacement. (We have to look for src attributes and url properties separately, because the syntax is too different for a single regex to generate identical match arrays.)

Within the callback function we first have to work out the file-type, which is needed for the output data, and also acts as a condition for allowed types, so we can reject anything that’s not an image (such as a script src). The $matches array that’s passed to the function always contains the entire substring match as its first member (followed by backreferences from [1]), so if we identify a file we don’t want, we can just return that first match unmodified, and we’re done.
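
To make that concrete, here’s what the callback receives for a typical src attribute, using the first of the two patterns (shown with its escapes intact); the file path is a stand-in of my own:

```php
<?php
// Run the src-attribute pattern against a sample attribute to see
// the $matches array a preg_replace_callback callback would receive.
preg_match('/(src=["\'])([^"\']+)(["\'])/', 'src="images/foo.png"', $m);

// $m[0] => 'src="images/foo.png"'   (the entire substring match)
// $m[1] => 'src="'                  (backreference 1: opening delimiter)
// $m[2] => 'images/foo.png'         (backreference 2: the path itself)
// $m[3] => '"'                      (backreference 3: closing delimiter)
```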

The only other thing to do after that is check for web-root paths, which will need prepending with DOCUMENT_ROOT to create a usable file path. Once we have all that, we can grab and encode the image (with error-suppression in case the original path was broken), then compile and return the data URI. Too easy!

When is an optimization not an optimization?

When the cost is greater than the saving! And of course a process like this doesn’t come for free — there are several potential costs we have to consider.

Images as data URIs are roughly one-third larger than their originals. Such images also won’t cache — or at least, they won’t cache as images, but they will cache as part of the source-code. Caching in that way is more of a broad-brush than a fine-point, but it does at least allow for offline viewing.

There’s also the processing-overhead of doing each conversion in the first place, more so as the file gets larger. There’s no significant latency involved in loading it, as long as it’s on the same server, but base64 encoding is a comparatively expensive process.
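
That one-third figure falls straight out of how base64 works: every 3 bytes of input become 4 ASCII characters of output. A quick check (using a dummy string in place of real image data):

```php
<?php
// Base64 maps each 3-byte group of input onto 4 output characters,
// so the encoded form is 4/3 the size of the original (plus padding).
$raw = str_repeat('x', 3000);        // stand-in for a 3000-byte image file
echo strlen(base64_encode($raw));    // prints 4000
```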

So perhaps the optimum way to use this technique would be, not to use it for all images, but only to use it for large numbers of small images, like icons and background-slices. The additional code-size would be trivial, and the processing light, but the benefit of removing several-dozen requests could be great, creating a faster-loading page overall.

So let’s say, for example, that you only wanted to process GIF images; that’s easily done just by modifying the allowed file-types regex:

if(!preg_match('/^(gif)$/i', $filetype))
{
   return $matches[0];
}

Alternatively, you could use the filesize function to filter by size, and only proceed to conversion for those below a certain threshold:

if(filesize($matches[2]) > 1024)
{
   return $matches[0];
}

As far as large images go, I have read that browsers place limits on the size of data URIs; however I haven’t observed any such limitations in practice. Firefox, Opera, Safari, even IE8 were all perfectly happy displaying image-data more than 1MB in size. Ramping up the tests, I found myself hitting PHP’s memory limit without garnering any complaints from the browsers! Either I’m missing the point entirely, or there are no size limits.

Westward Ho!

While experimenting, I did try with JavaScript and CSS too; however that didn’t work in Internet Explorer, so I didn’t pursue it any further.

But a really interesting development from here, would be to see if it were possible to develop some kind of algorithm that calculates the cost vs. benefit of implementing this technique in different situations. Taking into account perhaps the size and complexity of the page itself, the number and size of each image, the ratio of repeating CSS images to ad-hoc content images, and the time it takes to convert and encode, compared with an average network request. Then bringing all that together to work out which images would benefit from conversion, and which are best left as they are. If we could do that, in a coherent yet automagical way, we’d have a pretty nifty WordPress plugin, hey!

But to be honest, I really don’t know where you’d start to work out something like that! There are several unquantifiables, and many judgement calls. It’s certainly something to think about though; and perhaps you — my dear reader — can offer some fresh insight?

Thumbnail credit: Stéfan
