Serving compact html - almost cracked it

Hi all,

I’m looking through the function I use to serve compact html to save bandwidth in the long run.

I've got it down to near perfect, except that when I look at the source sent to the browser it still has a lot of gaps between tags ("...>" gap here "<...>", "-->" gap here "<!--", "</div>" gap here "<div...") that might as well go and lend back some more saved bytes if I force them to.

In my function I also have the following (which should catch the above), but they don't seem to be doing their job:

$bff=str_replace("> <","><",$bff);
$bff=str_replace(">  <","><",$bff);

What am I missing? I've also got lines like the ones below, but they only go so far; the gaps above are usually 1-3 characters wide at most.

$bff=str_replace(array("\r\r\r\r","\r\r\r","\r\r","\r\n","\n\r","\n\n\n\n","\n\n\n","\n\n"),"\n",$bff);
$bff=str_replace("\t", "", $bff);
$bff=str_replace(array("\t\t\t","\t\t","\t\n","\n\t"),"\t",$bff);

Thanks,

The differences in bandwidth will be very, very small; the biggest thing you can do to save bandwidth is make sure all images are optimised and that your HTML is as semantic and concise as possible - not just by stripping spaces, but by not using tags which aren't required, etc.

As for your PHP output, well the following lines should help:


$bff = preg_replace("/[\\r\
\	]{2,}/", '', $bff);
$bff = preg_replace("/>\\s+</", '><', $bff);

The first line replaces any occurrence of two or more of \r, \n and \t in a row (e.g. \r\n, \r\n\t, \n\n, \t\r, \r\t etc.) with nothing - you can change that to a newline if you so wish.

The second replaces any amount of whitespace between tags with nothing.
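To see it in action, here's a quick standalone sketch (the sample markup is made up purely for illustration):

<?php
// Sample buffer with the sort of gaps described above (made-up markup)
$bff = "<div>\n\t<p>Hello</p>\r\n\t\t<p>World</p>\n</div>";

// Collapse any run of two or more \r, \n or \t characters
$bff = preg_replace("/[\\r\\n\\t]{2,}/", '', $bff);

// Remove whatever whitespace is left between a closing > and an opening <
$bff = preg_replace("/>\\s+</", '><', $bff);

echo $bff; // <div><p>Hello</p><p>World</p></div>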

biggest thing you can do to save bandwidth is make sure all images are optimised and that your HTML is as semantic and concise as possible

That can actually increase the bandwidth: <strong>, <em> and <br /> are more characters than <b>, <i> and <br>.

True, text/html is smaller than any image or indeed any other multimedia content, but as I've implemented it I might as well do it properly. 10,000 or so requests down the line, these savings will add up to somewhat meaningful numbers.

Thank you for the code, I will implement it within the respective php file that gets loaded at the start of each page.

Just tried it. Very nice indeed…just an odd gap here or there it seems. I've spotted it still skipping "/> some text goes here…" cases, hence leaving a gap between the > and the first letter of whatever text follows. Apart from that, and it always leaving one gap before a "/>" and a single gap after a "</div>", it's excellent now.

Would I be right in saying the "2" in the first line of code above means up to 2 instances of \n, \r or \t in the code? If so I can just change that 2 up to 4 and remove the lines below completely:

$bff=str_replace(array(">\n<",">\n\n<",">\n\n\n<",">\r<",">\r\r<",">\r\r\r<",">\t<",">\t\t<",">\t\t\t<"),"><",$bff);
$bff=str_replace(array(">\n<!",">\n\n<!",">\n\n\n<!",">\r<!",">\r\r<!",">\r\r\r<!",">\t<!",">\t\t<!",">\t\t\t<!"),"><!",$bff);
$bff=str_replace(array("-->\n<!--","-->\n\n<!--","-->\n\n\n<!--","-->\r<!--","-->\r\r<!--","-->\r\r\r<!--","-->\t<!--","-->\t\t<!--","-->\t\t\t<!--"),"--><!--",$bff);

Thanks,

You’re going to drastically reduce the readability of your source HTML. IMO it’s an anti-social thing to do on the internet.

Why don’t you use on-the-fly gzip? Most modern browsers support it and it reduces the size of the HTML delivered without compromising readability (and it’s smaller than just removing whitespace). Win-win. :slight_smile:
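For example, if you can't get a server module enabled, PHP itself can gzip the output buffer - a minimal sketch:

<?php
// Minimal sketch: ob_gzhandler checks the browser's Accept-Encoding header
// itself and only compresses the buffer when the client supports it.
if (!ob_start('ob_gzhandler')) {
    ob_start(); // plain buffering as a fallback
}
// ... the rest of the page is echoed as normal ...

(The zlib.output_compression setting in php.ini does much the same thing; use one or the other, not both.)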

Right but I only compact the PHP/HTML when it’s served to the browser. The only people who will want to look at the source via the browser will be those who want to see how the site is designed. Any real reason to make their lives easier?

Instead of gzip I'm thinking of the Apache deflate module (Apache 2.x), which is said to be faster than gzip, but as yet I'm having problems enabling it on my host (it gives a 500 error). That's another issue; people seem to like what it does performance-wise.

It will certainly make your life easier when you have problems :stuck_out_tongue:

@Risoknop - It depends how far you take "semantic". Technically <b> is just as semantic as <strong>, and <br /> instead of <br> is simply preference :slight_smile:

By semantic I mean only have elements where you need elements. Strictly there is no need for a master div - the body can hold everything.

Ideally you want to serve the MINIMUM html possible - and make the rest up with css.

For example, a typical minimal webpage could be:


<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> 
<html xmlns="http://www.w3.org/1999/xhtml" dir="ltr" lang="en"> 
    <head> 
        <title>Your Web Page</title>
        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
        <link rel="stylesheet" type="text/css" href="style.css" />
    </head>
    <body>
        <h1>YourSite</h1>
        <ul class="Menu">
            <li><a href="/" title="Home">Home Page!</a></li>
            <li><a href="/Articles" title="Articles">Articles</a></li>
            <li><a href="/Forums" title="Forums">Forums</a></li>
        </ul>
        <div class="Content">
            <h2>Title</h2>
            <p>In the beginning...</p>
        </div>
    </body>
</html>

Add a bit of CSS to that and you can have it looking as good as you like :slight_smile:

My point was that a lot of people use bloat in their code without even realising it - by staying minimal, you are reducing your bandwidth quite a lot.

It has to be said, though, that images will be a much bigger drain than HTML.

Apache's mod_deflate is the way to go - also make sure your images are optimized.

Yes, I've put all the PNGs through two different PNG optimisers and they have shrunk their footprint somewhat.

The pages load up quickly actually; it's just the benchmark test I did that said it was slow, but they measure it against a 56K modem, as if anyone is still using that - those were the fun years :slight_smile: The benchmark says that on a 2Mb connection (average now, I reckon) it will take 2.5s to load, so that's good enough. Some pages are beefier than others, but hey, up to 3.5s is within the norm, and with ADSL2+/DOCSIS 3.0 ramping up nicely in the coming years it will go down to 0.5s or less anyway.

As for getting all the spaces out of the HTML, I just like to do what I do properly, not skipping bits here or there if I can help it. I would never do it to my PHP/HTML files directly though; editing those would be a nightmare. Processors are fast enough these days to spare a few cycles doing it dynamically when the page loads - roughly along the lines of the sketch below.
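Something like this is the general shape of it (a simplified sketch rather than my exact code, and the function name is just for illustration; the callback reuses the two replacements from earlier in the thread):

<?php
// Simplified sketch: buffer the whole page and strip the whitespace in one
// go just before PHP sends it to the browser.
function compact_html($bff)
{
    $bff = preg_replace("/[\\r\\n\\t]{2,}/", '', $bff);
    $bff = preg_replace("/>\\s+</", '><', $bff);
    return $bff;
}

ob_start('compact_html'); // everything echoed after this point goes through the callback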

I'll go down the Apache 2.x deflate module route asap; I'm just waiting for my host to say why it's producing a 500 on their Apache when the code is basically copied from two different sources (one of which is the Apache website) and other people have said it works fine for them. So far the below (as an example) is not wanting to play ball:

<Location />
SetOutputFilter DEFLATE
SetEnvIfNoCase Request_URI \
\.(?:gif|jpe?g|png)$ no-gzip dont-vary
SetEnvIfNoCase Request_URI \
\.(?:exe|t?gz|zip|gz2|sit|rar)$ no-gzip dont-vary
</Location>

P.S. I have left out the BrowserMatch/AddOutputFilterByType commands so far.

Thanks guys.

@Risoknop - It depends how far you take "semantic". Technically <b> is just as semantic as <strong>, and <br /> instead of <br> is simply preference


I agree with everything you said except the first sentence. <strong> is much more semantic than <b>. In fact, <b> has absolutely zero semantic meaning (b = bold, so it just says the text will be bold). <strong> does have semantic meaning: text inside <strong> is supposed to be more important, to be emphasized. The fact that browsers display it as bold is just a technicality; the specification doesn't say it has to be bold.

My point was that a lot of people use bloat in their code without even realising it - by staying minimal, you are reducing your bandwidth quite alot.

I think that's really a minimal bandwidth loss; badly optimized images will use up 1000 times more useless bandwidth than even the ugliest and most bloated HTML code.

Right. Sorry to jump into your reply… but from an SEO perspective I don't think there's any semantic difference between <strong> and <b>, so in the end it simply boils down to our own programming preferences, agreed?

That was my initial thought, until I read some posts in the design forum when discussing semantics.

The major point that caught my attention is that 'bold' is a meaning of its own. Bold text belongs in classical typography, just like italics. It's not necessarily a property of a certain span of text, but it holds enough importance to have a tag associated with it.

I suppose that’s all a matter of debate, really :stuck_out_tongue:

However, the relevance to this thread is questionable (the difference between <b> and <strong> puts a negligible strain on bandwidth) - so as to keep this on topic I'm afraid we're going to have to cut that debate short :rolleyes:

There is quite a lot about semantics in the design forum. All I ever take from it is that the picky things are subject to personal opinion. However, the big things - such as tables vs thought-out code - can have a fairly large impact on the size of a file.

From a SEO perspective the world of the web is a horrible, horrible place :stuck_out_tongue:

Try not to base a lot of decisions on the impact of SEO - a well thought out page can thrash an SEO-optimised page in the search engines. It's more a combination of good content and good HTML - but if you have good content then your HTML can be horrific :lol:

Another thing that can save you lots of bandwidth is CSS and JavaScript (especially JS). Once you have tested your JS code, minify and compress it (especially libraries like jQuery, MooTools etc.). You can do the same with CSS - for example, a 30KB stylesheet can sometimes be compressed to less than 10KB.
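As a very crude illustration of the CSS side (a rough sketch only - a proper minifier is far safer, and the file names here are made up):

<?php
// Rough-and-ready CSS shrinker: strips /* ... */ comments and collapses
// whitespace. A dedicated minifier handles the edge cases properly.
$css = file_get_contents('style.css');
$css = preg_replace('!/\*.*?\*/!s', '', $css); // remove comments
$css = preg_replace('/\s+/', ' ', $css);       // collapse runs of whitespace
file_put_contents('style.min.css', trim($css));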

Then of course use mod_deflate. Probably the most important step towards minimizing bandwidth usage.

mod_deflate might not be working because you need to add at least these lines to /etc/apache2/conf.d/deflate:


AddOutputFilterByType DEFLATE text/html text/plain text/xml

BrowserMatch ^Mozilla/4 gzip-only-text/html
BrowserMatch ^Mozilla/4\.0[678] no-gzip
BrowserMatch \bMSIE !no-gzip !gzip-only-text/html

Call me anti-social if you will, but my primary concern is the 99.9% of my users who just want to surf the web and appreciate webpages that load really quickly. The one person in a thousand who looks at my HTML source is not my concern.

Personally I strip out as much of the whitespace formatting as I can as PHP spits out the HTML, and then deflate the resulting HTML along with all the rest of my text-based files before shipping them off to the browser. Both minifying the HTML source and compressing the resulting file gives a far bigger data transfer reduction than doing just one or the other.

You don't even want to see the JS code my PHP script spits out; it is an unreadable wall of text that mostly uses one- and two-character variable and function names. :rofl:

Oh, and yes, squeezing every last byte out of image files is important too. Just don't assume that PNG files will always be the smallest. With really tiny files I've found GIF files can consistently be smaller. The trick is to try both GIF and PNG and then see which is the smallest for a given image - a throwaway script like the one below makes the comparison quick.
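For example (paths and the naming convention are made up; it assumes you've already exported both versions of each image):

<?php
// Compare the PNG and GIF exports of each image and report the smaller one.
foreach (glob('images/*.png') as $png) {
    $gif = preg_replace('/\.png$/', '.gif', $png);
    if (!file_exists($gif)) {
        continue; // no GIF version of this image to compare against
    }
    printf("%s: png %d bytes, gif %d bytes -> keep %s\n",
        basename($png, '.png'),
        filesize($png),
        filesize($gif),
        filesize($gif) < filesize($png) ? 'gif' : 'png');
}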

Well I’d certainly hate to be you when it comes to debugging :stuck_out_tongue:

Fair enough. I guess it’s more “important” on websites that are more likely to be visited by web development nerds. I guess it depends on how much of a hippy or a business-person you are about the internet. Some people are fanatical about its openness, the idea of sharing, freedom of speech/information, etc., while others see it as nothing more than a business tool.