Tidy HTML


John Coggeshall has posted his slides from the International PHP Conference here.

At first glance Tidy may seem like nothing more than a nice tool for the pedantic. At second glance I start to think Tidy may be the biggest new piece of functionality in PHP for a long time; one we’re going to be thanking John for again and again.

Take a look at this slide for example – Tidy is smart enough to extract legacy HTML “styling” tags and convert them to CSS. Perhaps that should be no surprise, because the underlying C library began with Dave Raggett (father of the HTML tag, among many other things).

There are also other nice features, like the Word 2000 mode, which suggest a tool written for the real world.

Where Tidy gets more exciting for PHP, IMO, is that it enables conversion of HTML to a format ready for XML parsing. To an extent, that means it’s almost irrelevant what the HTML your PHP scripts spit out looks like – Tidy can convert it to XHTML (see ob_tidyhandler), and from there it can be transformed further with XSLT.
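As a rough sketch of that idea, assuming the tidy and dom extensions are loaded and current PHP syntax: the function name html_to_xhtml() and the sample markup are made up for illustration, while output-xhtml and numeric-entities are standard HTML Tidy options.

```php
<?php
// Sketch: repair legacy HTML with Tidy so that it can be parsed as XML.
// html_to_xhtml() is a hypothetical helper, not part of the tidy extension.

function html_to_xhtml(string $html): string
{
    $config = [
        'output-xhtml'     => true, // emit well-formed XHTML
        'numeric-entities' => true, // avoid named entities an XML parser may not know
    ];
    $tidy = tidy_parse_string($html, $config, 'utf8');
    $tidy->cleanRepair();

    return tidy_get_output($tidy);
}

// Invented sample: unclosed <head>, <font>, <p> and a bare <br>.
$legacy = '<html><head><title>Old page</title><body><p><font color="red">Hello<br>world</p>';
$xhtml  = html_to_xhtml($legacy);

// The repaired markup should now load as XML, which is what XSLT needs.
$dom = new DOMDocument();
$ok  = $dom->loadXML($xhtml);
```

For whole pages, registering the handler with ob_start('ob_tidyhandler') should apply the same kind of repair to everything a script prints, without touching the code that generates the HTML.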

Note there’s also a tutorial on Tidy @ Zend here.


  • Derick

    I think you’re highly overreacting here about its usefulness. Of course, tidy is useful… but I didn’t find any reason *why* I should need this. (Good) Designers make XHTML standard compliant websites anyway, and for the others... they don’t really care about standards anyway.

  • http://www.gregorybair.com pionar

    (Good) Designers make XHTML standard compliant websites anyway, and for the others... they don’t really care about standards anyway.

    Yeah, ok, you’re telling me that every page you’ve ever written has started out and stayed perfect. Yeah right. Tidy is very useful for cleaning up code after testing, etc.

  • http://www.phppatterns.com HarryF

    “I think you’re highly overreacting here about its usefulness”

    It was Monday morning so that’s possible, having had a good weekend. But there are further things I like about Tidy:

    - HTML parsing. Right now, with PHP 4.x, the only options are the DOM extension or a library implemented in PHP (which will be slow). The more options for parsing HTML, the better IMO. With the Tidy “Word mode” for example, I can allow users to publish HTML using Word, Tidy it, then grab the body using the body() method. Another application of Tidy’s ability to parse could be templating.

    - Brings PHP closer to the medium it’s most commonly used with. That’s more a philosophical point, but I think it’s possible for PHP to “do more” as a web technology, and make life easier for users. Valkyrie seems like a step in the same direction. Currently messing about with libcroco in spare moments (there’s never enough), while working out how to write PHP extensions, which might make another useful tool.

    Although new applications may be presented with XHTML, I think Tidy opens a lot of possibilities for “reshaping” older HTML apps.
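The Word publishing idea above might look roughly like this. It is only a sketch: word_html_body() and the sample input are hypothetical, and it assumes the tidy extension with the word-2000 option available in the installed Tidy library.

```php
<?php
// Sketch: clean HTML saved from Word and keep only the body content.
// word_html_body() is a hypothetical helper; the input below is invented.

function word_html_body(string $html): string
{
    $config = [
        'word-2000'    => true, // ask Tidy to strip Word 2000 specific markup
        'output-xhtml' => true,
    ];
    $tidy = tidy_parse_string($html, $config, 'utf8');
    $tidy->cleanRepair();

    // body() returns a tidyNode; its value is the markup of that node.
    $body = $tidy->body();

    return $body === null ? '' : $body->value;
}

$wordHtml = "<html><body><p class=MsoNormal style='mso-bidi-font-size:12.0pt'>"
          . "Hello from Word</p></body></html>";

$content = word_html_body($wordHtml);
```

The returned string still includes the body tags themselves, so a publishing script would probably want to walk the node's children instead before storing the content.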

  • texdc

    Wow! This is incredibly useful! Why? Think about getting a POS website that was coded last year without any regard to standards and looks like code from 1996. You could run all 1500 .html files through Tidy and, voila!, you have completely parseable, standard XHTML files. You could even extract all the content, assuming the original files had some sort of structure, and dump it into a DB! Woo hoo!

    Of course you could also just tag Tidy onto the final output stage of any current CMS or forum or shop. Too cool for school!

    Just my $.02!

  • http://www.lastcraft.com/ lastcraft

    “I think you’re highly overreacting here about its usefulness”

    I can think of a few uses. Tools that screen scrape, tools like sitemesh (HTML integration), web service wrappers on legacy or remote content, XSLT style transformations on HTML streams, web site testing (SimpleTest2 will use it), template languages that are still HTML rather than mode switching (add a CSS class="widget" to get a PHP widget instead) to work with DreamWeaver, extending HTML with your own tag variations, on-line page editing like a Wiki to make a content management tool and CSS redundancy checks (off line).

    I could probably think of some more if I put my mind to it :).

    yours, Marcus

  • Zed

    Here’s how it’s useful. You work in a startup where there are not any coding standards and looking through code written by former “architects” is like looking through your back yard for a particular blade of grass. However, given the apparent abilities of Tidy you can seemingly clean up the spaghetti-mess so you can proceed to get some work done.

  • chregu

    You don’t need tidy for doing XSLT with non-wellformed HTML documents, since the dom extension can also parse HTML:

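    In PHP5 (something similar is possible with PHP 4) do the following:

    // Parse (possibly malformed) HTML into a DOM tree
    $dom = new DOMDocument();
    $dom->loadHTMLFile("http://www.php.net");
    // Load the stylesheet and hand it to the processor
    $xsl = new DOMDocument();
    $xsl->load("yourxslt.xsl");
    $proc = new XSLTProcessor();
    $proc->importStylesheet($xsl);
    print $proc->transformToXml($dom);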

    Have fun

    chregu

  • http://www.phppatterns.com HarryF

    web site testing (SimpleTest2 will use it)

    Good point – hadn’t thought of that. The error reporting Tidy offers seems to be excellent so it’s also a tool to tell you what to fix.
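A minimal sketch of using that error reporting from a test or a build script, assuming the tidy extension is loaded. tidy_messages() is a hypothetical helper; the errorBuffer property it reads is part of the extension.

```php
<?php
// Sketch: collect Tidy's diagnostics for a piece of markup.
// tidy_messages() is a hypothetical helper built on the errorBuffer property.

function tidy_messages(string $html): array
{
    $tidy   = tidy_parse_string($html, [], 'utf8');
    $buffer = (string) $tidy->errorBuffer;

    // One diagnostic per line; drop empty lines.
    $lines = array_map('trim', explode("\n", $buffer));

    return array_values(array_filter($lines, 'strlen'));
}

// Invented sample: no DOCTYPE, no title, unclosed paragraph.
$messages = tidy_messages('<p>unclosed paragraph');

foreach ($messages as $message) {
    echo $message, "\n";
}
```

Each message is plain text with the line and column Tidy assigns to the problem, so a test could fail whenever the returned array is not empty.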

  • Ben

    This is a godsend for anyone integrating PHP with Flash/Actionscript!