SafeHTML – cleaning form input


Reading a couple of web-related security books at the moment.

One is Apache Security, by Ivan Ristic (of mod_security), whom I got to meet again last weekend. I’ll save a long review for another time (I’m not finished reading yet); suffice to say this is a must-read if you’re doing anything around Apache, particularly for PHP developers, who tend to see just their small part of the stack (“Apache is the host’s problem, right?”).

Another is PHP-Sicherheit, a German publication, one of the authors being Christopher Kunz, who was at the conference, talking about Hardened PHP.


I also can’t say much about PHP-Sicherheit yet, other than I like what I’ve seen so far. What got me typing though was its mention of SafeHTML, an “anti-XSS HTML parser, written in PHP”, by Roman Ivanov, which I hadn’t seen before. In an odd way it’s kind of a product of the SitePoint forums, given that it uses XML_HTMLSax, which basically got developed in this thread.

Now SafeHTML is acting as a filter, trying to strip out anything dangerous. The general view on the web is that it’s practically impossible to do this: there are so many ways to sneak the word “javascript” in as the protocol to a link, for example, and IE is (sadly) very forgiving. I’ve only glanced at the code but so far it looks convincing. For example, here’s how he deals with analyzing the link protocol (just the relevant bits of the class)…

    var $blackProtocols = array(
        'about',   'chrome',     'data',       'disk',     'hcp',
        'help',    'javascript', 'livescript', 'lynxcgi',  'lynxexec',
        'ms-help', 'ms-its',     'mhtml',      'mocha',    'opera',
        'res',     'resource',   'shell',      'vbscript', 'view-source',
        '',          'wysiwyg',
        );

        // ...

      foreach ($this->blackProtocols as $proto) {
          // Allow whitespace or ASCII control characters before the
          // protocol name and between each of its letters
          $preg = "/[\s\x01-\x1F]*";
          for ($i=0; $i<strlen ($proto); $i++) {
              $preg .= $proto{$i} . "[\s\x01-\x1F]*";
          }
          $preg .= ":/i";
          $this->_protoRegexps[] = $preg;
      }

Should match not just “javascript” but also “java script” and many other possible combinations containing ASCII control characters.
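To see what such a pattern actually catches, here is a minimal standalone sketch of the same idea. It is not SafeHTML's code: the function names `buildProtocolPattern` and `looksLikeProtocol` are made up for illustration, and I'm assuming PHP 7 or later. It builds the pattern for a single protocol name the same way the loop above does, then tries it against a few link values.

```php
<?php
// Hypothetical helpers for illustration only; they are not part of SafeHTML.

// Build a case-insensitive pattern that tolerates whitespace or ASCII
// control characters before the protocol name and after each letter.
function buildProtocolPattern(string $proto): string
{
    $gap = '[\s\x01-\x1F]*';
    $pattern = '/' . $gap;
    foreach (str_split($proto) as $char) {
        // preg_quote keeps characters such as "-" or "." literal
        $pattern .= preg_quote($char, '/') . $gap;
    }
    return $pattern . ':/i';
}

// True when the value contains the (possibly obfuscated) protocol name
// followed by a colon.
function looksLikeProtocol(string $value, string $proto): bool
{
    return preg_match(buildProtocolPattern($proto), $value) === 1;
}

var_dump(looksLikeProtocol('javascript:alert(1)', 'javascript'));    // bool(true)
var_dump(looksLikeProtocol("java\tscript:alert(1)", 'javascript'));  // bool(true)
var_dump(looksLikeProtocol('JaVa Script :alert(1)', 'javascript'));  // bool(true)
var_dump(looksLikeProtocol('http://example.com/', 'javascript'));    // bool(false)
```

Note that the pattern is not anchored, so it flags the protocol name anywhere in the value, not just at the start. For a blacklist that errs on the side of stripping too much, that seems a reasonable trade-off.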

It also looks like it’s being smart about UTF-7, though I haven’t examined that closely yet. Another good sign (odd as it may seem) is the “bug reports”, which have also been fixed.

I’m still not entirely convinced though. One thing that puzzles me is that it’s taking all the decisions about what HTML gets stripped for you. Will it cope with a table tag with a large width that effectively breaks a design, for example (OK, that’s not XSS but…)? Still investigating… It would also be good to see this hosted somewhere like Berlios or Sourceforge.

Otherwise – side note (perhaps to Roman) – Jeff has since improved performance (over HTMLSax) with a new design, found here.

Harry Fuecks

Harry Fuecks is the Engineering Project Lead at Tamedia and formerly the Head of Engineering at Squirro. He is a data-driven facilitator, leader, coach and specializes in line management, hiring software engineers, analytics, mobile, and marketing. Harry also enjoys writing and you can read his articles on SitePoint and Medium.
