SafeHTML - cleaning form input



by Harry Fuecks

Reading a couple of web-related security books at the moment.

One is Apache Security, by Ivan Ristic (mod_security), who I got to meet again last weekend. Will save a long review for another time (I’m not finished reading yet); suffice to say this is a must-read if you’re doing anything around Apache, particularly for PHP developers, who tend to see just their small part of the stack (“Apache is the host’s problem, right?”).

Another is PHP-Sicherheit, a German publication, one of the authors being Christopher Kunz, who was at the conference, talking about Hardened PHP.

SafeHTML

Also can’t say much about PHP-Sicherheit yet, other than I like what I’ve seen so far. What got me typing though was its mention of SafeHTML - an “anti-XSS HTML parser, written in PHP”, by Roman Ivanov, which I hadn’t seen before. In an odd way it’s kind of a product of the SitePoint forums, given that it uses XML_HTMLSax, which basically got developed in this thread.

Now SafeHTML is acting as a filter, trying to strip out anything dangerous. The general view on the web is that it’s practically impossible to do this - there are so many ways to sneak the word “javascript” in as the protocol of a link, for example, and IE is (sadly) very forgiving. I’ve only glanced at the code but so far it looks convincing. For example, here’s how he deals with analyzing the link protocol (just the relevant bits of the class)…


    var $blackProtocols = array(
        'about',   'chrome',     'data',       'disk',     'hcp',     
        'help',    'javascript', 'livescript', 'lynxcgi',  'lynxexec', 
        'ms-help', 'ms-its',     'mhtml',      'mocha',    'opera',   
        'res',     'resource',   'shell',      'vbscript', 'view-source', 
        'vnd.ms.radio',          'wysiwyg', 
        );

        // ...

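      foreach ($this->blackProtocols as $proto) {
          // Allow whitespace / ASCII control characters before and between the letters
          $preg = "/[\s\x01-\x1F]*";
          for ($i = 0; $i < strlen($proto); $i++) {
              // $proto[$i] replaces the original $proto{$i}; curly-brace offsets were removed in PHP 8
              $preg .= $proto[$i] . "[\s\x01-\x1F]*";
          }
          $preg .= ":/i";
          $this->_protoRegexps[] = $preg;
      }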

Should match not just “javascript” but also “java script” and many other possible combinations containing ASCII control characters.
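
To make that concrete, here is a rough standalone sketch of the same idea. It is my own illustration, not SafeHTML's code: the function names buildProtocolRegexps() and hasBlacklistedProtocol() are made up, and how SafeHTML actually applies the patterns to attribute values isn't shown in the excerpt, so the simple "any pattern matches" check is an assumption.

```php
<?php
// Protocol list copied from the excerpt above.
$blackProtocols = array(
    'about',   'chrome',     'data',       'disk',     'hcp',
    'help',    'javascript', 'livescript', 'lynxcgi',  'lynxexec',
    'ms-help', 'ms-its',     'mhtml',      'mocha',    'opera',
    'res',     'resource',   'shell',      'vbscript', 'view-source',
    'vnd.ms.radio',          'wysiwyg',
);

// Hypothetical helper: builds one case-insensitive pattern per protocol,
// allowing whitespace or ASCII control characters (0x01-0x1F) before and
// between the letters, followed by a colon.
function buildProtocolRegexps(array $protocols): array
{
    $regexps = array();
    foreach ($protocols as $proto) {
        $preg = "/[\s\x01-\x1F]*";
        for ($i = 0; $i < strlen($proto); $i++) {
            $preg .= $proto[$i] . "[\s\x01-\x1F]*";
        }
        $preg .= ":/i";
        $regexps[] = $preg;
    }
    return $regexps;
}

// Hypothetical helper: true if any pattern matches anywhere in the value.
// (Assumption: the excerpt does not show how SafeHTML applies the patterns.)
function hasBlacklistedProtocol(string $value, array $regexps): bool
{
    foreach ($regexps as $preg) {
        if (preg_match($preg, $value) === 1) {
            return true;
        }
    }
    return false;
}

$regexps = buildProtocolRegexps($blackProtocols);

$samples = array(
    "javascript:alert(1)",     // plain
    "java script:alert(1)",    // embedded space
    "jav\tascript:alert(1)",   // embedded tab
    "JaVaScRiPt:alert(1)",     // mixed case
    "http://example.com/",     // harmless link
);

foreach ($samples as $sample) {
    printf("%-24s %s\n", json_encode($sample),
        hasBlacklistedProtocol($sample, $regexps) ? "blocked" : "ok");
}
```

The first four samples come out as blocked and the last as ok. Note the patterns are not anchored to the start of the value, so in this sketch a harmless URL that merely contains something like “data:” further along would be flagged too - it errs on the side of stripping.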

It also looks like it’s being smart about UTF-7 - haven’t examined that closely yet. Another good sign (odd as it may seem) is the “bug reports”, which have also been fixed.

Still not entirely convinced though - one thing that puzzles me is it’s taking all the decisions about what HTML gets stripped for you. Will it cope with a table tag with a large width, that effectively breaks a design, for example (OK - that’s not XSS but…)? Still investigating… Would also be good to see this hosted somewhere like Berlios or Sourceforge.

Otherwise - side note (perhaps to Roman) - Jeff has since improved performance (over HTMLSax) with a new design, found here.

This post has 14 responses so far

  1. Yes, heard about SafeHTML from somewhere. When I was looking at doing something similar, I used PHP 5’s DOM loadHTML() and then transformed the result with an XSLT, which is basically a whitelist of templates of allowed elements and attributes, discarding everything else.

     
  2. Well it’s “hosted” at pearweb http://pear.php.net/package/HTML_Safe but not in the sense you talked about, it’s not even the most recent version … the maintainer seems a bit too lazy to maintain cvs version and release to pearweb.

    Would be nice if someone could convince him to do that ;D (I had a hard enough time getting him to release the initial release after his package was accepted into PEAR)

     
  3. safeHTML should pass all the XSS examples on http://ha.ckers.org/xss.html

     
  4. http://hvge.sk/scripts/tagwall/ Not documented, but really worth a try. Works really well on some websites I know.

     
  5. “safeHTML should pass all the XSS examples on http://ha.ckers.org/xss.html”

    Playing around with the demo, looks OK. Thanks for the link - that really needs turning into some unit tests, to run against these sorts of projects.

    “http://hvge.sk/scripts/tagwall/ Not documented, but really worth a try. Works really well on some websites I know.”

    Like the first impressions there - nice code. Not sure it’s doing quite the same thing though - seems more focused on stripping particular HTML tags rather than XSS prevention. Will look further.

     
  6. I guess this needs a link as well: http://blog.bitflux.ch/wiki/XSS_Prevention

     
  7. How does this compare to something like KSES?

    http://sourceforge.net/projects/kses

     
  8. KSES I had heard of - never looked at it too deeply, but it seems to be a serious attempt.

    OK - when I get some time will do a comparison.

     
  9. What about Input Filter on http://cyberai.com/inputfilter/

     
  10. […] Harry Fuecks blogs about cleaning up form input which is something I need to look into. One of the comments points to PHP Input Filter which looks like it does things the simple way, i.e. I may be able to use it . . . February 21st 2006 Posted to Code […]

     
  11. Helgi Þormar: HTML_Safe is the most recent version, the same as SafeHTML.

    I update both packages simultaneously.

     
  12. Whipped up a little template class based on your idea Harry, of using short tags with htmlentities: http://www.sitepoint.com/forums/showpost.php?p=2529188&postcount=53

    - matt

     
  13. […] For comment markup, what do we want to point Tim at? As mentioned before, SafeHTML (packaged under PEAR as HTML_Safe) would allow posting raw HTML, perhaps with help from tidy to make sure it’s XHTML. There is PHP Markdown (don’t know much about this, e.g. security record / UTF-8 handling) for a fairly standard markup. Alternatively Dokuwiki’s parser could be extracted (with a little hacking) - shouldn’t harm UTF-8 and shouldn’t result in broken XHTML. What else? […]

     

     
