Removing JavaScript event attributes

Hi,

I have some code to clean up an HTML document before doing additional processing. One of the cleanup steps is to remove all JavaScript event attributes from HTML tags (such as onclick, onblur, etc.). I have the following code, but it seems to have problems when the JavaScript contains a \". I'm not so great with regular expressions, so I'm not really sure how to have it exclude the \" sub-pattern. Any help on how to make this regex better would be appreciated!


$html = preg_replace(
    '#(onabort|onactivate|onafterprint|onafterupdate|onbeforeactivate'
    . '|onbeforecopy|onbeforecut|onbeforedeactivate|onbeforeeditfocus|onbeforepaste'
    . '|onbeforeprint|onbeforeunload|onbeforeupdate|onblur|onbounce|oncellchange'
    . '|onchange|onclick|oncontextmenu|oncontrolselect|oncopy|oncut|ondataavailable'
    . '|ondatasetchanged|ondatasetcomplete|ondblclick|ondeactivate|ondrag|ondragdrop'
    . '|ondragend|ondragenter|ondragleave|ondragover|ondragstart|ondrop|onerror'
    . '|onerrorupdate|onfilterupdate|onfinish|onfocus|onfocusin|onfocusout|onhelp'
    . '|onkeydown|onkeypress|onkeyup|onlayoutcomplete|onload|onlosecapture'
    . '|onmousedown|onmouseenter|onmouseleave|onmousemove|onmouseout|onmouseover'
    . '|onmouseup|onmousewheel|onmove|onmoveend|onmovestart|onpaste|onpropertychange'
    . '|onreadystatechange|onreset|onresize|onresizeend|onresizestart|onrowexit'
    . '|onrowsdelete|onrowsinserted|onscroll|onselect|onselectionchange|onselectstart'
    . '|onstart|onstop|onsubmit|onunload)\\s*=\\s*".*?"#is',
    '',
    $html
);

Thanks!
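
For the narrow question of the \" sub-pattern, one possible adjustment is to replace the lazy .*? with a group that consumes a backslash together with the character after it. The sketch below is only an illustration: the function name is made up, and it matches any attribute starting with "on" instead of the explicit list above, which is an assumption. Note that HTML itself has no backslash escape inside attribute values, so this only helps when the input really does contain \" in that position.

```php
<?php
// Hypothetical helper, not from the original post.
function strip_event_attributes_regex(string $html): string
{
    // The group (?:[^"\\]|\\.)* consumes either an ordinary character or a
    // backslash plus the character after it, so \" does not end the match.
    $pattern = '#\s+on[a-z]+\s*=\s*"(?:[^"\\\\]|\\\\.)*"#is';

    // preg_replace() returns null on a PCRE error; fall back to the input.
    return preg_replace($pattern, '', $html) ?? $html;
}

$sample = '<a href="#" onclick="alert(\"hi\"); return false;" class="x">link</a>';
echo strip_event_attributes_regex($sample), "\n";
```

This still only handles double-quoted values and can misfire on text that merely looks like an attribute, so it is best treated as a stopgap.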

Use HTML Purifier. Parsing HTML and JavaScript correctly with regular expressions is very difficult.
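
If a full library is more than is needed, a parser-based approach can be sketched with PHP's bundled DOM extension. This is a minimal sketch, not a vetted sanitizer: the function name is hypothetical, and it assumes that dropping every attribute whose name starts with "on" is acceptable for the page in question.

```php
<?php
// Hypothetical helper: removes every attribute whose name starts with "on".
function remove_event_attributes(string $html): string
{
    $doc = new DOMDocument();

    // Collect libxml warnings internally; real-world markup is rarely valid.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);

    // Select all attribute nodes named on*, copied to an array before removal.
    $attrs = iterator_to_array($xpath->query('//@*[starts-with(name(), "on")]'));
    foreach ($attrs as $attr) {
        $attr->ownerElement->removeAttributeNode($attr);
    }

    // Serializes the whole document, including the head section.
    return $doc->saveHTML();
}
```

Because the parser decides where each attribute value ends, quoting inside the JavaScript is no longer the regex's problem. Character encoding may need extra care, since loadHTML() does not assume UTF-8 unless the document declares it.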

This might be helpful.

I'll second this, with the addendum that if you're filtering input from untrusted sources (guest users), you're probably better off using a BBCode library. Attaching event handlers directly through attributes like this isn't the only way to attach JavaScript events to an object.
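
To illustrate that last point, two other common routes are script elements and javascript: URLs. The sketch below removes both with the DOM extension; the function name is hypothetical, only href and src attributes are checked (an assumption), and it is nowhere near an exhaustive list of the ways script can get into a page.

```php
<?php
// Hypothetical helper: strips <script> elements and javascript: URLs.
function remove_scripts_and_js_urls(string $html): string
{
    $doc = new DOMDocument();

    // Collect libxml warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);

    // Remove every script element, wherever it appears in the document.
    foreach (iterator_to_array($xpath->query('//script')) as $script) {
        $script->parentNode->removeChild($script);
    }

    // Remove href/src attributes whose value starts with "javascript:".
    foreach (iterator_to_array($xpath->query('//@href | //@src')) as $attr) {
        if (preg_match('/^\s*javascript:/i', $attr->value)) {
            $attr->ownerElement->removeAttributeNode($attr);
        }
    }

    return $doc->saveHTML();
}
```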

HTML Purifier looks pretty solid. Unfortunately, it only parses what is within the body tags :(. I'm writing something where I need to parse an entire web page. Basically, I want to strip out the JavaScript in order to clean out some crud before the parser attempts to extract data from the page.

I’m not familiar enough with it to tell you what to do, but I’m sure you just need to tweak it a bit. There’s TONS of configuration options and extensibility. They have a forum.

From the HTML Purifier configuration documentation it appears that the head section is not supported at all.

From the HTML.AllowedElements section:

Note that this method is subtractive: it does its job by taking away from HTML Purifier's usual feature set, so you cannot add a tag that HTML Purifier never supported in the first place (like embed, form or head).