Another RegEx question, coz I know you guys just LOVE THEM. :)

Hello, all,

I’ve got a small bit of client-side JavaScript form validation that I am working on. Yes, I’m using server-side, too.

I’ve got a script that will trim whitespace from the beginning and end of the string, and will then remove any and all HTML tags, but will leave the content between the tags alone.

The good news: as long as there are both opening and closing HTML tags, it works great!

The bad news: if there is an open tag with no close tag, or vice-versa, it breaks.

Here’s what I have, so far:

function trimData(formObj){
    var elms = formObj.elements;
    for(var i=0; i<elms.length; i++){
        switch(elms[i].type){
            case 'text':
            case 'textarea':
                elms[i].value = elms[i].value.replace(/^\s*|\s*$/gi,'');
                elms[i].value = elms[i].value.replace(/<\s*[^>]*>(.*?)(<[\s|\/]*[^>]*>)/gim,'$1');
            break;
        }
    }
}

I’m pretty sure I’m not doing something correctly for the second backreference of the second replace. Any thoughts/ideas?

Thanks,

:slight_smile:

Yes, you’re attempting to use regex to parse HTML, which only leads to trouble, as is amply demonstrated in this answer on using regular expressions to match HTML tags.

Have you considered using innerText instead?

You could parse the document via the Document Object Model rather than simply reading all the source.

There is an example of code to extract all the text nodes from a web page at http://javascriptexample.net/dom09.php

[quote=“Paul_Wilkins, post:2, topic:196304”]
Have you considered using innerText instead?
[/quote]

Thanks, @Paul_Wilkins and @felgall, for your input. However, I’m not using this to change any .htm(l) pages. This is for form inputs (text and textarea) to try to strip out any malicious attempts at hacking via form.

V/r,

:slight_smile:

Okay… I think I’ve got it. I was making it way more complex than it really needed to be.

elms[i].value = elms[i].value.replace(/<[\s|\/]*[^>]*>/gim,'');

This will remove just tags, regardless of open or close, and leave everything in between alone.
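
For anyone reading along, here is a minimal sketch of that idea as a standalone function. The name cleanValue is made up for this example and is not part of the original script. It uses the simpler pattern `<[^>]*>`, which should match the same strings, since `[^>]*` already covers whitespace and slashes.

```javascript
// Hypothetical helper (the name is an assumption, not from the original script).
// Trims the value, then removes anything that looks like a tag, open or close,
// without trying to pair tags up.
function cleanValue(value) {
    return String(value)
        .replace(/^\s+|\s+$/g, '')  // trim leading and trailing whitespace
        .replace(/<[^>]*>/g, '');   // drop each <...> run, keep everything else
}

console.log(cleanValue('  <b>bold</b> text '));  // paired tags
console.log(cleanValue('<b>unclosed'));          // open tag with no close
console.log(cleanValue('stray</i>'));            // close tag with no open
```

Because each tag is matched on its own, an unbalanced tag no longer matters. Note that a lone `<` with no `>` after it is left in place.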

V/r,

:slight_smile:

UPDATE: I updated the regex to remove HTML entities, too.

elms[i].value = elms[i].value.replace(/(<[\s|\/]*[^>]*>|&.{0,}?;)/gim,'');
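
One caveat with the entity part: `&.{0,}?;` matches anything from an ampersand up to the next semicolon on the same line, so ordinary text containing "&" and ";" gets removed as well. The sketch below compares it with a narrower pattern that only matches things shaped like named or numeric entities. Both variable names are invented for the example, and the narrower pattern is only one possible choice.

```javascript
// The entity pattern from the update above, on its own.
var looseEntity = /&.{0,}?;/g;

// A narrower alternative (an assumption, not from the original script):
// named entities like &lt;, decimal like &#8211;, hex like &#x2014;.
var namedOrNumericEntity = /&(?:[a-z][a-z0-9]*|#[0-9]+|#x[0-9a-f]+);/gi;

var plain = 'Tom & Jerry; then lunch';
var encoded = 'a &lt; b &amp; c &#8211; d';

console.log(plain.replace(looseEntity, ''));            // 'Tom  then lunch'
console.log(plain.replace(namedOrNumericEntity, ''));   // unchanged
console.log(encoded.replace(namedOrNumericEntity, '')); // 'a  b  c  d'
```

The narrower pattern can still remove legitimate text that happens to look like an entity, so it reduces the problem without eliminating it.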

You could make it even less complicated, and have the server pass all values intended for display through an htmlentities function (if using PHP for example) so that they are all safe for display and won’t end up being accidentally used as tags.

That’s the most reliable way of handling things, by using no JavaScript at all.

As indicated in my original post, I am using both client-side and server-side validation/sanitization. For users who have JS enabled, this is an additional measure. For those who disable JS (or run browsers that don’t support it), the multiple server-side processes that handle the form input do even more to validate and sanitize than what I’m doing with JS. It’s more code to write, but the redundancy is, IMHO, worth the effort.

V/r,

:slight_smile:

How about not trying to match pairs of open/close tags, and instead just matching tags?

.replace(/<[^>]*>/g,'')

That being said, I think stripping tags is almost always the wrong way to go. Doing so means your users can’t ever talk about <html>, and even some simple math statements would get stripped out (102<x, 4>2). So don’t strip tags. Instead, escape special HTML characters only at the moment when you output to HTML.
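
To make the difference concrete, here is a rough sketch comparing the two approaches on the same input. Both function names are made up for illustration, and escapeHtml is only a minimal stand-in for whatever escaping routine your server-side language provides (it is shown in JavaScript just to keep the thread in one language).

```javascript
// Tag stripping, using the pattern from this post.
function stripTags(value) {
    return String(value).replace(/<[^>]*>/g, '');
}

// Hypothetical escaping helper (name and character map are assumptions).
// Replaces the five characters that are special in HTML text and attributes.
function escapeHtml(value) {
    var map = { '&': '&amp;', '<': '&lt;', '>': '&gt;', '"': '&quot;', "'": '&#39;' };
    return String(value).replace(/[&<>"']/g, function (ch) {
        return map[ch];
    });
}

var input = 'Is 102<x, 4>2 valid inside <html>?';

console.log(stripTags(input));   // 'Is 1022 valid inside ?'
console.log(escapeHtml(input));  // 'Is 102&lt;x, 4&gt;2 valid inside &lt;html&gt;?'
```

Stripping silently changes what the user wrote, while escaping keeps every character and still renders safely when the escaped string is written into HTML.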

That’s what I wound up doing, after all… just get the tag, open or close, don’t worry about anything else. I also updated it to remove HTML entities like &lt; or &amp; or &#8211;.

This isn’t for a forum, or anything like that. This is for several forms where companies can submit company information to the US DoD for consideration in different programs. I am trying to reduce attack surface and attack vectors.

Thanks,

:slight_smile:
