Regex Help Needed

I hate doing this, but I’ve always fallen short on regex, and since I try to stay away from it as much as possible, I don’t get enough exposure to it to become better. I’m stealing from http://stackoverflow.com/questions/3026096/remove-all-attributes-from-an-html-tag. I need to allow for href= attributes in this expression. Is that possible?

$String = preg_replace("/<([a-z][a-z0-9]*)[^>]*?(\\/?)>/i",'<$1$2>', $String);

In order to allow for the attributes, you simply add the equals sign as a matching character (such as [a-z0-9=]).
If, on the other hand, you want to also capture the value of the ‘href’ attribute you get into the realm that causes the [often religious] argument AGAINST using Regular Expressions for parsing HTML.

By the way, this particular RegExp assumes the entire string consists of only lowercase characters. Using [a-zA-Z] or \w would be more thorough.

My goal is to remove all attributes from any html tags in the given string, EXCEPT FOR href= attribute.

So, modifying this to allow for upper and lowercase as you suggested, I should have this?

$String = preg_replace("/<([a-z][A-Z][a-z0-9]*)[^>]*?(\\/?)>/i",'<$1$2>', $String);

I added [A-Z] after [a-z].

A more robust solution might be to use something like HTML Purifier.

Interesting. I think that may be a little overkill for what I’m looking to do, though. :)

You actually should do it exactly as I stated above: use [a-zA-Z], or the shortcut \w (which also matches digits and the underscore).

As I said, though, the syntax of HTML is so [very] loose that RegExp is often NOT the right solution, unless you consider accomplishing it with multiple passes.
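For what it's worth, the multiple-pass idea could look something like the sketch below: one pass finds each opening tag, and a second pass inside the callback pulls out a double-quoted href to keep. The function name strip_attributes_except_href is made up for illustration, and the sketch assumes attribute values never contain a > character and that href values are double-quoted.

```php
<?php
// Hypothetical helper (the name is not from this thread).
// Assumes no ">" inside attribute values and double-quoted href values.
function strip_attributes_except_href(string $html): string
{
    // Pass 1: match every opening or self-closing tag.
    // $m[1] = tag name, $m[2] = raw attribute text, $m[3] = optional "/".
    return preg_replace_callback(
        '/<([a-z][a-z0-9]*)([^>]*?)(\/?)>/i',
        function (array $m): string {
            $keep = '';
            // Pass 2: look for href="..." inside the attribute text only.
            // The leading \s avoids picking up names such as data-href.
            if (preg_match('/\shref\s*=\s*"([^"]*)"/i', $m[2], $href)) {
                $keep = ' href="' . $href[1] . '"';
            }
            // Rebuild the tag with nothing but the kept href.
            return '<' . $m[1] . $keep . $m[3] . '>';
        },
        $html
    );
}

echo strip_attributes_except_href('<a class="x" href="/p?id=1" target="_blank">link</a>');
// prints: <a href="/p?id=1">link</a>
```

Closing tags such as </a> are left alone because the pattern requires a letter right after the <.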

So this, then?

$String = preg_replace("/<([a-zA-Z][a-z0-9]*)[^>]*?(\\/?)>/i",'<$1$2>', $String);

No, that change isn’t needed: the “case insensitive” flag added to the end (/[regex]/i) already makes [a-z] match uppercase letters too.
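A quick way to see the effect of the /i flag, and why the [a-z][A-Z] attempt is a problem: with that change the tag name has to be at least two letters long, so tags like <p> or <a> are no longer matched at all. The variable names here are just for illustration.

```php
<?php
// The pattern from the question, unchanged. Note the trailing /i modifier.
$pattern = "/<([a-z][a-z0-9]*)[^>]*?(\\/?)>/i";

$lower = preg_replace($pattern, '<$1$2>', '<div class="a">x</div>');
$upper = preg_replace($pattern, '<$1$2>', '<DIV CLASS="a">x</DIV>');

// The [a-z][A-Z] attempt: two letter classes in a row, so a
// one-letter tag name can no longer match.
$twoLetter = "/<([a-z][A-Z][a-z0-9]*)[^>]*?(\\/?)>/i";
$short = preg_replace($twoLetter, '<$1$2>', '<p class="a">x</p>');

echo $lower, "\n"; // <div>x</div>
echo $upper, "\n"; // <DIV>x</DIV>  (uppercase already handled by /i)
echo $short, "\n"; // <p class="a">x</p>  (unchanged, the tag was not matched)
```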

With that said, the regex does not do what the OP wants, since it actually will allow all attributes instead of stripping them out.

Try this regex. Please note that you should make certain that the character class for the URL content contains all the symbols you require for matching your links.


$string = preg_replace('#<a [^>]*?(href="([\w%?&.=/\\\\ ]*)")[^>]*?>(.*?)</a>#i', '<a href="$2">$3</a>', $string);

Please note that I have not tested it, but it should work like that. If it doesn’t, add a reply and let me know.
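Since the pattern above is untested, a small check along these lines may help before relying on it. This is only a sketch: the function name keep_only_href is hypothetical, and the character class (with : and - added so that absolute URLs match) is an assumption you would need to adjust for your own links. Writing the pattern in single quotes keeps the double quotes inside it from ending the PHP string.

```php
<?php
// Hypothetical helper (not from this thread). Handles <a> tags only.
function keep_only_href(string $html): string
{
    // Single-quoted PHP string, so the inner double quotes are safe.
    // $1 = href value, $2 = link text. The "s" modifier lets "." match
    // newlines in case the link text spans several lines.
    $pattern = '#<a\s[^>]*?href="([\w%?&.=/:\\\\ -]*)"[^>]*>(.*?)</a>#is';

    return preg_replace($pattern, '<a href="$1">$2</a>', $html);
}

echo keep_only_href('<a class="btn" href="http://example.com/a?b=1&c=2" id="x">Go</a>');
// prints: <a href="http://example.com/a?b=1&c=2">Go</a>
```

Links whose href contains a character outside the class are left untouched, which makes gaps in the class easy to spot when testing.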