Regular expressions for words with tags, without tags, and both

Hello all, I hope things at this time are going OK for you and your families. I am working on a project, and I need some help with regular expressions. I am trying to create a regular expression that can remove certain types of HTML tags, another one that will ignore them and lastly, one that selects both types. I have made some attempts. However, there is always something missing. I am learning more about regular expressions in the process.

From here on out, I will refer to regular expressions as regex. I want one regex that catches all words surrounded by three types of tags. I only need these three. The tags I need to make part of my pattern are <em>, <span>, and <strong>. The <span> is the only one with attributes, e.g. <span style="color: #3a9ee3;">. I also need to select the closing tags for each. In the example I show, I am highlighting what I want. However, I don’t think I am doing this in the best way.

At times I will need an inverted version of the request above. I need to select words that are not wrapped in any tag. Coming up with the correct regex for this has been more difficult. I always seem to be selecting something I do not want: an unexpected tag, a semicolon, or something else near the word. I sometimes need to select opening and closing parentheses. I have not been able to make the closing one work even once.

I need the last regex to select both the words between the tags and the words without any tag wrapped around them. For the words that have tags, I also need the tags selected.

Could somebody help construct this type of regex?

Here is a link to test at: https://regex101.com/r/5VUKsi/1

/<(em|strong|span)\s?.*?>(.*?)<\/\1>/g ?


Hi, that regex is very general. I want something specific. I am trying to select a specific word when those tags surround it, and I want the tags selected also. If the tags do not surround it, I want it left alone. Thanks for the reply.

That’s what I gave you.

Take the captures from the first pattern, str_replace them with the empty string. What remains is the words not in any of those tags.
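To make that concrete, here is a minimal sketch of the "replace the matches with the empty string" step. It is written in Python rather than PHP (an assumption made only so the example is self-contained), and the sample text is made up, not taken from the regex101 link.

```python
import re

# The pattern from the earlier post, as a Python raw string.
# re.DOTALL lets tag content span line breaks.
TAGGED = re.compile(r"<(em|strong|span)\s?.*?>(.*?)</\1>", re.DOTALL)

# Hypothetical sample text.
sample = "<p>An <em>alligator</em> and a lion met a <strong>Lion</strong>.</p>"

# Replace every tagged match with the empty string;
# what is left is the text that was not inside em/strong/span.
remaining = TAGGED.sub("", sample)
print(remaining)
```

Note that the surrounding <p> tags and the spacing are left behind, so the remainder still needs some cleanup before you look for words in it.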

This is already done as part of the first regex - the whole pattern (pattern ‘0’) is the thing with the tags, pattern 1 is the name of the tag it matches, and pattern 2 is the contents of the tag.
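A small sketch of how those three pieces come out of one match, in Python (an assumption; other regex flavors number the groups the same way). The sample string is hypothetical.

```python
import re

TAGGED = re.compile(r"<(em|strong|span)\s?.*?>(.*?)</\1>", re.DOTALL)

# Hypothetical sample text.
sample = 'The <span style="color: #3a9ee3;">alligator</span> saw a <em>Lion</em>.'

# group(0): whole match including tags, group(1): tag name, group(2): contents
matches = [(m.group(0), m.group(1), m.group(2)) for m in TAGGED.finditer(sample)]

for whole, tag, inner in matches:
    print(f"whole={whole!r} tag={tag!r} inner={inner!r}")
```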

If you need a specific word to occur inside the tag, change the .*? to .*?MySpecificWord.*?

When I say surrounded, I mean surrounded and enclosed. When I changed the .*? to .*?MySpecificWord.*?, the match runs from the first opening tag (with all the text in the paragraph) all the way to the last closing tag. Thanks
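A likely reason for this: a lazy .*? is still allowed to cross other tags if that is the only way to reach the word. One possible fix, sketched here in Python as an assumption rather than a tested answer for the regex101 sample, is to replace .*? with character classes that cannot cross a tag boundary. "alligator" stands in for MySpecificWord.

```python
import re

# Hypothetical sample: a tagged Lion followed by a tagged alligator.
sample = "<em>Lion</em> and <em>alligator</em>"

# Broad version: .*? may run across "</em> and <em>" to reach the word.
broad = re.compile(r"<(em|strong|span)\s?.*?>(.*?alligator.*?)</\1>")

# Tighter version: [^>]* stays inside the opening tag and
# [^<]* stays inside the tag's own text, so no other tag can be crossed.
tight = re.compile(r"<(em|strong|span)\b[^>]*>([^<]*alligator[^<]*)</\1>")

broad_match = broad.search(sample).group(0)
tight_match = tight.search(sample).group(0)
print(broad_match)
print(tight_match)
```

The tighter version assumes the word sits directly inside the tag, with no nested tags between the opening tag and the word.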

Works for me…

Hi, and thanks for all the help. Essential to what I match is the word alligator in different forms. You notice that all your versions also capture the word Lion when it is in tags. I think that is the challenge. Alligator represents dynamic text which will then make the pattern unique. To be clear, I only want to capture tags that have a specific word. Not anything wrapped in the tag. The words are replaced as needed.
Can you show how to write the regex so that it captures only Alligator, alligator, Alligators, and alligators? These words are only placeholders for all the words I will swap in for alligator. However, when we change alligator to chicken, it should only select chicken wrapped in tags, not Lion or alligator wrapped in tags. Thanks so much

Ok to be honest you did say originally to capture any words inside of the three tags. Now you are talking about something slightly different where the word in the tags is also part of the pattern. For that you are going to have to have a pattern that best matches the words you do want to target. I don’t know how many words you plan on matching but you can take m_hutley’s pattern and just create the regex to match your words in addition to the tags.

I don’t know if you will be able to catch all variations either, because the words themselves can have interesting variations. Sure, “Alligator” and “Alligators” differ only by an added “s”, but if you want to match “Dingo” and “Dingoes” then the pattern has to be slightly different for that case.

You might be better off with a more generic pattern match to pull out the matches, then process the matches according to what unique requirements you have for the words you are targeting. Hopefully you get what I am saying.
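Roughly what I mean, as a sketch in Python (an assumption; the same idea works in any language): match everything in the three tags generically, then decide in code which matches to keep. The WANTED set and the sample text are hypothetical.

```python
import re

# Generic pattern: any content inside em/strong/span.
GENERIC = re.compile(r"<(em|strong|span)\b[^>]*>(.*?)</\1>", re.DOTALL)

# Hypothetical list of target words, compared case-insensitively.
WANTED = {"alligator", "alligators"}

sample = "<em>Lion</em>, <strong>Alligators</strong> and <em>alligator</em>"

# Keep only matches whose contents are exactly one of the wanted words.
hits = [
    m.group(0)
    for m in GENERIC.finditer(sample)
    if m.group(2).strip().lower() in WANTED
]
print(hits)
```

Irregular plurals like “Dingoes” then become a matter of adding an entry to the set instead of reworking the regex.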

If you only have a few set words and their variations are all similar, perhaps we can help you create the pattern for the words, but at this point we don’t know what words you would be targeting.

The original version I sent does it. However, I did a lot of guessing and kept wondering what I might be missing. I just wanted someone with more experience to look at it and see if the regex could be written more reliably and concisely.
I should also add that I do not need to match “Dingo” and “Dingoes”; it would be [Dd]ingo(s)? only, because regex allows for this. Sometimes it will only be dingo, but never a different spelling.
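For reference, this is roughly the shape I am after, sketched in Python (an assumption, since I have only tested on regex101). The helper name tagged_word_pattern is made up for illustration, and it assumes the word starts with a letter and only ever takes a plain "s" plural.

```python
import re

def tagged_word_pattern(word: str) -> re.Pattern:
    """Build a pattern for `word` (optionally plural) wrapped in em/strong/span."""
    first, rest = word[0], word[1:]
    # Same form as [Dd]ingo(s)? : either case for the first letter, optional s.
    word_re = f"[{first.upper()}{first.lower()}]{re.escape(rest)}s?"
    return re.compile(rf"<(em|strong|span)\b[^>]*>\s*({word_re})\s*</\1>")

# Hypothetical sample text.
sample = (
    'An <em>Alligator</em>, two <strong>alligators</strong>, '
    'a <span style="color: #3a9ee3;">Lion</span> and a plain alligator.'
)

alligator_hits = [m.group(0) for m in tagged_word_pattern("alligator").finditer(sample)]
chicken_hits = [m.group(0) for m in tagged_word_pattern("chicken").finditer(sample)]
print(alligator_hits)
print(chicken_hits)
```

Swapping the word changes what is selected: the tagged Lion and the untagged alligator are left alone in both cases.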
Thanks

I appreciate the great help and advice on the first part. Now I would like to get some help on the second part. That is, a regex that would select only the specific word when it is not wrapped in <em>, <strong>, or <span>. The <span> has a color attribute a lot of the time. The challenge I have had with this one is working close to other tags, e.g., <p>, <div>, and their closing tags. Thanks

I think at this point you’re wandering beyond regex and probably would find better luck with a DOMParser or even an XML interpreter.

Could you name some? Thanks

PHP: DOMDocument::loadHTML - Manual
DOMDocument would be the PHP inbuilt system for DOM manipulation. You would essentially have to filter down to the node types required, and then check each for their content containing the desired word.

The problem with regex is that it doesn’t really do context well. It doesn’t understand whether part of the string is inside a tag or not; it simply has “a string”, which it can then evaluate.
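To show the idea of letting a parser carry the context, here is a minimal sketch using Python's standard html.parser module instead of PHP's DOMDocument (an assumption, chosen so the example runs on its own). The class name UnwrappedWordFinder and the sample text are made up. It keeps a counter of how many em/strong/span tags are currently open and only looks for the word in text seen while that counter is zero.

```python
import re
from html.parser import HTMLParser

INLINE_TAGS = {"em", "strong", "span"}

class UnwrappedWordFinder(HTMLParser):
    """Collect occurrences of a word that are NOT inside em/strong/span."""

    def __init__(self, word):
        super().__init__()
        # The word with an optional plural "s", any letter case.
        self.word_re = re.compile(rf"\b{re.escape(word)}s?\b", re.IGNORECASE)
        self.depth = 0      # number of currently open em/strong/span tags
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag in INLINE_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in INLINE_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Text inside <p>, <div>, etc. still counts as "not wrapped".
        if self.depth == 0:
            self.found.extend(self.word_re.findall(data))

# Hypothetical sample text.
sample = (
    '<p>An alligator met <em>Alligators</em> and (alligators) near a '
    '<span style="color: #3a9ee3;">alligator</span>.</p>'
)

finder = UnwrappedWordFinder("alligator")
finder.feed(sample)
finder.close()
print(finder.found)
```

Parentheses, semicolons, and nearby <p> or <div> tags stop being a problem here because the regex only ever sees plain text, never markup.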