Regexp can’t match string for the life of me

No matter what I try, I can’t get this to work.

I have the following string

var innerHTML = "<strong>​<span id="cke_newspan_1508682170249551" style="color:#F9E500">​​​​​​a</span></strong><br>"

I need to match the letter ‘a’ in the middle. The letter can sometimes be a number as well, and is sometimes followed by either a . or a ) and then more text. There can also be multiple span tags or other kinds of tags enclosing this string.

For some reason, a simple

/>([A-z])+\</.test( innerHTML );

won’t even return true. Any help would be appreciated. Thanks

Hi gmloosemore, welcome to the forum.

Try

var innerHTML = "<strong>​<span id="cke_newspan_1508682170249551" style="color:#F9E500">​​​​​​a</span></strong><br>"; 
console.log(innerHTML);

What does it show in the console?
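If the logged string looks normal but the regex still fails, it may help to list the code points of anything outside the ASCII range. This is only a rough sketch: `findNonAscii` is a made-up helper name, and the sample string uses an escaped zero-width space (U+200B) as a stand-in for whatever invisible characters might be in the real value.

```javascript
// Hypothetical helper: report every non-ASCII UTF-16 code unit
// in a string, with its position and hex value.
function findNonAscii(str) {
  const hits = [];
  for (let i = 0; i < str.length; i++) {
    const code = str.charCodeAt(i);
    if (code > 0x7f) {
      hits.push({
        index: i,
        codePoint: 'U+' + code.toString(16).toUpperCase().padStart(4, '0'),
      });
    }
  }
  return hits;
}

// Assumed sample: a zero-width space written as an escape so it is
// visible in the source code.
const sample = '<span>\u200Ba</span>';
console.log(findNonAscii(sample));
// logs one entry: index 6, codePoint "U+200B"
```

An empty array would mean the string really is plain ASCII and the problem lies elsewhere.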


Hi, thanks for replying so quickly!

OK, so after 4 hours of pulling my hair out I tried pasting the innerHTML string output from the Chrome console into Dreamweaver, and it wouldn’t save, saying some characters were not encoded. Strange, I thought. So I ran innerHTML through


		var bytelike= unescape(encodeURIComponent(innerHTML));
		var innerHTML= decodeURIComponent(escape(bytelike));

then logged innerHTML to the console again and, sure enough, there were lots of strange characters. I don’t know much at all about encoding, but I found

innerHTML = innerHTML.replace(/[^\x00-\x7F]/g, "");

and this seems to have solved my problem; the RegExps now work as they’re supposed to. The innerHTML output comes from a rich text editor called CKEDITOR, which I thought I had mostly figured out, but this was a real curveball. Thanks for your help, though.
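For anyone hitting the same thing, here is a narrower variant as a sketch. It assumes the invisible characters are zero-width ones (U+200B to U+200D, plus U+FEFF), which rich text editors sometimes insert as filler; the real output should be checked before relying on that. `ZERO_WIDTH` and `firstVisibleChar` are made-up names, and the pattern is only one guess at "a letter or digit, optionally followed by . or ) and more text". It also uses `[A-Za-z0-9]` instead of `[A-z]`, because `[A-z]` additionally matches `[`, `\`, `]`, `^`, `_` and the backtick.

```javascript
// Assumption: only zero-width characters need removing.
const ZERO_WIDTH = /[\u200B-\u200D\uFEFF]/g;

// Hypothetical helper: return the first letter or digit that directly
// follows a closing ">" once zero-width characters are removed.
function firstVisibleChar(html) {
  const cleaned = html.replace(ZERO_WIDTH, '');
  const match = />([A-Za-z0-9])[.)]?[^<>]*</.exec(cleaned);
  return match ? match[1] : null;
}

// Sample modelled on the string in the question, with the invisible
// characters written as escapes.
const html =
  '<strong>\u200B<span id="cke_newspan_1508682170249551" ' +
  'style="color:#F9E500">\u200B\u200Ba</span></strong><br>';

console.log(firstVisibleChar(html)); // "a"
console.log(firstVisibleChar('<span>1) more text</span>')); // "1"
```

Removing only the known filler characters keeps any legitimate accented or non-Latin text in the editor content intact.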

As a follow-up, could you maybe better explain what the problem is and how I could go about fixing it? Am I not in UTF-8 or something?

Well, hex x00 to x7F is the ASCII character range,

and that replace regex removes every character outside that range.
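A small sketch of what that means in practice. The sample text is made up for illustration: it contains one accented letter (U+00E9) and one zero-width space (U+200B), both written as escapes.

```javascript
// The pattern from the post: anything that is NOT in \x00-\x7F.
const asciiOnly = /[^\x00-\x7F]/g;

// Made-up sample: "caf" + e-acute + zero-width space.
const text = 'caf\u00E9\u200B';

console.log(text.length);                 // 5
console.log(text.replace(asciiOnly, '')); // "caf"
console.log(text.replace(/\u200B/g, '')); // "café"
```

So the blanket replace also drops legitimate non-ASCII text, while a targeted replace only removes the character named in it.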

Thing is, I’m not seeing any characters in that string that aren’t single byte so I don’t understand why that regex would be needed.

But it does sound like there is an encoding conflict somewhere. IMHO, the best way to avoid encoding problems is to make sure you have UTF-8 without BOM everywhere. (* assuming you don’t need higher)

That means your text editor, HTTP headers, meta tags, database character set and collation: essentially everywhere you can specify it.
