Need help with my code "cleaner"

The developers in my office constantly get MS Word files to convert to HTML to post online. The newer versions of Dreamweaver make short work of this, as you can paste the document into the WYSIWYG view of the editor and the resulting code isn’t far off. This has worked well for a number of years.

Last year my office upgraded to Word 2007, and it has a lovely feature called “smart quotes”. Unfortunately, they’re not very smart. They aren’t ASCII at all: they’re Unicode characters up in the 8200 range (8216 through 8221), and they make a mess out of documents. Word 2007 also uses non-ASCII characters for hyphens, dashes, etc. All horrible.

I recently built a code cleaning script that successfully changes the silly quotes to the HTML-friendly versions, and it’s getting a good bit of use. So much so that I’d like to extend it and make it do more.
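For anyone following along, the quote-fixing part can be sketched roughly like this. It is not my exact script: the variable name `result` and the choice of named HTML entities as the "HTML-friendly versions" are assumptions for illustration, and you could just as easily swap in straight quotes.

```cfm
<!--- Sketch only: "result" is assumed to hold the pasted document text. --->
<!--- Chr() takes the decimal Unicode code point of each Word character. --->
<cfset result = Replace(result, Chr(8216), "&lsquo;", "ALL")> <!--- left single quote --->
<cfset result = Replace(result, Chr(8217), "&rsquo;", "ALL")> <!--- right single quote / apostrophe --->
<cfset result = Replace(result, Chr(8220), "&ldquo;", "ALL")> <!--- left double quote --->
<cfset result = Replace(result, Chr(8221), "&rdquo;", "ALL")> <!--- right double quote --->
<cfset result = Replace(result, Chr(8211), "&ndash;", "ALL")> <!--- en dash --->
<cfset result = Replace(result, Chr(8212), "&mdash;", "ALL")> <!--- em dash --->
```

Plain `Replace` is enough here since each target is a single literal character, so no regex is needed for this step.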

One of the artifacts of the Word-Dreamweaver paste is that table rows and cells (th, td, tr) retain width and valign values. This is something we always have to get rid of, so I thought I’d add it to my cleaner.

So I started with something simple, like this:

<cfset result = REReplace(result, "<td(.+?)>", "<td>", "ALL")>

Which works fine. Too fine in fact, as it strips ALL attributes, of course, even the two we want to keep: rowspan and colspan.

My regex skills are bare novice, and my attempts at making ColdFusion save the rowspan and colspan values as variables, strip the tag, then plug them back in haven’t gone well.

So I need a better way to do this, or someone who knows regexes better than I do (which isn’t difficult, hah!) Any ideas?

I too have experienced this challenge. Short of using search/replace in Notepad++ or similar, I haven’t found a solution either. I’m sure a regular expression would do the trick, but I’m not very strong with them. A high quality cleaner just for programmers with this very problem would sure be useful!

I ran across someone on another forum who’s quite handy with regexes, and he told me that what I’m trying to do, in the way I want to do it, would require more than just a simple regex. This is because the order of the attributes can be anything - colspan or rowspan could come first, second, third, etc., and you could have both rowspan and colspan in the same tag.
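One way around the attribute-order problem, offered only as a rough sketch: instead of saving rowspan and colspan and plugging them back in, remove one unwanted attribute per tag on each pass and repeat until nothing changes. This assumes the regex engine in your ColdFusion version supports negative lookahead (MX and later should), that attribute values are double-quoted as Dreamweaver normally writes them, and that the text lives in a variable called `result`. The `pattern` and `previous` names are just placeholders.

```cfm
<!--- Sketch only. Group 1 is the start of a td/th/tr tag up to, but not
      including, one attribute that is NOT rowspan or colspan. --->
<cfset pattern = '(<t[dhr]\b[^>]*?)\s+(?!rowspan|colspan)[a-z:-]+\s*=\s*"[^"]*"'>

<!--- Each pass drops at most one attribute per tag, so loop until the
      text stops changing. Compare() is a case-sensitive string compare. --->
<cfset previous = "">
<cfloop condition="Compare(previous, result) NEQ 0">
    <cfset previous = result>
    <cfset result = REReplaceNoCase(result, pattern, "\1", "ALL")>
</cfloop>
```

Because the lookahead skips rowspan and colspan wherever they sit in the tag, the order no longer matters. Unquoted values such as valign=top would slip through and need their own pattern.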

So I think I’m going to slide this to the back burner for now.

Thanks so much for posting a follow-up message to let me know. I really do need to dedicate a few days full time to learning more about regexes… I’m certain it would make me more efficient (for instance, in this case I would have known that a single simple regex can’t do it because the attributes can show up in pretty much any order, good point).

It’s too bad you can’t just strip the width and valign attributes wherever they appear and leave everything else alone. That would be relatively easy. Unfortunately sometimes the tedious and repetitive way is the only way.

Yeah, I need to get rid of width, height, valign, align, and maybe even some MS garbage.

I suppose I could just do multiple passes and replace things like width="(.+?)" with nothing, and have a separate line for each of the bad attributes. Then I could have a final one that replaced <td > with <td> and be done.
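Roughly, the multi-pass idea might look like the sketch below. It assumes the text is in `result` and that attribute values are double-quoted. `badAttributes` and `attr` are made-up names. Note that it strips these attributes from every tag, not just table cells, so an img width would go too.

```cfm
<!--- Sketch only: one pass per unwanted attribute. --->
<cfset badAttributes = "width,height,valign,align">

<cfloop list="#badAttributes#" index="attr">
    <!--- The leading \s+ means "align" cannot match inside "valign",
          and it also removes the space in front of the attribute. --->
    <cfset result = REReplaceNoCase(result, '\s+#attr#\s*=\s*"[^"]*"', "", "ALL")>
</cfloop>

<!--- Final tidy-up: turn any leftover <td >, <th > or <tr > into <td>, <th>, <tr>. --->
<cfset result = REReplaceNoCase(result, "<(t[dhr])\s+>", "<\1>", "ALL")>
```

Using `[^"]*` rather than `(.+?)` for the value also catches empty values like width="" and can never run past the closing quote.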