Tidy broken HTML

I have no idea if this is the right forum for this question, but here we go…
It’s about the Linux program tidy. I’m trying to use it to fix a broken piece of HTML that contains MS Word rubbish. I’m using the ‘-modify’ option to change the file in place, but because of the errors in the HTML, tidy refuses to write the tidied result. Here’s what some of the HTML looks like:

<font color="#cc0000">
<span style="font-family: Verdana">
<p><font size="3" color="#000000">&nbsp;</font></p>
</span>
<span style="font-family: Verdana">I have to spend Thursday in <city w:st="on"><place w:st="on">Westminster</place></city>, on business.
I usually travel to <city w:st="on"><place w:st="on">London</place></city>.
</span>
</font>

And here’s how I invoke the tidy command:


$ tidy -version
HTML Tidy for Linux/x86 released on 1 September 2005
$ tidy tidy.test.lite

And the output is this:


line 1 column 1 - Warning: missing <!DOCTYPE> declaration
line 2 column 1 - Warning: missing </span> before <p>
line 1 column 1 - Warning: missing </font> before <p>
line 3 column 4 - Warning: inserting implicit <font>
line 3 column 4 - Warning: inserting implicit <span>
line 4 column 1 - Warning: discarding unexpected </span>
line 5 column 64 - Error: <city> is not recognized!
line 5 column 64 - Warning: discarding unexpected <city>
line 5 column 80 - Error: <place> is not recognized!
line 5 column 80 - Warning: discarding unexpected <place>
line 5 column 108 - Warning: discarding unexpected </place>
line 5 column 116 - Warning: discarding unexpected </city>
line 6 column 21 - Error: <city> is not recognized!
line 6 column 21 - Warning: discarding unexpected <city>
line 6 column 37 - Error: <place> is not recognized!
line 6 column 37 - Warning: discarding unexpected <place>
line 6 column 60 - Warning: discarding unexpected </place>
line 6 column 68 - Warning: discarding unexpected </city>
line 8 column 1 - Warning: discarding unexpected </font>
line 1 column 1 - Warning: inserting missing 'title' element
line 2 column 1 - Warning: trimming empty <span>
line 1 column 1 - Warning: trimming empty <font>
Info: Document content looks like HTML 4.01 Transitional
18 warnings, 4 errors were found!

This document has errors that must be fixed before
using HTML Tidy to generate a tidied up version.

You are recommended to use CSS to specify the font and
properties such as its size and color. This will reduce
the size of HTML files and make them easier to maintain
compared with using <FONT> elements.

To learn more about HTML Tidy see http://tidy.sourceforge.net
Please send bug reports to html-tidy@w3.org
HTML and CSS specifications are available from http://www.w3.org/
Lobby your company to join W3C, see http://www.w3.org/Consortium

The question is, how do I get tidy to stop whining about the errors and fix them too?

You might want to try one of the online services for tidying up your HTML.

http://valet.htmlhelp.com/tidy/
http://validator.aborla.net/

If you want to do it locally, take heed of the warnings and repair each one manually.

Really?!?! Do I need to parse the error output?

(Not interested in doing it via the web)

Hi!

There is an option that forces output even when you get errors like:
“Error: <myowntag> is not recognized!”

force-output:
http://tidy.sourceforge.net/docs/quickref.html#force-output
“This option specifies if Tidy should produce output even if errors are encountered. Use this option with care - if Tidy reports an error, this means Tidy was not able to, or is not sure how to, fix the error, so the resulting output may not reflect your intention.”
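
For the file in the question, that could look roughly like the sketch below. The helper name tidy_fix and the log file name tidy.errors.log are made up for illustration. The flags (-m, -f, --force-output) are documented Tidy options, but it is worth confirming them with `tidy -help` on the 2005 build.

```shell
# Hypothetical helper: tidy a file in place even if Tidy reports errors.
tidy_fix() {
    # -m                  : modify the input file in place (same as -modify)
    # --force-output yes  : write output even when errors were found
    # -f FILE             : send warnings and errors to FILE, not the terminal
    # tidy.errors.log is an assumed name; pick whatever suits you.
    tidy -m --force-output yes -f tidy.errors.log "$1"
}
```

Call it as `tidy_fix tidy.test.lite`. Tidy normally still exits with a non-zero status when it found warnings or errors, so check the file and the log rather than the exit code alone. As the quoted warning says, unknown tags such as <city> and <place> get discarded, so review the result.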

HTH,
+Robi
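
Besides force-output, another route may be to tell Tidy about the Word smart tags so it no longer treats them as errors. This is only a sketch: the helper name tidy_word_tags is invented here, and the tag list (city, place) is taken from the error messages in the question. new-inline-tags is a documented Tidy option, but whether it clears every error in a real Word export is an assumption.

```shell
# Hypothetical helper: declare Word's smart tags as inline tags, then tidy in place.
tidy_word_tags() {
    # -m                : modify the input file in place
    # -q                : suppress the non-essential chatter at the end
    # --new-inline-tags : treat the listed names as known inline elements
    # The list "city,place" comes from the errors in the question; extend as needed.
    tidy -m -q --new-inline-tags city,place "$1"
}
```

Call it as `tidy_word_tags tidy.test.lite`. Unlike force-output, this keeps the <city> and <place> elements in the output instead of discarding them, so the file will still carry the Word markup.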