HTML parser

Does anyone know of a good, easy-to-use HTML parser? I’m just looking for something that removes unrecognized and unwanted HTML tags to produce cleaner HTML, so that the result can then be converted into XML.

I am writing this conversion tool in Java, and right now I am using tidy.exe for the conversion. But tidy only produces output when its input is already clean HTML, so I want to remove the undesired tags first to produce clean HTML.

Any suggestions would be greatly appreciated!

Thanks,
praveen

I’m just taking a 5-minute stab at this, so it might not be the best solution.

  1. Create an XSD with the structure that you want to capture from the HTML. Most likely it will look similar to http://www.w3.org/2002/08/xhtml/xhtml1-strict.xsd, but you will omit the unwanted HTML tags.

  2. Use JAXB to generate the Java classes from the XSD.

  3. Convert your HTML into XML. Just add closing tags to img, input, etc., the special tags that don’t have a closing tag.

  4. Load the XML into JAXB. Only the tags you specified in the XSD will be captured into the Java beans.

  5. Done!
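
For step 3, here is a rough sketch of what "adding closing tags" could look like in plain Java. The class name VoidTagCloser, the method closeVoidTags, and the short list of tags are all made up for illustration, and a regex like this is only an approximation: it will trip over a ">" inside an attribute value, and it does not touch comments or script blocks.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class VoidTagCloser {
    // Tags that HTML allows without a closing tag. This list is an
    // illustrative subset (an assumption), extend it as needed.
    private static final Pattern VOID_TAG = Pattern.compile(
        "<(img|input|br|hr|meta|link)\\b([^>]*?)\\s*/?>",
        Pattern.CASE_INSENSITIVE);

    // Rewrites each matching tag as a self-closing tag, e.g. <br> becomes <br />.
    // Tags that are already self-closing are normalized to the same form.
    public static String closeVoidTags(String html) {
        Matcher m = VOID_TAG.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement = "<" + m.group(1) + m.group(2) + " />";
            // quoteReplacement keeps any '$' or '\' in attributes literal
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // prints <p>Hi<br /><img src="a.png" /></p>
        System.out.println(closeVoidTags("<p>Hi<br><img src=\"a.png\"></p>"));
    }
}
```

This only covers the unclosed-tag part of step 3. The result is well-formed XML only if the rest of the markup is already properly nested and quoted.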

I am not much into Java, so I did not get what you are trying to say.

In simple words:
I need to convert HTML files to XML. Tidy does the parsing, provided the HTML is already clean, but my HTML files are sometimes not clean. I need a way to clean up the HTML before sending it to tidy.