PHP class/function to convert HTML to plain text

Just wondering if anyone knows of a PHP class or function that takes the HTML code of a web page and converts it into a plain-text copy?

Thanks for your help.

Do you mean stripping the HTML tags? Otherwise it's already plain text…

Yes, I want to strip the HTML tags and end up with plainly formatted text. For example, h1 text should be converted to:

title

That is a bit more than just stripping the tags; you are asking to parse the tags and render each one differently, which is much more complicated.

You may want to consider using a text-based browser (lynx, for example), and see if you can run it via exec() or system(), sending the output to a file.
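As a rough sketch of the lynx idea, assuming lynx is installed on the server and that shell functions are not disabled in your PHP configuration. The function names lynxDumpCommand() and htmlPageToText() are made up for illustration; -dump and -nolist are standard lynx options.

```php
<?php
// Build the shell command that asks lynx for a plain-text rendering of $url.
// -dump prints the rendered page to stdout; -nolist leaves out the list of links.
function lynxDumpCommand(string $url): string {
    return 'lynx -dump -nolist ' . escapeshellarg($url);
}

// Run lynx and return the rendered text, or null if nothing came back.
function htmlPageToText(string $url): ?string {
    $output = shell_exec(lynxDumpCommand($url));
    return is_string($output) ? $output : null;
}
```

You could then save the result with file_put_contents('page.txt', htmlPageToText($url)). Only pass URLs you trust, since the value ends up on a command line.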

This may not be as simple as you imagine, because most web pages contain a lot of other data that has nothing to do with what you might call the “main content” (menus, navigation, adverts, tracking codes, sidebars etc).

Perhaps you have a limited set of target pages, in which case you can reasonably pick out the enclosing tags, something like <div id="main_story">(content you want is in here)</div>. You can then extract that element by accessing the DOM, and only then start applying strip_tags() etc.
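A minimal sketch of that DOM approach, assuming the wrapper is identified by an id attribute such as main_story (that attribute name and the function name extractTextById() are assumptions for illustration, adjust them to your pages). It uses PHP's built-in DOMDocument.

```php
<?php
// Return the text inside the first element whose id attribute equals $id,
// or null if there is no such element.
function extractTextById(string $html, string $id): ?string {
    if ($html === '') {
        return null;
    }

    $doc = new DOMDocument();

    // Real-world HTML is rarely valid; keep libxml warnings out of the output.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($doc->getElementsByTagName('*') as $node) {
        if ($node->getAttribute('id') === $id) {
            // textContent is the element's text with all markup removed.
            return trim($node->textContent);
        }
    }

    return null;
}
```

Note that loadHTML() does not assume UTF-8 unless the page declares its charset, so pages with non-ASCII text may need extra handling.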

Take a look at this perhaps: http://htmlpurifier.org

It depends what you want to do, but you’ll need rules for each tag you want to render in a specific way.

This would be a start though:



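function cleanHTML($html) {

	// Add an underline after each heading's text before the tags are stripped.
	$html = str_replace('</h1>', "\r\n==================</h1>", $html);
	$html = str_replace('</h2>', "\r\n------------------</h2>", $html);

	// Remove all remaining tags, leaving only the text.
	return strip_tags($html);
}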
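If you want the underline to match the length of each heading, one possible variation is sketched below. It is only a sketch: the function name htmlToText() and the rule table are made up for illustration, and a regular expression like this will not cope with nested or badly broken markup.

```php
<?php
function htmlToText(string $html): string {
    // One underline character per heading tag; add more rules as needed.
    $underlines = ['h1' => '=', 'h2' => '-'];

    foreach ($underlines as $tag => $char) {
        // Match the whole heading element, including any attributes.
        $pattern = '~<' . $tag . '\b[^>]*>(.*?)</' . $tag . '>~is';
        $html = preg_replace_callback($pattern, function (array $m) use ($char) {
            $title = trim(strip_tags($m[1]));
            // strlen() counts bytes; mb_strlen() may suit multibyte titles better.
            return "\n" . $title . "\n" . str_repeat($char, strlen($title)) . "\n";
        }, $html);
    }

    // Strip whatever tags are left.
    return trim(strip_tags($html));
}
```

For example, htmlToText('<h1>Title</h1><p>Body</p>') gives "Title", a line of five "=" characters, then "Body".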


Great, thanks! This is what I’m looking for :)