HTML Parsing to remove MS Word formatting from xml-rpc request
This is for a WordPress site, but everything WordPress-related works as expected, so my issue is really a PHP one.
Basically, when you post to WordPress from Word (using xml-rpc), Word wraps each paragraph in a <span> tag with a style attribute that sets font-family (usually Times New Roman) and font-size (in px). I have written a function/plugin that intercepts the content before it is saved to the database and removes these spans.
The current code that does this is as follows:
$content = preg_replace('/<span\sstyle="font-family\s?:\s?([^;]*)\s?;\s?font-size\s?:\s?([^;]*);\s?">(.*?)<\/span>/is','$3',stripslashes($content));
The content variable going into this is (shortened; the line breaks are exactly where they appear here):
<p><span style=\"font-family:Times New Roman; font-size:12pt\">16/08/2009
</span></p><p><span style=\"font-family:Times New Roman; font-size:12pt\"><strong>Votum:</strong> Ps.121:1
This does not currently work: the code works when I re-save the post/page from within WordPress, but not when the content comes directly from the xml-rpc request.
I am also considering changing the approach so that it first strips font-family and font-size out of the style attribute; then, if the style attribute is empty, removes it; and finally, if the <span> has no other attributes left, removes the <span> as well as its closing tag, leaving any contained text/HTML in place.
I understand that this will require parsing the content as an HTML DOM (something I have not done before). How would I go about doing it (or is there a script I can just incorporate)?
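To make the intended behaviour concrete, here is a rough, untested-in-production sketch of the steps described above, using PHP's bundled DOM extension (DOMDocument). The function name strip_word_span_styles and the wrapper <div> are hypothetical names chosen for illustration, and it assumes the content is UTF-8 and has already been passed through stripslashes().

```php
<?php
// Hypothetical helper name; not part of WordPress or PHP itself.
function strip_word_span_styles($html)
{
    $doc = new DOMDocument();

    // Collect libxml warnings internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    // Wrap the fragment so its children can be found again afterwards;
    // the meta tag tells libxml to treat the input as UTF-8 (an assumption).
    $doc->loadHTML(
        '<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head>'
        . '<body><div>' . $html . '</div></body></html>'
    );
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // The wrapper is the first <div> in document order.
    $wrapper = $doc->getElementsByTagName('div')->item(0);

    // Copy the live node list into an array so spans can be removed while looping.
    $spans = array();
    foreach ($doc->getElementsByTagName('span') as $span) {
        $spans[] = $span;
    }

    foreach ($spans as $span) {
        if ($span->hasAttribute('style')) {
            // Keep every declaration except font-family and font-size.
            $kept = array();
            foreach (explode(';', $span->getAttribute('style')) as $decl) {
                $decl = trim($decl);
                if ($decl === '') {
                    continue;
                }
                $parts = explode(':', $decl, 2);
                $prop = strtolower(trim($parts[0]));
                if ($prop !== 'font-family' && $prop !== 'font-size') {
                    $kept[] = $decl;
                }
            }
            if ($kept) {
                $span->setAttribute('style', implode('; ', $kept));
            } else {
                $span->removeAttribute('style');
            }
        }

        // No attributes left: unwrap the span, keeping its children in place.
        if ($span->attributes->length === 0) {
            while ($span->firstChild) {
                $span->parentNode->insertBefore($span->firstChild, $span);
            }
            $span->parentNode->removeChild($span);
        }
    }

    // Serialise only the wrapper's children, not the surrounding document.
    $out = '';
    foreach ($wrapper->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}
```

With something like this, the existing call would become $content = strip_word_span_styles(stripslashes($content)); and a span that only carried font-family and font-size should disappear, while a span with other styles or attributes would be kept with just those two declarations removed.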
If I get this to work, I will publish it as a simple to use WordPress Plugin.
The website runs on PHP 5.4.
All help will be greatly appreciated.