HTML parsing to remove MS Word formatting from an XML-RPC request

Hi All

What I need to do is on WordPress, but everything WordPress-related works as expected, so my issue is a PHP one.
Basically, when you post to WordPress from Word (using XML-RPC), Word inserts <span> tags with font-family (usually Times New Roman) and font-size (in pt) for each paragraph. I have written a function/plugin to intercept the content before it is saved to the database and remove these tags.

The current code that does this is as follows:

$content = preg_replace('/<span\sstyle="font-family\s?:\s?([^;]*)\s?;\s?font-size\s?:\s?([^;]*);\s?">(.*?)<\/span>/is','$3',stripslashes($content));

The content variable going into this is (shortened; the line breaks are exactly where they appear here):


<p><span style=\"font-family:Times New Roman; font-size:12pt\">16/08/2009
</span></p><p><span style=\"font-family:Times New Roman; font-size:12pt\"><strong>Votum:</strong> Ps.121:1
</span></p>

It does not currently work: the code works when I re-save the post/page in WordPress, but not on content coming directly from the XML-RPC request.
I am also considering changing the approach so that it first strips font-family and font-size from the style attribute; then, if the style attribute is empty, removes it; and finally, if no other attributes are left, removes the <span> as well as its closing tag, leaving any contained text/HTML in place.
I understand that this will require parsing the content as an HTML DOM (something I have not done before). How would I go about doing it (or is there a script I can just incorporate)?

If I get this to work, I will publish it as a simple-to-use WordPress plugin.

The website runs on PHP 5.4.

All help will be greatly appreciated.

Regards
Jacotheron

Try using CTRL+SHIFT+V when pasting content into the editing window of a post/page. That should strip out the formatting.

As for stripping the attribute after the fact, try this:

$content = remove_html_attribute('style', $content);

/**
 * Removes an attribute from every <span> tag in an html string
 * @param string $attr  the attribute to remove
 * @param string $input the html
 * @return string the html without the attribute
 */
function remove_html_attribute($attr, $input){
    //regex alternative:
    //return preg_replace('/\s*'.$attr.'\s*=\s*(["\']).*?\1/', '', $input);

    $result='';

    if(!empty($input)){
        //check if the input text contains tags
        if($input!=strip_tags($input)){
            $dom = new DOMDocument();

            //use mb_convert_encoding to prevent non-ASCII characters from randomly appearing in text
            $dom->loadHTML(mb_convert_encoding($input, 'HTML-ENTITIES', 'UTF-8'));

            $domElement = $dom->documentElement;

            $taglist = array('span'); //tags to check for specified tag attribute

            foreach($taglist as $target_tag){
                $tags = $domElement->getElementsByTagName($target_tag);

                foreach($tags as $tag){
                    $tag->removeAttribute($attr);
                }
            }

            //$result =  $dom->saveHTML();
            $result = innerHTML( $domElement->firstChild ); //strip doctype/html/body tags
        }
        else{
            $result=$input;
        }
    }

    return $result;
}


/**
 * returns the markup of a node's children, which drops the doctype/html/body tags
 */
function innerHTML($node){
    $doc = new DOMDocument();
    foreach ($node->childNodes as $child) {
        $doc->appendChild($doc->importNode($child, true));
    }

    return $doc->saveHTML();
}

However, this will still leave the <span> tags behind.
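
If the bare <span> tags should go as well, the steps described in the question (strip font-family and font-size from the style, remove the style attribute if it ends up empty, then unwrap any span with no attributes left) could be sketched roughly as follows. This is only a minimal sketch and has not been run against real Word output: the function name strip_word_spans is made up for illustration, it assumes the content is UTF-8 and has already been through stripslashes(), and it splits the style value on ';' naively, so a value that itself contains a semicolon would be mangled.

```php
<?php
/**
 * Hypothetical helper (not from this thread): strips font-family and font-size
 * from <span style="...">, drops a style attribute that ends up empty, and
 * unwraps spans that have no attributes left, keeping their contents.
 */
function strip_word_spans($html)
{
    if ($html === '' || $html === strip_tags($html)) {
        return $html; // no tags, nothing to parse
    }

    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // keep parser warnings quiet
    // Assumption: the content is UTF-8; the meta tag tells libxml so.
    $dom->loadHTML(
        '<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head>'
        . '<body>' . $html . '</body></html>'
    );
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $dom->getElementsByTagName('body')->item(0);

    // Copy the live node list into an array, because unwrapping changes the list.
    $spans = array();
    foreach ($body->getElementsByTagName('span') as $span) {
        $spans[] = $span;
    }

    foreach ($spans as $span) {
        if ($span->hasAttribute('style')) {
            $kept = array();
            foreach (explode(';', $span->getAttribute('style')) as $declaration) {
                $parts = explode(':', $declaration, 2);
                $property = strtolower(trim($parts[0]));
                if ($property === '' || $property === 'font-family' || $property === 'font-size') {
                    continue; // skip empty pieces and the two unwanted properties
                }
                $kept[] = trim($declaration);
            }
            if ($kept) {
                $span->setAttribute('style', implode('; ', $kept));
            } else {
                $span->removeAttribute('style');
            }
        }

        if ($span->attributes->length === 0) {
            // Move the children in front of the span, then remove the empty span.
            while ($span->firstChild) {
                $span->parentNode->insertBefore($span->firstChild, $span);
            }
            $span->parentNode->removeChild($span);
        }
    }

    // Serialize only the children of <body>, so no doctype/html/body is returned.
    $result = '';
    foreach ($body->childNodes as $child) {
        $result .= $dom->saveHTML($child);
    }
    return $result;
}
```

Something like $content = strip_word_spans(stripslashes($content)); would then take the place of the preg_replace call. Spans that still carry another attribute or another style declaration are kept, only without the two font properties.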

Force Flow, this is specific to the XML-RPC posting.

The person who will load all the new posts has very little time (like most of us). He also writes everything in Word (for printed handouts). Knowing this, I looked for quicker options and found that MS Word can publish directly to WordPress using the XML-RPC protocol (basically, it sends an XML file with all the content and information, which is then used to create the post). With this method he does not have to leave Word, log in, create a new post, paste in the content (using the paste-from-Word option), re-format all the content (these are speeches of about an hour, twice a week) and then save. Publishing over XML-RPC takes only a moment, and he can then continue with whatever he needs to do.

Since posting, I have continued working on it, and the new preg_replace expression is:

/<span\s?style="font-family\s?:\s?([^;]*);\sfont-size\s?:\s?([^"]*)">(.*?)<\/span>/ims

This appears to work (except that if the span carries anything else, it will not match, but there should be few enough of those to remove manually).
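
As a quick check, the expression can be run against the shortened sample from the first post. The snippet below is only a sketch: the variable names and the extra class="intro" span are made up for illustration, the latter to show a span that the pattern leaves alone.

```php
<?php
// Sample from the first post as it arrives, with backslash-escaped quotes.
$sample = '<p><span style=\"font-family:Times New Roman; font-size:12pt\">16/08/2009' . "\n"
    . '</span></p><p><span style=\"font-family:Times New Roman; font-size:12pt\"><strong>Votum:</strong> Ps.121:1' . "\n"
    . '</span></p>';

$pattern = '/<span\s?style="font-family\s?:\s?([^;]*);\sfont-size\s?:\s?([^"]*)">(.*?)<\/span>/ims';

// stripslashes() turns \" back into " before the pattern is applied;
// '$3' keeps only what was between the opening and closing span tags.
$cleaned = preg_replace($pattern, '$3', stripslashes($sample));
echo $cleaned, "\n";

// Made-up case: an attribute in front of style means the pattern no longer matches,
// so the span comes back unchanged.
$other = '<span class="intro" style="font-family:Arial; font-size:12pt">Text</span>';
$untouched = preg_replace($pattern, '$3', $other);
echo $untouched, "\n";
```

The first call leaves only the <p> and <strong> markup around the text; the second returns its input as it was, which is the kind of span that would still need manual clean-up.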

I will also take a look at the example you posted and see how I can make it remove the span if it does not contain any other attributes (your method is less error-prone than my regex).

Thank you
Regards
Jacotheron