What I need to do is in WordPress, but all the WordPress-related parts work as expected, so my issue is PHP related.
Basically, when you post to WordPress from Word (using XML-RPC), Word inserts <span> tags with font-family (usually Times New Roman) and font-size (in px) for each paragraph. I have written a function/plugin to intercept the content before it is saved to the database and remove these.
The content variable going into this is (shortened; line breaks are exactly where they are here):
<p><span style=\\"font-family:Times New Roman; font-size:12pt\\">16/08/2009
</span></p><p><span style=\\"font-family:Times New Roman; font-size:12pt\\"><strong>Votum:</strong> Ps.121:1
</span></p>
It does not currently work (the current code works when I re-save the post/page using WordPress, but not directly from the xml-rpc request).
I have also decided to modify it so that it first strips out font-family and font-size from the style attribute; then, if the style attribute is empty, removes it; and if no other attributes are left, removes the <span> as well as its closing tag, leaving any contained text/HTML in place.
I understand that this will require parsing it as an HTML DOM (something I have not done before). How would I go about doing that (or is there a script I can just incorporate)?
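The first step described above (dropping font-family and font-size from a style value) could be sketched roughly as follows. The function name strip_style_properties is hypothetical, and the sketch assumes the declarations are separated by plain semicolons with no semicolons inside the values.

```php
<?php
/**
 * Hypothetical helper: drop the named CSS properties from an inline style string.
 * Returns the remaining declarations, or '' if none are left.
 */
function strip_style_properties(string $style, array $unwanted = ['font-family', 'font-size']): string
{
    $kept = [];
    // Naive split on ';' (assumption: no semicolons inside the values).
    foreach (explode(';', $style) as $declaration) {
        $declaration = trim($declaration);
        if ($declaration === '') {
            continue;
        }
        // The property name is everything before the first ':'.
        $property = strtolower(trim(explode(':', $declaration, 2)[0]));
        if (!in_array($property, $unwanted, true)) {
            $kept[] = $declaration;
        }
    }
    return implode('; ', $kept);
}
```

An empty return value would then be the signal to remove the style attribute altogether, for example strip_style_properties('font-family:Times New Roman; font-size:12pt') returns '' while a value that also contains 'color:red' keeps just that declaration.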
If I get this to work, I will publish it as a simple-to-use WordPress plugin.
/**
 * Removes an attribute from the listed HTML tags (currently only <span>)
 * @param string $attr the attribute
 * @param string $input the html
 * @return string the html with the attribute removed
 */
function remove_html_attribute($attr, $input){
    //return preg_replace('/\\s*'.$attr.'\\s*=\\s*(["\\']).*?\\1/', '', $input);
    $result='';
    if(!empty($input)){
        //check if the input text contains tags
        if($input!=strip_tags($input)){
            $dom = new DOMDocument();
            //use mb_convert_encoding to prevent non-ASCII characters from randomly appearing in text
            $dom->loadHTML(mb_convert_encoding($input, 'HTML-ENTITIES', 'UTF-8'));
            $domElement = $dom->documentElement;
            $taglist = array('span'); //tags to check for specified tag attribute
            foreach($taglist as $target_tag){
                $tags = $domElement->getElementsByTagName($target_tag);
                foreach($tags as $tag){
                    $tag->removeAttribute($attr);
                }
            }
            //$result = $dom->saveHTML();
            $result = innerHTML( $domElement->firstChild ); //strip doctype/html/body tags
        }
        else{
            //no tags in the input, so there is nothing to remove
            $result=$input;
        }
    }
    return $result;
}
/**
 * removes the doctype/html/body tags
 */
function innerHTML($node){
    $doc = new DOMDocument();
    foreach ($node->childNodes as $child) {
        $doc->appendChild($doc->importNode($child, true));
    }
    return $doc->saveHTML();
}
However, this will still leave the <span> tags behind.
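One way the leftover tags might be handled is to replace every <span> that has no attributes left with its own child nodes. The following is only a sketch under some assumptions: the function name unwrap_bare_spans is hypothetical, the input is assumed to be UTF-8, and the wrapper <div> and the XML prolog are just a common workaround for getting a fragment in and out of DOMDocument.

```php
<?php
/**
 * Hypothetical helper: replace every <span> that has no attributes with its
 * own child nodes, so the contained text/markup stays in place.
 */
function unwrap_bare_spans(string $html): string
{
    $dom = new DOMDocument();
    // Silence warnings about imperfect markup and restore the setting afterwards.
    $previous = libxml_use_internal_errors(true);
    // The prolog hints UTF-8 to the parser; the <div> lets us pull the fragment back out.
    $dom->loadHTML('<?xml encoding="UTF-8"><div>' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // The node list is live, so copy it before changing the tree.
    $spans = iterator_to_array($dom->getElementsByTagName('span'));
    foreach ($spans as $span) {
        if ($span->hasAttributes()) {
            continue; // still carries attributes, leave it alone
        }
        // Move the children in front of the span, then drop the empty span.
        while ($span->firstChild !== null) {
            $span->parentNode->insertBefore($span->firstChild, $span);
        }
        $span->parentNode->removeChild($span);
    }

    // The first <div> in document order is the wrapper added above.
    $wrapper = $dom->getElementsByTagName('div')->item(0);
    if ($wrapper === null) {
        return $html;
    }
    $out = '';
    foreach ($wrapper->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

echo unwrap_bare_spans('<p><span>16/08/2009</span></p>'); // <p>16/08/2009</p>
```

This would run after the attributes have been removed, so spans that still carry something else (a class, for instance) are left untouched.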
Force Flow, this is specific to the XML-RPC posting.
The person who will load all the new posts has very little time (like most of us). He also writes everything in Word (for printed handouts). Knowing this, I looked at options for doing this a lot quicker and found that MS Word can publish directly to WordPress using the XML-RPC protocol (basically, it sends an XML file with all the content and information, which is then used to create the post). Using this method, he does not have to leave Word, log in, create a new post, paste in the content (using the paste-from-Word option), re-format all the content (these are speeches of about an hour, twice a week) and then save. Publishing via XML-RPC takes only a moment, and he can then continue with whatever he needs to do.
Since posting, I have continued working on this, and the new preg_replace expression is
This appears to work (except that if the span contains anything else, it will not work, but there should be few enough of those to remove manually).
I will also take a look at the example you posted and see how I can make it remove the span if it does not contain any other attributes (your method is less error-prone than my regex).