  1. #1
    SitePoint Enthusiast
    Join Date
    Oct 2008
    Location
    Pretoria, South Africa
    Posts
    63
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)

    Question HTML Parsing to remove MS Word formatting from xml-rpc request

    Hi All

    What I need to do is in WordPress, but everything WordPress-related works as expected, so my issue is a PHP one.
    Basically, when you post to WordPress from Word (using XML-RPC), Word inserts <span> tags with font-family (usually Times New Roman) and font-size (in pt) for each paragraph. I have written a function/plugin to intercept the content before it is saved to the database and remove these tags.

    The current code that does this is as follows:
    PHP Code:
    $content = preg_replace('/<span\sstyle="font-family\s?:\s?([^;]*)\s?;\s?font-size\s?:\s?([^;]*);\s?">(.*?)<\/span>/is','$3',stripslashes($content));
    The content variable going into this is (shortened; the line breaks are exactly as shown here):
    Code HTML4Strict:
    <p><span style=\"font-family:Times New Roman; font-size:12pt\">16/08/2009
    </span></p><p><span style=\"font-family:Times New Roman; font-size:12pt\"><strong>Votum:</strong> Ps.121:1
    </span></p>

    It does not currently work (the current code works when I re-save the post/page using WordPress, but not directly from the xml-rpc request).
    I have also decided to maybe modify it so that it will first strip out
    Code:
    font-family
    and
    Code:
    font-size
    from a style attribute; then, if the style attribute is empty, remove it; and if no other attributes are left, remove the <span> as well as its closing tag, leaving any contained text/HTML in place.
    I understand that this will require parsing it as an HTML DOM (something I have not done before). How would I go about doing that (or is there a script I can just incorporate)?
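
    The first step described above (dropping font-family and font-size from a style value) can be sketched without any DOM work. This is only a minimal sketch: the function name strip_font_declarations is hypothetical, and it assumes the style value contains no ';' inside a declaration, which holds for the simple styles in the sample.

```php
<?php
/**
 * Hypothetical helper (not from the thread): remove the font-family and
 * font-size declarations from the value of a style attribute and return
 * whatever declarations are left, joined with "; ".
 */
function strip_font_declarations($style)
{
    $kept = array();

    // Assumption: declarations are separated by ';' and no value contains ';'.
    foreach (explode(';', $style) as $declaration) {
        $declaration = trim($declaration);
        if ($declaration === '') {
            continue; // skip empty pieces, e.g. after a trailing ';'
        }

        // The property name is everything before the first ':'.
        $parts = explode(':', $declaration, 2);
        $property = strtolower(trim($parts[0]));

        if ($property === 'font-family' || $property === 'font-size') {
            continue; // these are the two declarations Word adds
        }

        $kept[] = $declaration;
    }

    return implode('; ', $kept);
}
```

    An empty return value means the style attribute can be removed entirely; deciding whether the <span> itself can go then depends on whether it has any other attributes.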

    If I get this to work, I will publish it as a simple-to-use WordPress plugin.

    The website runs on PHP 5.4.

    All help will be greatly appreciated.

    Regards
    Jacotheron

  2. #2
    Barefoot on the Moon! silver trophy Force Flow's Avatar
    Join Date
    Jul 2003
    Location
    Northeastern USA
    Posts
    4,603
    Mentioned
    56 Post(s)
    Tagged
    1 Thread(s)
    Try using CTRL+SHIFT+V when pasting content into the editing window of a post/page. That should strip out the formatting.

    As for stripping the attribute after the fact, try this:

    Code:
    $content = remove_html_attribute('style', $content);
    Code:
    /**
     * Remove an attribute from HTML tags.
     * @param string $attr the attribute name
     * @param string $input the html
     * @return string the html with the attribute removed
     */
    function remove_html_attribute($attr, $input){
        //return preg_replace('/\s*'.$attr.'\s*=\s*(["\']).*?\1/', '', $input);

        $result='';

        if(!empty($input)){

            //check if the input text contains tags
            if($input!=strip_tags($input)){
                $dom = new DOMDocument();

                //use mb_convert_encoding to prevent non-ASCII characters from randomly appearing in text
                $dom->loadHTML(mb_convert_encoding($input, 'HTML-ENTITIES', 'UTF-8'));

                $domElement = $dom->documentElement;

                $taglist = array('span'); //tags to check for specified tag attribute

                foreach($taglist as $target_tag){
                    $tags = $domElement->getElementsByTagName($target_tag);

                    foreach($tags as $tag){
                        $tag->removeAttribute($attr);
                    }
                }

                //$result =  $dom->saveHTML();
                $result = innerHTML( $domElement->firstChild ); //strip doctype/html/body tags
            }
            else{
                $result=$input;
            }
        }

        return $result;
    }

    /**
     * Serializes the children of $node, which drops the doctype/html/body wrapper
     */
    function innerHTML($node){
      $doc = new DOMDocument();
      foreach ($node->childNodes as $child)
        $doc->appendChild($doc->importNode($child, true));

      return $doc->saveHTML();
    }
    However, this will still leave the <span> tags behind.
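
    One possible way to deal with the leftover <span> tags is to replace each attribute-less span with its own children. The following is a rough sketch rather than a tested plugin: unwrap_bare_spans is a hypothetical name, and it assumes the markup has already been loaded into a DOMDocument as in the function above.

```php
<?php
/**
 * Hypothetical helper (not from the thread): replace every <span> that
 * has no attributes with its own child nodes, so the contained text and
 * HTML stay in place while the empty wrapper disappears.
 */
function unwrap_bare_spans(DOMDocument $dom)
{
    // getElementsByTagName() returns a live list; copy it into an array
    // first, because changing the tree while iterating it skips nodes.
    $spans = array();
    foreach ($dom->getElementsByTagName('span') as $span) {
        $spans[] = $span;
    }

    foreach ($spans as $span) {
        if ($span->hasAttributes()) {
            continue; // still carries class, id, style, etc., so keep it
        }

        // Move each child in front of the span, then drop the empty span.
        while ($span->firstChild !== null) {
            $span->parentNode->insertBefore($span->firstChild, $span);
        }
        $span->parentNode->removeChild($span);
    }
}
```

    It would be called after the removeAttribute() loop and before the result is serialized, so that spans whose only attribute was style are removed and spans with other attributes are kept.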

  3. #3
    SitePoint Enthusiast
    Join Date
    Oct 2008
    Location
    Pretoria, South Africa
    Posts
    63
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Force Flow, this is specific to the XML-RPC posting.

    The person who will load all the new posts has very little time (like most of us). He also writes everything in Word (for printed handouts). Knowing this, I looked at options to make this a lot quicker and found that MS Word can publish directly to WordPress using the XML-RPC protocol (basically, it sends an XML file with all the content and information, which is then used to create the post). Using this method, he does not have to leave Word, log in, create a new post, paste in the content (using the paste-from-Word option), re-format all the content (these are speeches of about an hour, twice a week) and then save. Publishing via XML-RPC takes only a moment, and he can then continue with whatever he needs to do.

    Since posting, I have continued working on it, and the new preg_replace expression is
    Code:
    /<span\s?style="font-family\s?:\s?([^;]*);\sfont-size\s?:\s?([^"]*)">(.*?)<\/span>/ims
    This appears to work (except that it will not match if the span contains anything else, but there should be few enough of those to remove manually).

    I will also take a look at the example you posted and see how I can make it remove the span if it has no other attributes (your method is less error-prone than my regex).
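
    To illustrate both the match and the limitation, here is a small sketch that runs the expression above against a sample shaped like the one in the first post. The wrapper function strip_word_spans and the second sample (with a class attribute) are made up for illustration.

```php
<?php
// Pattern quoted from the post above.
$pattern = '/<span\s?style="font-family\s?:\s?([^;]*);\sfont-size\s?:\s?([^"]*)">(.*?)<\/span>/ims';

/**
 * Hypothetical wrapper: unescape the quotes, then replace each matching
 * span with its contents (capture group 3).
 */
function strip_word_spans($content, $pattern)
{
    return preg_replace($pattern, '$3', stripslashes($content));
}

// Sample shaped like the XML-RPC payload: quotes arrive backslash-escaped
// and the closing tag sits on the next line.
$raw = '<p><span style=\\"font-family:Times New Roman; font-size:12pt\\">16/08/2009'
     . "\n" . '</span></p>';

$clean = strip_word_spans($raw, $pattern);
// $clean is now "<p>16/08/2009\n</p>"

// Made-up counter-example: another attribute precedes style, so the
// pattern does not match and the span is left untouched.
$extra = '<span class="x" style="font-family:Arial; font-size:12pt">text</span>';
$unchanged = strip_word_spans($extra, $pattern);
```

    Note that stripslashes() removes every backslash in the content, not only those in front of quotes, which may matter if a post legitimately contains backslashes.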

    Thank you
    Regards
    Jacotheron

