  1. #1
    SitePoint Addict mcrumlish's Avatar
    Join Date
    Jan 2002
    Posts
    384

    help needed removing MS Word HTML from a file

    Hi,

    I have a problem converting Word docs to HTML. As you probably know, when Word generates its HTML it adds a lot of needless tags. Is there a way I can cut these out and leave just the basic formatting tags such as <b>, <br>, <ol>, <li>, etc.?

    I found this code to remove Word HTML

    PHP Code:
    $search = array ("'<script[^>]*?>.*?</script>'si",  // Strip out javascript
                     "'<[\/\!]*?[^<>]*?>'si",           // Strip out html tags
                     "'([\r\n])[\s]+'",                 // Strip out white space
                     "'&(quot|#34);'i",                 // Replace html entities
                     "'&(amp|#38);'i",
                     "'&(lt|#60);'i",
                     "'&(gt|#62);'i",
                     "'&(nbsp|#160);'i",
                     "'&(iexcl|#161);'i",
                     "'&(cent|#162);'i",
                     "'&(pound|#163);'i",
                     "'&(copy|#169);'i",
                     "'&#(\d+);'e");                    // evaluate as php (the /e modifier was removed in PHP 7)

    $replace = array ("",
                      "",
                      "\\1",
                      "\"",
                      "&",
                      "<",
                      ">",
                      chr(160),
                      chr(161),
                      chr(162),
                      chr(163),
                      chr(169),
                      "chr(\\1)");

    $content = preg_replace ($search, $replace, $content);

    I found this on another site as a solution for removing Word HTML, but the problem is that it removes all of the HTML, leaving the file as a blob of text with no line breaks or anything. I am afraid I don't fully understand the code above, so I was hoping someone on here could help me out.

    Basically, I need help modifying the code above to remove all the crap but still leave certain tags.

    Thanks in advance,
    Martin
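
    One way to keep only a handful of formatting tags, sketched here rather than taken from the thread, is PHP's built-in strip_tags() with its allowed-tags argument. The helper name clean_word_html() and the default allowlist are assumptions for illustration. strip_tags() keeps the text inside the tags it removes, so <style>, <script> and <xml> blocks are deleted first, and the last regex assumes that attribute values never contain a ">" character.

```php
<?php
// Hypothetical helper: keep a short allowlist of tags and drop everything else.
function clean_word_html(string $html, string $allowed = '<b><i><p><br><ol><ul><li>'): string
{
    // Remove blocks whose content should disappear too (strip_tags keeps inner text).
    $html = preg_replace('#<(style|script|xml)\b[^>]*>.*?</\1>#is', '', $html);
    // Remove HTML comments, including Word's conditional comments.
    $html = preg_replace('/<!--.*?-->/s', '', $html);
    // Drop every tag that is not on the allowlist (e.g. <span>, <o:p>, <div>).
    $html = strip_tags($html, $allowed);
    // Strip attributes (class, style, lang, ...) from the opening tags that remain.
    $html = preg_replace('/<([a-z][a-z0-9]*)\b[^>]*>/i', '<$1>', $html);
    return trim($html);
}
```

    Calling clean_word_html($content) on a Word paragraph such as <p class=MsoNormal><b>Hi</b> <span lang=EN-GB>there</span></p> should leave the <p> and <b> tags in place without their attributes, and drop the <span>.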

  2. #2
    Sidewalking anode's Avatar
    Join Date
    Mar 2001
    Location
    Philadelphia, US
    Posts
    2,205
    TuitionFree a free library for the self-taught
    Anode Says... Blogging For Your Pleasure

  3. #3
    SitePoint Addict mcrumlish's Avatar
    Join Date
    Jan 2002
    Posts
    384
    That code didn't work for some reason... it didn't output anything once I entered the file...

  4. #4
    SitePoint Addict mcrumlish's Avatar
    Join Date
    Jan 2002
    Posts
    384
    Actually... I made a mistake... it is working now. I will try to tweak it so that it converts what is entered in a textarea.

    The problem I outlined occurs in a WYSIWYG HTML editor I am using to replace a normal textarea tag. The higher-ups at work want to keep the formatting in the Word docs but add them to a new intranet web app. I need to remove the crap from the file. The problem is that when a file is copied and pasted from Word into the WYSIWYG editor, it keeps all the MS Word HTML. I will just need to tweak that file so that it takes the content of the textarea as its source instead of the external file...

    I will post back if I have any problems.

  5. #5
    SitePoint Addict mcrumlish's Avatar
    Join Date
    Jan 2002
    Posts
    384
    Hi, I have a textarea on page1.php called text, which passes its value to the next page as $text. The form for the textarea submits to clean.php, which contains this code (slightly modified from the code at the above link):

    PHP Code:
    <?php

    // normalize white space
    $text = eregi_replace("[[:space:]]+", " ", $text);
    $text = str_replace("> <", ">\r\r<", $text);
    $text = str_replace("<br>", "<br>\r", $text);

    // remove everything before <body>
    $text = strstr($text, "<body");

    // keep tags, strip attributes
    $text = ereg_replace("<p [^>]*BodyTextIndent[^>]*>([^\n|\n\015|\015\n]*)</p>", "<p>\\1</p>", $text);
    $text = eregi_replace("<p [^>]*margin-left[^>]*>([^\n|\n\015|\015\n]*)</p>", "<blockquote>\\1</blockquote>", $text);
    $text = str_replace("&nbsp;", "", $text);

    // clean up whatever is left inside <p> and <li>
    $text = eregi_replace("<p [^>]*>", "<p>", $text);
    $text = eregi_replace("<li [^>]*>", "<li>", $text);

    // kill unwanted tags
    $text = eregi_replace("</?span[^>]*>", "", $text);
    $text = eregi_replace("</?body[^>]*>", "", $text);
    $text = eregi_replace("</?div[^>]*>", "", $text);
    $text = eregi_replace("<\![^>]*>", "", $text);
    $text = eregi_replace("</?[a-z]\:[^>]*>", "", $text);

    // kill style and on mouse* tags
    $text = eregi_replace("([ \f\r\t\n\'\"])style=[^>]+", "\\1", $text);
    $text = eregi_replace("([ \f\r\t\n\'\"])on[a-z]+=[^>]+", "\\1", $text);

    // remove empty paragraphs
    $text = str_replace("<p></p>", "", $text);

    // remove closing </html>
    $text = str_replace("</html>", "", $text);

    // clean up white space again
    $text = eregi_replace("[[:space:]]+", " ", $text);
    $text = str_replace("> <", ">\r\r<", $text);
    $text = str_replace("<br>", "<br>\r", $text);

    echo $text."<p>";
    print "html<br><textarea name=\"code\" rows=\"25\" cols=\"50\">$text</textarea>";

    ?>
    This isn't displaying anything for the variable $text, any ideas?

    Thanks,
    Martin
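
    A possible cause, offered as a guess rather than something the thread confirms: strstr() returns false when the needle is not found, and HTML pasted into an editor and submitted from a textarea often has no <body> tag at all. In that case the "remove everything before <body>" line would leave $text empty. A minimal sketch of a guarded version follows; the helper name from_body() is hypothetical.

```php
<?php
// Hypothetical helper: drop everything before <body>, but only if a <body> tag exists.
function from_body(string $text): string
{
    // stripos() is case-insensitive and returns false when there is no match.
    $pos = stripos($text, '<body');
    // With no <body> tag, return the input unchanged instead of an empty value.
    return $pos === false ? $text : substr($text, $pos);
}
```

    With this in place, $text = from_body($text); would behave like the strstr() line for a full Word document and leave a pasted fragment untouched.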

  6. #6
    SitePoint Guru marcel's Avatar
    Join Date
    Nov 2000
    Posts
    920
    There is a class on phpclasses.org which takes care of this.

    I'm about to run out the door so I wasn't able to find the exact URL...

  7. #7
    SitePoint Addict mcrumlish's Avatar
    Join Date
    Jan 2002
    Posts
    384
    I found that but I couldn't get it to work correctly.

    The code above is the closest to the solution I want... I just need to overcome the small problem outlined above...

  8. #8
    ********* Wizard silver trophy Cam's Avatar
    Join Date
    Aug 2002
    Location
    Burpengary, Australia
    Posts
    4,495
    Wouldn't it pass it to the next page as $_POST['text'] unless you have register_globals ON?
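
    For reference, reading the field without relying on register_globals could look like the sketch below. The helper name form_field() is hypothetical; the field name text comes from the form described earlier in the thread.

```php
<?php
// Hypothetical helper: fetch a submitted form field without register_globals.
function form_field(array $post, string $name): string
{
    // Return an empty string when the field is missing or is not a plain string.
    return isset($post[$name]) && is_string($post[$name]) ? $post[$name] : '';
}

// In clean.php this replaces the implicit $text created by register_globals.
$text = form_field($_POST, 'text');
```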

  9. #9
    SitePoint Addict mcrumlish's Avatar
    Join Date
    Jan 2002
    Posts
    384
    register_globals is set to on on the server this is for, as it's an intranet application.

    If I echo $test before all the str_replace, eregi_replace etc. stuff, it outputs the value of $test, so it is being passed OK...

  10. #10
    ********* Wizard silver trophy Cam's Avatar
    Join Date
    Aug 2002
    Location
    Burpengary, Australia
    Posts
    4,495
    Don't mean to sound rude, but your code says $text and you're saying $test, so make up your mind!

  11. #11
    SitePoint Addict mcrumlish's Avatar
    Join Date
    Jan 2002
    Posts
    384
    Sorry... typo... I meant to type $text.

  12. #12
    SitePoint Member mwint's Avatar
    Join Date
    Jun 2003
    Location
    Manchester
    Posts
    6
    If all you want to do is get rid of MS Word tags and you've got a copy of Dreamweaver 4, go to COMMANDS->Clean Up Word HTML.
    Guaranteed to clean out most of the rubbish.

  13. #13
    SitePoint Addict mcrumlish's Avatar
    Join Date
    Jan 2002
    Posts
    384
    I need to do it dynamically, on code that is pasted into a WYSIWYG editor that replaces a textarea and is then inserted into a database.

