  1. #1
    SitePoint Zealot
    Join Date
    Dec 2004
    Location
    Canada
    Posts
    162

    stripping garbage from Word

    I am trying to strip all the tags and proprietary garbage from a Word document using PHP. I extract the document's contents into a string, then pass that string to the function below. This gets rid of a lot of it, but unfortunately not all. Can anyone improve on this?

    function all_ascii( $stringIn ) {
        // Windows-1252 "smart" quotes and dashes, and their plain ASCII stand-ins
        $search  = array(chr(145), chr(146), chr(147), chr(148), chr(150), chr(151));
        $replace = array("'", "'", '"', '"', '-', '-');

        // str_replace() accepts arrays, so one call handles every pair in order
        $hold = str_replace($search, $replace, $stringIn);

        // Drop every remaining byte outside the 7-bit ASCII range (128-255)
        return preg_replace('/[\x80-\xFF]/', '', $hold);
    }
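
    A wider replacement table is one possible improvement. The sketch below is an assumption-laden illustration, not a complete answer: cp1252_to_ascii() is a hypothetical name, the input is assumed to be single-byte Windows-1252 text, and the ASCII stand-ins chosen for each byte are a matter of taste.

```php
<?php
// Hypothetical helper: map common Windows-1252 punctuation to ASCII,
// then drop any byte above 127 that the table does not cover.
function cp1252_to_ascii(string $text): string {
    $map = array(
        "\x82" => ',',    // low single quote
        "\x84" => '"',    // low double quote
        "\x85" => '...',  // ellipsis
        "\x8B" => '<',    // single left angle quote
        "\x91" => "'",    // left single quote
        "\x92" => "'",    // right single quote
        "\x93" => '"',    // left double quote
        "\x94" => '"',    // right double quote
        "\x95" => '*',    // bullet
        "\x96" => '-',    // en dash
        "\x97" => '-',    // em dash
        "\x99" => '(TM)', // trade mark sign
        "\x9B" => '>',    // single right angle quote
        "\xA0" => ' ',    // non-breaking space
        "\xA9" => '(c)',  // copyright sign
        "\xAE" => '(R)',  // registered sign
    );
    // strtr() with an array replaces each byte once and never rescans its output.
    $text = strtr($text, $map);
    // Anything still outside 7-bit ASCII is removed, as in the original function.
    return preg_replace('/[\x80-\xFF]/', '', $text);
}

echo cp1252_to_ascii("\x93Hi\x94\x85 it\x92s \x96 ok\xA0\x99"), "\n";
```

    Using strtr() keeps the whole table in one place, so covering another byte is a one-line change.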

  2. #2
    SitePoint Wizard triexa's Avatar
    Join Date
    Dec 2002
    Location
    Canada
    Posts
    2,476
    How many other characters does it use?

    I recently ran into this problem but it looks like you have all the characters handled...?

    You might want to look in PEAR or phpclasses.org, as I suspect there might at least be SOMETHING to work with Word files...

  3. #3
    SitePoint Zealot
    Join Date
    Dec 2004
    Location
    Canada
    Posts
    162
    triexa wrote: "How many other characters does it use?"
    I wish I knew - I am still getting hundreds of question marks in a large Word doc.
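
    One way to find out which characters are involved is to count the bytes above 127 before they are stripped. This is only a sketch: non_ascii_byte_counts() is a hypothetical helper name, and it assumes the extracted string is single-byte (Windows-1252-style) text.

```php
<?php
// Hypothetical helper: tally every byte value above 127 in the input.
function non_ascii_byte_counts(string $text): array {
    // count_chars() mode 1 returns "byte value => occurrences" for bytes that occur.
    $counts = count_chars($text, 1);
    // Keep only the entries whose key (the byte value) is outside 7-bit ASCII.
    return array_filter($counts, function ($byte) {
        return $byte >= 128;
    }, ARRAY_FILTER_USE_KEY);
}

// Report each high byte so it can be added to the search/replace arrays.
foreach (non_ascii_byte_counts("\x93quoted\x94 text\x85\x85") as $byte => $n) {
    printf("chr(%d) appears %d time(s)\n", $byte, $n);
}
```

    Running this on the real document string should show which chr() values still need a replacement.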

  4. #4
    SitePoint Evangelist
    Join Date
    Jun 2006
    Location
    Wigan, Lancashire. UK
    Posts
    523
    uprightdog wrote: "I wish I knew - I am still getting hundreds of question marks in a large Word doc."
    Make sure the page where you display this is served as UTF-8, or write a UTF-8 BOM at the start of the file if you're saving it as text.
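
    A minimal sketch of that suggestion, assuming the extracted bytes are Windows-1252 and that the mbstring extension is loaded. The function names cp1252_to_utf8() and with_utf8_bom() are made up for illustration.

```php
<?php
// Convert single-byte Windows-1252 text to UTF-8 (requires mbstring).
function cp1252_to_utf8(string $bytes): string {
    return mb_convert_encoding($bytes, 'UTF-8', 'Windows-1252');
}

// Prefix the UTF-8 byte order mark unless the string already starts with it.
function with_utf8_bom(string $utf8): string {
    $bom = "\xEF\xBB\xBF";
    return strncmp($utf8, $bom, 3) === 0 ? $utf8 : $bom . $utf8;
}

// For a web page, declare the encoding before any output is sent:
//     header('Content-Type: text/html; charset=utf-8');
// For a text file, write the BOM-prefixed string:
//     file_put_contents('out.txt', with_utf8_bom(cp1252_to_utf8($raw)));
```

    With this approach the curly quotes and dashes are kept as real characters instead of being replaced or dropped.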

  5. #5
    Community Advisor silver trophy

    Join Date
    Nov 2006
    Location
    UK
    Posts
    2,559
    If you have root access on the server, install antiword; it does a great job of converting Word documents to text.
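
    Calling it from PHP might look roughly like this. It is a sketch under several assumptions: antiword is installed and on the PATH, the server runs a Unix-like shell, shell_exec() is not disabled, and the file is in the older binary .doc format. The names antiword_command() and doc_to_text() are hypothetical.

```php
<?php
// Build the shell command; escapeshellarg() protects against odd file names.
function antiword_command(string $docPath): string {
    return 'antiword ' . escapeshellarg($docPath);
}

// Return the extracted text, or null if antiword failed or printed nothing.
function doc_to_text(string $docPath): ?string {
    // Error messages are discarded; assumes a shell that understands 2>/dev/null.
    $output = shell_exec(antiword_command($docPath) . ' 2>/dev/null');
    if ($output === null || $output === false || $output === '') {
        return null;
    }
    return $output;
}

// Usage (the path is a placeholder):
//     $text = doc_to_text('/path/to/report.doc');
```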

  6. #6
    Programming Since 1978 silver trophybronze trophy felgall's Avatar
    Join Date
    Sep 2005
    Location
    Sydney, NSW, Australia
    Posts
    16,871
    The simplest way to convert what comes out of Word pretending to be HTML into real HTML is to do it in two steps.

    Step 1. Remove all the tags from the page, leaving plain text.

    Step 2. Add in the correct tags for what you require.

    The second step may be a bit difficult to automate, though.
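
    The two steps could be sketched as below. This is only a rough illustration: word_html_to_text() and text_to_paragraphs() are hypothetical names, the input is assumed to be UTF-8, and step 2 handles nothing beyond plain paragraphs.

```php
<?php
// Step 1: reduce Word's HTML export to plain text.
function word_html_to_text(string $html): string {
    // strip_tags() keeps the text inside <style>/<script>, so drop those blocks first.
    $html = preg_replace('#<(style|script)\b[^>]*>.*?</\1>#is', '', $html);
    // Turn block-level closing tags into blank lines so paragraph breaks survive.
    $html = preg_replace('#</(p|div|h[1-6])>#i', "\n\n", $html);
    $text = strip_tags($html);
    // Convert entities such as &amp; back into literal characters.
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    return trim($text);
}

// Step 2: wrap each blank-line-separated block in a <p> element.
function text_to_paragraphs(string $text): string {
    $blocks = preg_split('/\n\s*\n/', trim($text), -1, PREG_SPLIT_NO_EMPTY);
    $out = array();
    foreach ($blocks as $block) {
        // Collapse internal whitespace, then escape for safe HTML output.
        $block = preg_replace('/\s+/', ' ', trim($block));
        $out[] = '<p>' . htmlspecialchars($block, ENT_QUOTES, 'UTF-8') . '</p>';
    }
    return implode("\n", $out);
}

$wordHtml = '<style>p{color:red}</style><p class="MsoNormal"><span>Hello <b>world</b></span></p><p>Second &amp; last</p>';
echo text_to_paragraphs(word_html_to_text($wordHtml)), "\n";
```

    Headings, lists and emphasis are lost in step 1, which is why restoring them in step 2 is hard to automate.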

