Reading Word Document in PHP

Hi All

I am reading a Word document using the following code:

<?php
$filename = "C:/wamp/www/OpenID.doc";
$word = new COM("word.application") or die("Unable to instantiate Word");
$word->Documents->Open($filename);
$new_filename = substr($filename, 0, -4) . ".txt";
// the '2' parameter specifies saving in txt format
$word->Documents[1]->SaveAs($new_filename, 2);
$word->Documents[1]->Close(false);
$word->Quit();
//$word->Release();
$word = NULL;
unset($word);

$fh = fopen($new_filename, 'r');
// this is where we exit Hell
$contents = fread($fh, filesize($new_filename));
fclose($fh);
unlink($new_filename);
// escape the text so characters such as < and & display literally
echo "<pre>" . htmlspecialchars($contents) . "</pre>";

?>

But when it prints $contents in the browser, the Word formatting is missing. Can anybody suggest how to preserve the formatting of the document? It should be displayed in the browser as it appears in the Word document.

I don’t know this “word.application”, but when you save as a text file, you lose all formatting. Isn’t there a $word->Documents[1]->Display or something like that, which echoes the formatted document content?

I don’t think there is any way of converting from a Word document to equivalently formatted HTML without manually coding all the HTML tags yourself. If there were then Microsoft would have incorporated that into Word itself instead of the “Word to garbage that looks a bit like HTML” filter that it currently uses.

Change

// the '2' parameter specifies saving in txt format
$word->Documents[1]->SaveAs($new_filename,2);

to

// the '10' parameter specifies saving in filtered HTML format
$word->Documents[1]->SaveAs($new_filename, 10);
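
Putting that change into the original script, a rough sketch could look like the following. It assumes Windows, an installed copy of Word, and PHP's COM extension enabled. The helper names htmlTargetFor() and wordToFilteredHtml() are made up for illustration and do not come from any library. As far as I know, Word also writes any images to a companion "_files" folder next to the HTML file, which this sketch does not clean up.

```php
<?php
// Sketch only: assumes Windows, Microsoft Word, and PHP's COM extension.

// Hypothetical helper: same folder and base name as the document, with .html.
function htmlTargetFor(string $docPath): string
{
    $info = pathinfo($docPath);
    return $info['dirname'] . '/' . $info['filename'] . '.html';
}

// Hypothetical helper: save the document as filtered HTML and return the markup.
function wordToFilteredHtml(string $docPath): string
{
    $htmlPath = htmlTargetFor($docPath);
    $word = new COM('word.application');
    try {
        $word->Documents->Open($docPath);
        // the '10' parameter specifies saving in filtered HTML format
        $word->Documents[1]->SaveAs($htmlPath, 10);
        $word->Documents[1]->Close(false);
    } finally {
        // always shut Word down, even if the conversion failed
        $word->Quit();
        $word = null;
    }
    $html = file_get_contents($htmlPath);
    unlink($htmlPath);
    return $html === false ? '' : $html;
}

// Usage, with the path from the original post:
// echo wordToFilteredHtml('C:/wamp/www/OpenID.doc');
```

Note that the result is echoed as-is rather than wrapped in pre tags, since it is already HTML.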

Gmail does a fairly decent job of this, although it’s not perfect.

You mean Gmail does a reasonable job of understanding the garbage that Microsoft programs create instead of proper HTML. I don’t think Gmail has the ability to read Word documents directly.

Aye. Thanks for the correction. :wink:

The closest to HTML that you can get from a Word document is to open it in OpenOffice and use the HTML save option there, which will at least produce valid HTML even though it will discard some of the formatting to do so. The formatting it discards has no proper HTML equivalent.

OpenOffice does a fair job of reading Word documents, and it can write the output as HTML. You can run OpenOffice in a “headless” mode, i.e. as a command-line utility. With that, you can convert Word documents to HTML. I believe Google is using that, in some variant.
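
For what it's worth, here is a minimal sketch of calling a headless conversion from PHP. It assumes an office build whose soffice binary is on the PATH and accepts the --headless, --convert-to and --outdir options (LibreOffice does; I have not checked every OpenOffice version). The helper names buildConvertCommand() and convertWithOffice() are invented for this example.

```php
<?php
// Sketch only: assumes `soffice` is on the PATH and supports
// --headless, --convert-to and --outdir.

// Hypothetical helper: build the shell command for a headless conversion.
function buildConvertCommand(string $docPath, string $outDir): string
{
    return 'soffice --headless --convert-to html --outdir '
        . escapeshellarg($outDir) . ' ' . escapeshellarg($docPath);
}

// Hypothetical helper: run the conversion; returns the HTML, or null on failure.
function convertWithOffice(string $docPath, string $outDir): ?string
{
    exec(buildConvertCommand($docPath, $outDir), $output, $exitCode);
    // the converter writes <base name>.html into the output directory
    $htmlPath = $outDir . '/' . pathinfo($docPath, PATHINFO_FILENAME) . '.html';
    if ($exitCode !== 0 || !is_file($htmlPath)) {
        return null;
    }
    $html = file_get_contents($htmlPath);
    return $html === false ? null : $html;
}

// Usage, with the path from the original post and an example output folder:
// echo convertWithOffice('C:/wamp/www/OpenID.doc', 'C:/wamp/www/converted');
```

Passing both paths through escapeshellarg() keeps file names with spaces from breaking the command.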