Regular expression on MS Word content

Amentotaxus · September 2, 2013, 11:07am

Hi,

I am trying to extract some pieces of text from a string obtained with file_get_contents($ms_word_file) from an MS Word file. The big problem is the large number of HTML tags introduced by Word. If I use strip_tags, I’ll lose the line breaks that separate the text. If I don’t use it and try to get every line of content with explode(PHP_EOL, $file_content), I find that many of the original lines in the MS Word file are each broken into multiple lines.
My goal is to get all the original lines into an array and then apply strip_tags to every element of that array. That will simplify the task of applying regular expressions.
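
To make the goal concrete, here is a minimal sketch of the kind of helper I have in mind. It is not tested against real Word output. It assumes the file is HTML as saved by Word, UTF-8 encoded, and that each original line ends with a closing block tag such as </p> or with <br>. The function name word_html_to_lines and the list of tags are my own hypothetical choices.

```php
<?php
// Hypothetical helper: turn Word-generated HTML into an array with one
// plain-text element per original line (paragraph or <br>-separated line).
function word_html_to_lines(string $html): array
{
    // 1. Drop <head>, <style> and <script> blocks; strip_tags() would
    //    otherwise keep their text content (CSS rules, etc.).
    $html = preg_replace('~<(head|style|script)\b[^>]*>.*?</\1\s*>~is', '', $html);

    // 2. Collapse all whitespace, including the hard line wraps found inside
    //    a single paragraph of the source, into single spaces.
    $html = preg_replace('/\s+/', ' ', $html);

    // 3. Turn the tags assumed to end an original line into real newlines.
    $html = preg_replace('~<br\s*/?>|</(?:p|div|h[1-6]|li|tr)\s*>~i', "\n", $html);

    // 4. Remove the remaining tags and decode entities such as &amp;.
    $text = html_entity_decode(strip_tags($html), ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // 5. Split on the newlines created in step 3, trim, and drop empty lines.
    $lines = array_map('trim', explode("\n", $text));
    return array_values(array_filter($lines, 'strlen'));
}

// Usage with a small made-up sample instead of a real file:
$sample = "<p class=MsoNormal>First line\r\nwrapped <b>by</b>\r\nWord</p>\r\n"
        . "<p>Second &amp; line<br>Third line</p>";
print_r(word_html_to_lines($sample));
```

The idea is that the line breaks in the raw string are not meaningful, so they are thrown away first and the real line boundaries are rebuilt from the tags before strip_tags runs. The regular expressions can then be applied to each element of the resulting array.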

I would appreciate any advice that helps me solve this problem with my script.