Hello all,
Has anyone seen a PHP script or technique that extracts all the text from a PDF file and outputs it as plain text?
Thanks!

PDFs can be locked so that the text can't be grabbed. That's why we use them for things we don't want customers changing, like contracts and invoices.
That said, I notice that Google Mail lets you open a PDF as HTML, and basically you get text... so the ability must exist for unlocked PDFs.


HA! I'll get back to you on how well they work.
Thanks Kyhberfabrikken! (that's a hard one!)


Those were not really web-based, were they? I need one that does it when someone pushes a button called TEXT.
Here is a function that sort of does it... really dirty output, though. I'm trying to figure out what it all does and then clean it up if possible.
Any suggestions are gladly accepted!
<?php
$test = pdf2string("file.pdf");
echo $test;

# Returns the text found in the PDF's compressed streams.
# Streams that fail to uncompress (images, non-Flate data) are skipped.
function pdf2string($sourcefile)
{
    $fp = fopen($sourcefile, 'rb');
    $content = fread($fp, filesize($sourcefile));
    fclose($fp);

    # Locate all text hidden within the stream and endstream tags
    $searchstart = 'stream';
    $searchend = 'endstream';
    $pdfdocument = "";
    $pos = 0;
    $pos2 = 0;
    $startpos = 0;

    # Iterate through each stream block
    while( $pos !== false && $pos2 !== false )
    {
        # Grab the next beginning and end tag locations
        $pos = strpos($content, $searchstart, $startpos);
        $pos2 = strpos($content, $searchend, $startpos + 1);
        if( $pos !== false && $pos2 !== false )
        {
            # Extract compressed text from between stream tags, dropping the
            # line break that follows "stream", and uncompress
            $start = $pos + strlen($searchstart);
            $textsection = ltrim(substr($content, $start, $pos2 - $start), "\r\n");
            $data = @gzuncompress($textsection);

            # Increase our PDF pointer past the section we just read
            $startpos = $pos2 + strlen($searchend) - 1;

            # Clean up text via a special function, unless uncompression failed
            if( $data !== false )
            {
                $pdfdocument .= ExtractText($data);
            }
        }
    }
    return $pdfdocument;
}

function ExtractText($postScriptData)
{
    $plainText = '';
    $textStart = 0;
    # Walk through every "(...)" literal string in the stream
    while( ($textStart = strpos($postScriptData, '(', $textStart)) !== false )
    {
        # Find the closing ")", skipping escaped ones ("\)")
        $textEnd = strpos($postScriptData, ')', $textStart + 1);
        while( $textEnd !== false && $postScriptData[$textEnd - 1] == '\\' )
        {
            $textEnd = strpos($postScriptData, ')', $textEnd + 1);
        }
        if( $textEnd === false ) { break; }

        $plainText .= substr($postScriptData, $textStart + 1, $textEnd - $textStart - 1);
        if( substr($postScriptData, $textEnd + 1, 1) == ']' ) // This adds quite some additional spaces between the words
        {
            $plainText .= ' ';
        }
        $textStart = $textEnd + 1;
    }
    return stripslashes($plainText);
}
?>
No, they are binaries. You can call them from PHP, using exec. (You need to install them at the server first, if they aren't already)
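A rough sketch of the exec approach, in case it helps. It assumes the binary is `pdftotext` (from Xpdf or Poppler), which the posts above do not actually name, and that it is installed and on the server's PATH. The two function names are made up for illustration.

```php
<?php
// Build the shell command. "pdftotext" is an assumed tool; passing "-" as the
// output file asks it to write the text to stdout instead of a .txt file.
function buildPdfToTextCommand($pdfPath)
{
    // escapeshellarg() quotes the path so spaces and shell characters are safe.
    return 'pdftotext ' . escapeshellarg($pdfPath) . ' -';
}

// Run the command and return the extracted text, or null if the file is
// unreadable or the binary exits with a non-zero status.
function pdfToTextViaExec($pdfPath)
{
    if (!is_readable($pdfPath)) {
        return null;
    }
    $outputLines = array();
    $exitCode = 1;
    exec(buildPdfToTextCommand($pdfPath), $outputLines, $exitCode);
    return $exitCode === 0 ? implode("\n", $outputLines) : null;
}

// Possible use behind a TEXT button:
//   header('Content-Type: text/plain');
//   echo pdfToTextViaExec('file.pdf');
```

Some shared hosts disable exec, so this only works where the host allows it.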





I found something very similar to what you have, and it does a pretty good job.
PHP Code:
function pdf2string($sourcefile) {
    $fp = fopen($sourcefile, 'rb');
    $content = fread($fp, filesize($sourcefile));
    fclose($fp);
    $searchstart = 'stream';
    $searchend = 'endstream';
    $pdfText = '';
    $pos = 0;
    $pos2 = 0;
    $startpos = 0;
    while ($pos !== false && $pos2 !== false) {
        $pos = strpos($content, $searchstart, $startpos);
        $pos2 = strpos($content, $searchend, $startpos + 1);
        if ($pos !== false && $pos2 !== false) {
            // The data begins after the "stream" keyword and its line break
            // (CRLF or LF) and ends before the line break preceding "endstream".
            $start = $pos + strlen($searchstart);
            if (substr($content, $start, 2) === "\r\n") {
                $start += 2;
            } else if (substr($content, $start, 1) === "\n") {
                $start++;
            }
            $end = $pos2;
            if (substr($content, $end - 2, 2) === "\r\n") {
                $end -= 2;
            } else if (substr($content, $end - 1, 1) === "\n") {
                $end--;
            }
            $textsection = substr($content, $start, max(0, $end - $start));
            $data = @gzuncompress($textsection);
            $pdfText .= pdfExtractText($data);
            $startpos = $pos2 + strlen($searchend) - 1;
        }
    }
    return preg_replace('/(\s)+/', ' ', $pdfText);
}

function pdfExtractText($psData) {
    // gzuncompress() returns false for streams it cannot inflate.
    if (!is_string($psData)) {
        return '';
    }
    $text = '';
    // Handle brackets in the text stream that could be mistaken for
    // the end of a text field. I'm sure you can do this as part of the
    // regular expression, but my skills aren't good enough yet.
    $psData = str_replace('\)', '##ENDBRACKET##', $psData);
    $psData = str_replace('\]', '##ENDSBRACKET##', $psData);
    preg_match_all(
        '/(T[wdcm*])[\s]*(\[([^\]]*)\]|\(([^\)]*)\))[\s]*Tj/si',
        $psData,
        $matches
    );
    for ($i = 0; $i < count($matches[0]); $i++) {
        if ($matches[3][$i] != '') {
            // Run another match over the contents.
            preg_match_all('/\(([^)]*)\)/si', $matches[3][$i], $subMatches);
            foreach ($subMatches[1] as $subMatch) {
                $text .= $subMatch;
            }
        } else if ($matches[4][$i] != '') {
            $text .= ($matches[1][$i] == 'Tc' ? ' ' : '') . $matches[4][$i];
        }
    }
    // Translate special characters and put back brackets.
    $trans = array(
        '...' => '…',
        '\205' => '…',
        '\221' => chr(145),
        '\222' => chr(146),
        '\223' => chr(147),
        '\224' => chr(148),
        '\226' => '-',
        '\267' => '•',
        '\(' => '(',
        '\[' => '[',
        '##ENDBRACKET##' => ')',
        '##ENDSBRACKET##' => ']',
        chr(133) => '-',
        chr(141) => chr(147),
        chr(142) => chr(148),
        chr(143) => chr(145),
        chr(144) => chr(146),
    );
    $text = strtr($text, $trans);
    return $text;
}

$sourcefile = 'February 2006.pdf';
$get = pdf2string($sourcefile);
echo $get;
Thanks @lorenw, this function is useful for me, too.


Oh man, that one's super great! Thanks lorenw! It cleans up the text really well.


I have a local install of XAMPP and this script works fine there, but on my hosted server it doesn't work. Any clues as to what PHP needs in order for it to run? I cannot figure it out.
Thanks!
You need the zlib extension. That is where gzuncompress() comes from.
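A quick way to check this on the hosted server, as a minimal sketch. The helper name `zlibIsAvailable` is made up here. Since the script calls `@gzuncompress()`, a missing extension may fail without showing any error, which would match the symptom described.

```php
<?php
// gzuncompress() is provided by PHP's zlib extension. Report whether both the
// extension and the function are present in this PHP build.
function zlibIsAvailable()
{
    return extension_loaded('zlib') && function_exists('gzuncompress');
}

if (!zlibIsAvailable()) {
    echo "zlib is not enabled in this PHP build, so pdf2string() cannot work.\n";
}
```

The output of `phpinfo()` also lists zlib when it is enabled.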





You're welcome for the script. I'm not sure where I dug it up. I needed it to make a short summary like Google does, and I searched for about a year, totally obsessed. Now the client doesn't need it, so it is sitting on my localhost, lol.
Maybe someday I can use it. I'm glad it now has a home, and I hope you get it working.
Cheers




Great script, thanks!
Here is a zip file of the script and a PDF:
http://www.sitepoint.com/forums/atta...0&d=1237880541
Last edited by ripcurlksm; Mar 24, 2009 at 01:52.
Some PDF files cannot be extracted with this script. Any update?




Yeah, I had some issues as well with the script here and the copy that I posted above. Some PDFs were not parsed properly.
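One possible explanation, offered as a guess rather than a diagnosis: the script only reads streams that gzuncompress() can inflate (the FlateDecode filter), so a PDF that is locked with encryption or that uses other stream filters would give little or no text. The sketch below is a crude text search over the raw file, not a real PDF parser, and the function name `pdfFilterReport` is made up for illustration.

```php
<?php
// Count how often each standard stream filter name appears in the raw PDF
// bytes, and note whether the file mentions an /Encrypt dictionary.
function pdfFilterReport($content)
{
    $filters = array('FlateDecode', 'LZWDecode', 'ASCII85Decode',
                     'ASCIIHexDecode', 'DCTDecode', 'CCITTFaxDecode');
    $report = array('encrypted' => strpos($content, '/Encrypt') !== false);
    foreach ($filters as $name) {
        $report[$name] = substr_count($content, '/' . $name);
    }
    return $report;
}

// Possible use:
//   print_r(pdfFilterReport(file_get_contents('file.pdf')));
```

If `encrypted` is true, or FlateDecode is rare compared with the other filters, that would point to a file this script cannot handle.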