  1. #1
    SitePoint Enthusiast
    Join Date
    Jan 2002
    Location
    Israel
    Posts
    57

    extracting text from pdf

    I've been trying to solve this for quite some time. I've searched the web and read the RFCs (1950, 1951) and the Adobe documentation, but I still can't figure it out.
    Any help will be greatly appreciated.

    Well, I have a PDF file.
    In the file itself, among all the other objects, you have the content object.
    The content object, and in fact many other objects in a PDF document, are compressed using the "FlateDecode" encoding, which as I understand it is some kind of open-source compression that is somewhat better than LZW.
    Now, I have done some reading and I know that the compressed data contains a tree of lengths, where the most common character is the one with the shortest path from the tree root, and so on.
    What I don't understand is how, from a specific chunk of compressed data that looks like lots of garbage, you can extract the tree and the characters.
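
    The "tree of lengths" idea can be made concrete. As far as RFC 1951 (section 3.2.2) goes, the tree itself is not stored: only a code length per symbol is, and the decoder rebuilds the same codes from those lengths. Below is a minimal PHP sketch of that rebuild. The function name buildCanonicalCodes is made up for illustration, and it covers only this one step, not a full inflate.

```php
<?php
// Rebuild canonical Huffman codes from per-symbol code lengths,
// following the three steps in RFC 1951, section 3.2.2.
// buildCanonicalCodes() is an illustrative name, not a library function.
function buildCanonicalCodes(array $lengths): array
{
    // Step 1: count how many symbols use each code length.
    $maxLen = $lengths ? max($lengths) : 0;
    $countPerLength = array_fill(0, $maxLen + 1, 0);
    foreach ($lengths as $len) {
        $countPerLength[$len]++;
    }
    $countPerLength[0] = 0; // a length of 0 means "symbol not used"

    // Step 2: find the smallest code value for each length.
    $nextCode = array();
    $code = 0;
    for ($bits = 1; $bits <= $maxLen; $bits++) {
        $code = ($code + $countPerLength[$bits - 1]) << 1;
        $nextCode[$bits] = $code;
    }

    // Step 3: hand out consecutive codes to the symbols, in symbol order.
    $codes = array();
    foreach ($lengths as $symbol => $len) {
        if ($len > 0) {
            $codes[$symbol] = str_pad(decbin($nextCode[$len]), $len, '0', STR_PAD_LEFT);
            $nextCode[$len]++;
        }
    }
    return $codes;
}
```

    With the RFC's own example lengths (A to E = 3, F = 2, G and H = 4) this gives F = 00, A = 010 and H = 1111, which matches the table in the RFC.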

    This is a copy-pasted section, 71 bytes long:

    X-1
    @ ~_/mr A{!o,L3BEgp̤>SރcJ `ݘK]
    ^


    How, from this, can I use FlateDecode to extract the tree, the characters and the actual data that is encoded there?
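
    A first step that needs no zlib at all is reading the headers. A FlateDecode stream is normally a zlib stream (RFC 1950): two header bytes, then DEFLATE blocks (RFC 1951), each starting with a 3-bit header that says whether it uses stored data, the fixed Huffman tables or dynamic ones. The sketch below only inspects those bytes. The name describeFlateHeader is hypothetical, and it assumes no preset dictionary is in use. The leading "X" in the pasted bytes is 0x58, whose low four bits are 8 (the deflate method number), so it at least looks like such a header.

```php
<?php
// Describe the 2-byte zlib header (RFC 1950) and the first DEFLATE block
// header (RFC 1951) of a FlateDecode stream. describeFlateHeader() is an
// illustrative name; it inspects bytes only and decompresses nothing.
// Assumption: the FDICT flag is not set, so DEFLATE data starts at byte 2.
function describeFlateHeader(string $data): array
{
    if (strlen($data) < 3) {
        throw new InvalidArgumentException('need at least 3 bytes');
    }
    $cmf = ord($data[0]);
    $flg = ord($data[1]);
    $first = ord($data[2]); // first byte of the DEFLATE bit stream

    $blockTypes = array('stored', 'fixed Huffman', 'dynamic Huffman', 'reserved');

    return array(
        'method'      => $cmf & 0x0F,                       // 8 means deflate
        'window_bits' => ($cmf >> 4) + 8,                   // 15 means a 32 KB window
        'header_ok'   => (($cmf << 8) | $flg) % 31 === 0,   // RFC 1950 check value
        'preset_dict' => (bool) ($flg & 0x20),              // FDICT flag
        'final_block' => (bool) ($first & 0x01),            // BFINAL, lowest bit
        'block_type'  => $blockTypes[($first >> 1) & 0x03], // BTYPE, next two bits
    );
}
```

    Only a "dynamic Huffman" block carries its own table of code lengths. A "fixed Huffman" block uses the lengths that RFC 1951 lists in section 3.2.6, so there is no tree to extract from the data in that case.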

    Now for the problem: I can't install anything that doesn't come with the basic PHP package, so no zlib and no other compression library.
    Anyone?

    Thanks in advance,
    Shanor.
    You can't see your self in the mirror with your eyes closed!

  2. #2
    SitePoint Member
    Join Date
    Jan 2008
    Posts
    5
    Did you ever get an answer on this post?

  3. #3
    SitePoint Addict silentcollision's Avatar
    Join Date
    Jun 2006
    Location
    New Zealand
    Posts
    388
    Whoa. Old thread.

    Quote Originally Posted by tuppas2 View Post
    Did you ever get an answer on this post?
    Are you wanting to extract text from PDF files?

    I've been trying to find a solution for a while now. Nothing concrete. The function below works with PDFs that are version 1.4, but nothing else.

    Code PHP:
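    // Pull the text out of every FlateDecode stream in a PDF file.
    // Needs the zlib extension for gzuncompress(), plus pdfExtractText(),
    // which is a separate function and is not shown in this post.
    function pdf2string($sourcefile) {
        $content = file_get_contents($sourcefile);
        if ($content === false) {
            return '';
        }
        $searchstart = 'stream';
        $searchend = 'endstream';
        $pdfText = '';
        $startpos = 0;
        while (($pos = strpos($content, $searchstart, $startpos)) !== false) {
            // Step past the keyword and the end-of-line marker after it.
            // The bytes are compared as strings: comparing a character
            // with the integer 0x0d or 0x0a never matches.
            $pos += strlen($searchstart);
            if (substr($content, $pos, 2) === "\r\n") {
                $pos += 2;
            } else if (substr($content, $pos, 1) === "\n") {
                $pos++;
            }
            $pos2 = strpos($content, $searchend, $pos);
            if ($pos2 === false) {
                break;
            }
            // Drop the end-of-line marker in front of "endstream".
            $length = $pos2 - $pos;
            if ($length >= 2 && substr($content, $pos2 - 2, 2) === "\r\n") {
                $length -= 2;
            } else if ($length >= 1 && ($content[$pos2 - 1] === "\n" || $content[$pos2 - 1] === "\r")) {
                $length--;
            }
            $textsection = substr($content, $pos, $length);
            // gzuncompress() returns false for streams that are not zlib
            // data; pdfExtractText() maps any non-string to ''.
            $data = @gzuncompress($textsection);
            $pdfText .= pdfExtractText($data);
            // Resume after "endstream" so its tail is not read as "stream".
            $startpos = $pos2 + strlen($searchend);
        }
        return preg_replace('/\s+/', ' ', $pdfText);
    }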
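
    Since the result seems to depend on the PDF version, it may help to check which version a file actually declares before digging into the extraction itself. A small sketch, assuming the file starts with the usual "%PDF-x.y" header line; pdfHeaderVersion is a made-up helper name, not part of the function above.

```php
<?php
// Read the version a PDF declares in its header line, e.g. "%PDF-1.4".
// pdfHeaderVersion() is a hypothetical helper for illustration only.
function pdfHeaderVersion(string $content): ?string
{
    // Only look near the start of the file, where the header is expected.
    if (preg_match('/%PDF-(\d+\.\d+)/', substr($content, 0, 1024), $m)) {
        return $m[1];
    }
    return null; // no recognisable header found
}
```

    Calling it on the result of file_get_contents() for a few working and non-working files would show whether the version really is the common factor.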

    There are quite a few more functions here, but most of them don't work.

    If you happen to come across a solution, please let me know.

    Edit: Have you seen this thread?

  4. #4
    SitePoint Wizard lorenw's Avatar
    Join Date
    Feb 2005
    Location
    was rainy Oregon now sunny Florida
    Posts
    1,094
    I use this and it does the job.
    PHP Code:
    // Pull the text out of every FlateDecode stream in a PDF file.
    // Needs the zlib extension for gzuncompress().
    function pdf2string($sourcefile) {

        $content = file_get_contents($sourcefile);
        if ($content === false) {
            return '';
        }

        $searchstart = 'stream';
        $searchend = 'endstream';
        $pdfText = '';
        $startpos = 0;

        while (($pos = strpos($content, $searchstart, $startpos)) !== false) {

            // Step past the keyword and the end-of-line marker after it.
            // The bytes are compared as strings: comparing a character
            // with the integer 0x0d or 0x0a never matches.
            $pos += strlen($searchstart);
            if (substr($content, $pos, 2) === "\r\n") {
                $pos += 2;
            } else if (substr($content, $pos, 1) === "\n") {
                $pos++;
            }

            $pos2 = strpos($content, $searchend, $pos);
            if ($pos2 === false) {
                break;
            }

            // Drop the end-of-line marker in front of "endstream".
            $length = $pos2 - $pos;
            if ($length >= 2 && substr($content, $pos2 - 2, 2) === "\r\n") {
                $length -= 2;
            } else if ($length >= 1 && ($content[$pos2 - 1] === "\n" || $content[$pos2 - 1] === "\r")) {
                $length--;
            }

            $textsection = substr($content, $pos, $length);
            // gzuncompress() returns false for streams that are not zlib data.
            $data = @gzuncompress($textsection);
            $pdfText .= pdfExtractText($data);
            // Resume after "endstream" so its tail is not read as "stream".
            $startpos = $pos2 + strlen($searchend);
        }

        return preg_replace('/\s+/', ' ', $pdfText);
    }

    function pdfExtractText($psData) {

        if (!is_string($psData)) {
            return '';
        }

        $text = '';

        // Handle brackets in the text stream that could be mistaken for
        // the end of a text field. I'm sure you can do this as part of the
        // regular expression, but my skills aren't good enough yet.
        $psData = str_replace('\)', '##ENDBRACKET##', $psData);
        $psData = str_replace('\]', '##ENDSBRACKET##', $psData);

        preg_match_all(
            '/(T[wdcm*])[\s]*(\[([^\]]*)\]|\(([^\)]*)\))[\s]*Tj/si',
            $psData,
            $matches
        );
        for ($i = 0; $i < count($matches[0]); $i++) {
            if ($matches[3][$i] != '') {
                // Run another match over the contents.
                preg_match_all('/\(([^)]*)\)/si', $matches[3][$i], $subMatches);
                foreach ($subMatches[1] as $subMatch) {
                    $text .= $subMatch;
                }
            } else if ($matches[4][$i] != '') {
                $text .= ($matches[1][$i] == 'Tc' ? ' ' : '') . $matches[4][$i];
            }
        }

        // Translate special characters and put back brackets.
        $trans = array(
            '...'             => '…',
            '\205'            => '…',
            '\221'            => chr(145),
            '\222'            => chr(146),
            '\223'            => chr(147),
            '\224'            => chr(148),
            '\226'            => '-',
            '\267'            => '•',
            '\('              => '(',
            '\['              => '[',
            '##ENDBRACKET##'  => ')',
            '##ENDSBRACKET##' => ']',
            chr(133)          => '-',
            chr(141)          => chr(147),
            chr(142)          => chr(148),
            chr(143)          => chr(145),
            chr(144)          => chr(146),
        );
        $text = strtr($text, $trans);

        return $text;
    }

    $sourcefile = 'February.pdf';
    $get = pdf2string($sourcefile);
    echo $get;
    What I lack in acuracy I make up for in misteaks

  5. #5
    SitePoint Addict silentcollision's Avatar
    Join Date
    Jun 2006
    Location
    New Zealand
    Posts
    388
    Quote Originally Posted by lorenw View Post
    I use this and it does the job.
    I can only get that working for PDFs using version 1.4. No other version will work.

  6. #6
    SitePoint Member
    Join Date
    Nov 2007
    Posts
    2
    I am using the function posted by 'lorenw', and need to be able to generate the plain-text string with word breaks (so as to make it possible to search the string for the occurrence of phrases). Any help would be greatly appreciated!!

  7. #7
    SitePoint Wizard lorenw's Avatar
    Join Date
    Feb 2005
    Location
    was rainy Oregon now sunny Florida
    Posts
    1,094
    I posted that and have never actually used it in production.

    Doesn't echo $get; give you word breaks? I have an accompanying function that echoes out the first forty words, and it relies on word breaks (spaces).

    That script is probably 2 years old by now and worked last time I checked.

    Anyway, $get should give you a text string.
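
    The accompanying function was not posted, so the following is only a guess at what a "first forty words" helper could look like. The name firstWords and its parameters are invented for illustration. It splits on whitespace, which is why it depends on the extracted string containing spaces.

```php
<?php
// Return the first $limit whitespace-separated words of a string.
// firstWords() is a guess at such a helper; the original function
// was never posted in the thread.
function firstWords(string $text, int $limit = 40): string
{
    // Split on runs of whitespace and drop empty pieces.
    $words = preg_split('/\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);
    return implode(' ', array_slice($words, 0, $limit));
}
```

    If the input has no spaces at all, the whole text counts as a single word and comes back unchanged, which is a quick way to spot the missing word breaks.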
    What I lack in acuracy I make up for in misteaks

  8. #8
    SitePoint Member
    Join Date
    Nov 2007
    Posts
    2
    Thanks for the quick reply! $get gives me a string with all the characters in the pdf file without any spaces (word breaks) - I've tried it on several different files, and all yield similar results. Could you post that second function that you mentioned??

    Thanks for the help.
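
    One possible cause of the missing word breaks: many PDFs show text with arrays that mix strings and numbers, where a number shifts the next glyph (in thousandths of a text-space unit) and negative values widen the gap. The posted pdfExtractText() keeps only the strings from such arrays, so a gap that was encoded as a number disappears. Below is a rough sketch of a workaround that inserts a space for large negative adjustments. The function name tjArrayToText and the 200-unit threshold are assumptions, and the right threshold will vary per document.

```php
<?php
// Turn the inside of a text array, e.g. "(Hel)-20(lo)-300(world)", into
// text, adding a space wherever the adjustment between two strings is a
// large negative number. The function name and the 200-unit threshold are
// assumptions for illustration. Escape sequences inside strings are left
// as they are, and nested unescaped parentheses are not handled.
function tjArrayToText(string $arrayBody, float $gapThreshold = 200.0): string
{
    // Match either a (string) with backslash escapes, or a number.
    preg_match_all(
        '/\(((?:\\\\.|[^\\\\()])*)\)|(-?\d*\.?\d+)/s',
        $arrayBody,
        $tokens,
        PREG_SET_ORDER
    );

    $text = '';
    foreach ($tokens as $token) {
        if (isset($token[2]) && $token[2] !== '') {
            // A number: a large negative value is treated as a word gap.
            if ((float) $token[2] <= -$gapThreshold && $text !== '') {
                $text .= ' ';
            }
        } else {
            // A string: append its contents.
            $text .= $token[1];
        }
    }
    return $text;
}
```

    Small adjustments such as -20 are treated as kerning and ignored, so "(Hel)-20(lo)-300(world)" comes out as "Hello world".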

  9. #9
    SitePoint Wizard frank1's Avatar
    Join Date
    Oct 2005
    Posts
    1,392
    Well, this was very helpful.

    I will have a look into it.

    Anyway, the best use of this kind of thing I have seen is here:
    http://www.olivesoftware.com/demos/d...nver_post.html

    I am not able to convert those (big) PDFs to text and define those sizes.
    (The rest is easy: permission management, SEO things, rewriting and all.)
    Just that part.
    I have a general idea. If any experts are ready to work on that part commercially, PM me.

    I feel it is not against the SitePoint TOS to say so. I actually want an expert to assist me or to do the hard part, but I don't think people will do it if I ask for it for free.

    Thanks.

  10. #10
    SitePoint Wizard lorenw's Avatar
    Join Date
    Feb 2005
    Location
    was rainy Oregon now sunny Florida
    Posts
    1,094
    I just tried reading a number of PDFs. Some could not be read; however, all of the PDFs that could be read did have spaces between the words.
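
    For the files that cannot be read, a quick look at the raw bytes may hint at the reason. A reader built on gzuncompress() can only handle streams whose filter is FlateDecode, and if the file declares encryption, the stream data has to be decrypted before it can be inflated. The sketch below collects those two hints. pdfDiagnose is a hypothetical helper that scans with regular expressions, not a real PDF parser.

```php
<?php
// Collect hints about why a PDF may yield no text with a FlateDecode-only
// reader: which stream filters it names and whether it declares encryption.
// pdfDiagnose() is a hypothetical helper; it scans raw bytes and does not
// parse the PDF object structure.
function pdfDiagnose(string $content): array
{
    // Filter names are PDF name objects such as /FlateDecode or /LZWDecode.
    preg_match_all('#/(\w+Decode)\b#', $content, $m);
    $filters = array_count_values($m[1]);
    ksort($filters);

    return array(
        'filters'   => $filters,
        // An /Encrypt entry means streams must be decrypted before inflating.
        'encrypted' => strpos($content, '/Encrypt') !== false,
    );
}
```

    A file that lists only other filters, or that reports encrypted as true, would not be expected to work with the function posted earlier.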

    I just Googled for
    "php" "pdf to text"

    You may find a quick answer there. It seems to be a popular topic.
    What I lack in acuracy I make up for in misteaks

