Thread: PDF to TEXT???

  1. #1
    SitePoint Addict buildakicker's Avatar
    Join Date
    Jun 2005
    Location
    NorCal
    Posts
    378
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)

    PDF to TEXT???

    Hello all,

    Have any of you seen a script or technique in PHP that extracts all the text from a PDF file and displays it as plain text?

    Thanks!
    SKILEASES.COM - FREE rental listings!
    WILDFIREBLOG.COM - Wildland Fire microblog!

  2. #2
    SitePoint Zealot mcahill's Avatar
    Join Date
    May 2002
    Location
    Manchaug, MA, USA
    Posts
    180
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)

    I don't think so...

    PDFs can be locked so you can't grab the text. That's why we use them for stuff we don't want customers changing, like contracts and invoices.

    That said, I notice that Google Mail lets you open a PDF as HTML, and you basically get the text, so the ability must exist for unlocked PDFs.
    mcahill
    Reel-Time.com - Saltwater Fly Fishing
    The Vario Blog
    VarioCreative.com

  3. #3
    SitePoint Wizard silver trophy kyberfabrikken's Avatar
    Join Date
    Jun 2004
    Location
    Copenhagen, Denmark
    Posts
    6,157
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Take your pick: pdftotext or pdftohtml

  4. #4
    SitePoint Addict buildakicker's Avatar
    HA! I'll get back to you on how well they work.

    Thanks kyberfabrikken! (that's a hard one!)

  5. #5
    SitePoint Addict buildakicker's Avatar
    Those were not really web based, were they? I need one that does it when someone pushes a button called TEXT.

    Here is a function that sort of does it. The output is really dirty, though. I'm trying to figure out what it all does and then clean it up if possible.

    Any suggestions are gladly accepted!
    <?php
    $test = pdf2string("file.pdf");
    echo $test;

    # Returns -1 if the file cannot be opened
    function pdf2string($sourcefile)
    {
        $fp = @fopen($sourcefile, 'rb');
        if( $fp === false ) { return -1; }
        $content = fread($fp, filesize($sourcefile));
        fclose($fp);

        # Locate all text hidden within the stream and endstream tags
        $searchstart = 'stream';
        $searchend = 'endstream';
        $pdfdocument = "";

        $pos = 0;
        $pos2 = 0;
        $startpos = 0;
        # Iterate through each stream block
        while( $pos !== false && $pos2 !== false )
        {
            # Grab the beginning and end tag locations of the next block
            $pos = strpos($content, $searchstart, $startpos);
            $pos2 = strpos($content, $searchend, $startpos + 1);
            if( $pos !== false && $pos2 !== false )
            {
                # Extract compressed text from between stream tags and uncompress
                $textsection = substr($content, $pos + strlen($searchstart) + 2, $pos2 - $pos - strlen($searchstart) - 1);
                $data = @gzuncompress($textsection);
                # Increase our PDF pointer past the section we just read
                $startpos = $pos2 + strlen($searchend) - 1;
                # Skip streams that do not uncompress; clean up the rest via a special function
                if( $data !== false )
                {
                    $pdfdocument = $pdfdocument . ExtractText($data);
                }
            }
        }

        return $pdfdocument;
    }

    function ExtractText($postScriptData)
    {
        $plainText = '';
        $textStart = 0;
        # Walk through every "(...)" literal string in the stream
        while( ($textStart = strpos($postScriptData, '(', $textStart)) !== false
            && ($textEnd = strpos($postScriptData, ')', $textStart + 1)) !== false )
        {
            $plainText .= substr($postScriptData, $textStart + 1, $textEnd - $textStart - 1);
            if( substr($postScriptData, $textEnd + 1, 1) == ']' ) // This adds quite some additional spaces between the words
            {
                $plainText .= ' ';
            }

            # Continue searching after the closing bracket we just consumed
            $textStart = $textEnd;
        }

        return stripslashes($plainText);
    }
    ?>

  6. #6
    SitePoint Wizard silver trophy kyberfabrikken's Avatar
    Quote Originally Posted by buildakicker View Post
    Those were not really web based were they? I need one that does it when someone pushes a button called TEXT.
    No, they are binaries. You can call them from PHP using exec. (You need to install them on the server first, if they aren't there already.)
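
    As a rough illustration of calling the binary from PHP, here is a minimal sketch. It assumes pdftotext is installed and on the server's PATH, and that the server is Unix-like. The helper names buildPdfToTextCommand and pdfToText are made up for this example; the trailing "-" tells pdftotext to write the text to standard output instead of a file.

```php
<?php
// Hypothetical helper: builds the shell command for pdftotext.
// escapeshellarg() quotes the path so spaces and shell characters are safe.
// The trailing "-" asks pdftotext to write to standard output.
function buildPdfToTextCommand(string $pdfPath): string
{
    return 'pdftotext ' . escapeshellarg($pdfPath) . ' -';
}

// Hypothetical helper: runs pdftotext and returns the extracted text,
// or null when the command exits with a non-zero status
// (for example when the binary is not installed).
function pdfToText(string $pdfPath): ?string
{
    $outputLines = [];
    $exitCode = 0;
    exec(buildPdfToTextCommand($pdfPath), $outputLines, $exitCode);
    if ($exitCode !== 0) {
        return null;
    }
    return implode("\n", $outputLines);
}
```

    A page behind the TEXT button could then call pdfToText() on the uploaded or stored file and echo the result inside a pre element, after passing it through htmlspecialchars().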

  7. #7
    SitePoint Wizard lorenw's Avatar
    Join Date
    Feb 2005
    Location
    was rainy Oregon now sunny Florida
    Posts
    1,094
    Mentioned
    2 Post(s)
    Tagged
    0 Thread(s)
    I found something very similar to what you have, and it does a pretty good job.
    PHP Code:
    function pdf2string($sourcefile) {

        $fp = fopen($sourcefile, 'rb');
        $content = fread($fp, filesize($sourcefile));
        fclose($fp);

        $searchstart = 'stream';
        $searchend = 'endstream';
        $pdfText = '';
        $pos = 0;
        $pos2 = 0;
        $startpos = 0;

        while ($pos !== false && $pos2 !== false) {

            $pos = strpos($content, $searchstart, $startpos);
            $pos2 = strpos($content, $searchend, $startpos + 1);

            if ($pos !== false && $pos2 !== false) {

                if ($content[$pos] == 0x0d && $content[$pos + 1] == 0x0a) {
                    $pos += 2;
                } else if ($content[$pos] == 0x0a) {
                    $pos++;
                }

                if ($content[$pos2 - 2] == 0x0d && $content[$pos2 - 1] == 0x0a) {
                    $pos2 -= 2;
                } else if ($content[$pos2 - 1] == 0x0a) {
                    $pos2--;
                }

                $textsection = substr(
                    $content,
                    $pos + strlen($searchstart) + 2,
                    $pos2 - $pos - strlen($searchstart) - 1
                );
                $data = @gzuncompress($textsection);
                $pdfText .= pdfExtractText($data);
                $startpos = $pos2 + strlen($searchend) - 1;

            }
        }

        return preg_replace('/(\s)+/', ' ', $pdfText);

    }

    function pdfExtractText($psData) {

        if (!is_string($psData)) {
            return '';
        }

        $text = '';

        // Handle brackets in the text stream that could be mistaken for
        // the end of a text field. I'm sure you can do this as part of the
        // regular expression, but my skills aren't good enough yet.
        $psData = str_replace('\)', '##ENDBRACKET##', $psData);
        $psData = str_replace('\]', '##ENDSBRACKET##', $psData);

        preg_match_all(
            '/(T[wdcm*])[\s]*(\[([^\]]*)\]|\(([^\)]*)\))[\s]*Tj/si',
            $psData,
            $matches
        );
        for ($i = 0; $i < sizeof($matches[0]); $i++) {
            if ($matches[3][$i] != '') {
                // Run another match over the contents.
                preg_match_all('/\(([^)]*)\)/si', $matches[3][$i], $subMatches);
                foreach ($subMatches[1] as $subMatch) {
                    $text .= $subMatch;
                }
            } else if ($matches[4][$i] != '') {
                $text .= ($matches[1][$i] == 'Tc' ? ' ' : '') . $matches[4][$i];
            }
        }

        // Translate special characters and put back brackets.
        $trans = array(
            '...'             => '…',
            '\205'            => '…',
            '\221'            => chr(145),
            '\222'            => chr(146),
            '\223'            => chr(147),
            '\224'            => chr(148),
            '\226'            => '-',
            '\267'            => '•',
            '\('              => '(',
            '\['              => '[',
            '##ENDBRACKET##'  => ')',
            '##ENDSBRACKET##' => ']',
            chr(133)          => '-',
            chr(141)          => chr(147),
            chr(142)          => chr(148),
            chr(143)          => chr(145),
            chr(144)          => chr(146),
        );
        $text = strtr($text, $trans);

        return $text;
    }

    $sourcefile = 'February 2006.pdf';
    $get = pdf2string($sourcefile);
    echo $get;

  8. #8
    SitePoint Enthusiast
    Join Date
    Aug 2007
    Posts
    47
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Thanks @lorenw, this function is useful for me, too.

  9. #9
    SitePoint Addict buildakicker's Avatar
    Oh man, that one is super great! Thanks lorenw! It cleans up the text really well.

  10. #10
    SitePoint Addict buildakicker's Avatar
    I have a local install of XAMPP and this script works fine there. On my hosted server, however, it doesn't work. Any clues as to what PHP needs in order for it to run? I cannot figure it out.

    Thanks!

  11. #11
    SitePoint Wizard silver trophy kyberfabrikken's Avatar
    You need zlib. The script calls gzuncompress, which comes from PHP's zlib extension.
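
    A quick way to check this on a hosted server is sketched below. The helper name zlibAvailable is made up for this example. Note that the @ in front of gzuncompress in the scripts above hides warnings but not the fatal error raised when the function does not exist at all, which may be why the page simply fails.

```php
<?php
// Hypothetical helper: reports whether the zlib functions the scripts rely on exist.
function zlibAvailable(): bool
{
    return extension_loaded('zlib') && function_exists('gzuncompress');
}

if (zlibAvailable()) {
    // Round trip a short string to confirm compression works end to end.
    echo gzuncompress(gzcompress('zlib works')), "\n";
} else {
    echo "zlib is missing; gzuncompress() is not defined on this server\n";
}
```

    If the check reports that zlib is missing, the extension has to be enabled by the host; nothing in the script itself can work around that.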

  12. #12
    SitePoint Wizard lorenw's Avatar
    You're welcome for the script. I'm not sure where I dug it up. I needed it to make a short summary like Google does, and I searched for about a year, totally obsessed. Now the client doesn't need it, so it is sitting on my localhost, lol.

    Maybe someday I can use it. I'm glad it now has a home, and I hope you get it working.

    Cheers

  13. #13
    SitePoint Guru ripcurlksm's Avatar
    Join Date
    Aug 2004
    Location
    San Clemente, CA
    Posts
    857
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    great script, thanks!

    here is a zip file of the script and a pdf

    http://www.sitepoint.com/forums/atta...0&d=1237880541
    Last edited by ripcurlksm; Mar 24, 2009 at 01:52.

  14. #14
    SitePoint Member
    Join Date
    Sep 2007
    Posts
    20
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Some PDF files cannot be extracted with this script. Any update?

  15. #15
    SitePoint Guru ripcurlksm's Avatar
    Yeah, I had some issues as well with the script here and the copy that I posted above. Some PDFs were not parsed properly.
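
    One possible reason, offered as a sketch rather than a fix: both scripts in this thread assume a fixed number of bytes after the stream keyword and treat every stream as zlib-compressed. In a PDF, the stream keyword may be followed by either CRLF or a lone LF, and streams can use other filters or none at all. Text drawn with embedded fonts and custom encodings will not come out readable from this approach either. The hypothetical helper below (extractInflatedStreams is a made-up name) only handles both line endings and skips streams that do not inflate; it does not address the other cases.

```php
<?php
// Hypothetical helper: returns the inflated contents of every stream in a PDF
// that gzuncompress() can decode. Streams that fail to inflate are skipped.
function extractInflatedStreams(string $content): array
{
    $streams = [];
    $offset = 0;
    while (($start = strpos($content, 'stream', $offset)) !== false) {
        $dataStart = $start + strlen('stream');
        // The keyword is followed by CRLF or a lone LF; skip whichever is present.
        if (substr($content, $dataStart, 2) === "\r\n") {
            $dataStart += 2;
        } elseif (substr($content, $dataStart, 1) === "\n") {
            $dataStart += 1;
        }

        $end = strpos($content, 'endstream', $dataStart);
        if ($end === false) {
            break;
        }
        $data = substr($content, $dataStart, $end - $dataStart);

        // Drop one end-of-line marker before "endstream", if there is one.
        // Assumption: the compressed data itself does not end in CR or LF.
        if (substr($data, -2) === "\r\n") {
            $data = substr($data, 0, -2);
        } elseif (in_array(substr($data, -1), ["\r", "\n"], true)) {
            $data = substr($data, 0, -1);
        }

        $inflated = @gzuncompress($data);
        if ($inflated !== false) {
            $streams[] = $inflated;
        }
        // Continue after this "endstream" so its "stream" suffix is not matched.
        $offset = $end + strlen('endstream');
    }
    return $streams;
}
```

    Each returned string could then be passed to pdfExtractText() from the script above. For files that still fail, the pdftotext binary mentioned earlier in the thread is likely the more dependable route.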

