  1. #1
    SitePoint Member
    Join Date
    Oct 2006
    Posts
    3

    Converting PDF content into Data for import - XML/PHP/MYSQL?

    Hello peeps

    I wondered if anyone knew of a method to import text and image content from PDFs into a MySQL database? Or to convert a set of PDFs into Word files?

    I did search for some tools to do the conversion, but the results are not brilliant.

    As you can imagine, some PDFs are laid out in multiple columns, which causes the conversion to put text in the wrong place!

    I had heard, though, that it may be possible to add XML tags to the PDF in the appropriate places and then run a process that imports the correct fields?

    Something like:

    <title>Article Title Appears Here</title>
    <intro>Article Intro Here</intro>
    <photo>Main photo</photo>
    <author> and so on....
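
    A rough sketch of what the import step might look like, assuming the tagged text has already been pulled out of the PDF as a string. The function names, the `articles` table, and its column names are hypothetical, and the photo field is assumed to hold a file name rather than the image itself:

```php
<?php
// Hypothetical field names, taken from the tags in the example above.
const ARTICLE_FIELDS = ['title', 'intro', 'photo', 'author'];

// Parse one tagged article into an associative array of field => text.
// The fragment has no single root element, so one is added before parsing.
function parseTaggedArticle(string $tagged): array {
    $previous = libxml_use_internal_errors(true);
    $xml = simplexml_load_string('<article>' . $tagged . '</article>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    if ($xml === false) {
        throw new InvalidArgumentException('Tagged text is not well-formed XML');
    }
    $row = [];
    foreach (ARTICLE_FIELDS as $field) {
        // A missing tag becomes an empty string.
        $row[$field] = isset($xml->$field) ? trim((string) $xml->$field) : '';
    }
    return $row;
}

// Insert one parsed article using a prepared statement.
// Assumes a table `articles` whose columns match ARTICLE_FIELDS.
function importArticle(PDO $pdo, array $row): void {
    $stmt = $pdo->prepare(
        'INSERT INTO articles (title, intro, photo, author)
         VALUES (:title, :intro, :photo, :author)'
    );
    $stmt->execute($row);
}
```

    Text extracted from a PDF can contain characters such as & or <, which would need escaping before a parse like this succeeds.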

    Any help will be appreciated.

  2. #2
    SitePoint Member
    Join Date
    Nov 2006
    Posts
    2
    I am looking for that too, and this is what I have found so far. I am trying to make it work.

    <?php

    $text = pdf2string("file.pdf");
    echo $text;

    // Collect the text found in the Flate-compressed streams of a PDF file.
    // Returns an empty string if the file cannot be read.
    function pdf2string($sourcefile){
        $content = @file_get_contents($sourcefile);
        if ($content === false) return '';

        $searchstart = 'stream';
        $searchend = 'endstream';
        $pdfdocument = '';
        $startpos = 0;

        while (($pos = strpos($content, $searchstart, $startpos)) !== false){
            $pos2 = strpos($content, $searchend, $pos + strlen($searchstart));
            if ($pos2 === false) break;

            // Continue the next search after this "endstream" keyword.
            $startpos = $pos2 + strlen($searchend);

            // Stream data begins after the line break that follows "stream"...
            $start = $pos + strlen($searchstart);
            if (substr($content, $start, 2) === "\r\n") $start += 2;
            else if (substr($content, $start, 1) === "\n") $start++;

            // ...and ends before the line break that precedes "endstream".
            $end = $pos2;
            if (substr($content, $end - 2, 2) === "\r\n") $end -= 2;
            else if (substr($content, $end - 1, 1) === "\n") $end--;
            if ($end <= $start) continue;

            // Streams that are not zlib-compressed (images, other filters) are skipped.
            $data = @gzuncompress(substr($content, $start, $end - $start));
            if ($data === false) continue;

            $pdfdocument .= ExtractText2($data);
        }
        return $pdfdocument;
    }

    // Pull out every "(...)" string that is directly followed by the Tj (show text) operator.
    function ExtractText2($postScriptData){
        $text = '';
        $textStart = 0;
        $len = strlen($postScriptData);

        while ($textStart < $len){
            $ini = strpos($postScriptData, '(', $textStart);
            if ($ini === false) break;
            $end = strpos($postScriptData, ')', $ini + 1);
            if ($end === false) break;

            // Expect exactly one character between ")" and "Tj", as in "(Hello) Tj".
            $valtext = strpos($postScriptData, 'Tj', $end + 1);
            if ($valtext === $end + 2)
                $text .= substr($postScriptData, $ini + 1, $end - $ini - 1);

            $textStart = $end + 1;
        }

        // Replace a few octal escapes (accented vowels, curly quotes) with plain characters.
        $trans = array("\\341" => "a","\\351" => "e","\\355" => "i","\\363" => "o","\\223" => "","\\224" => "");
        return strtr($text, $trans);
    }
    ?>
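
    The $trans table above only covers a handful of octal escapes. A possible generalisation, as a sketch only: decode every escape in a PDF string and convert the result to UTF-8. The function name is hypothetical, and it assumes the text uses the Windows-1252 (WinAnsi) encoding, which will not hold for every font. It also needs the mbstring extension.

```php
<?php
// Decode the escape sequences of a PDF literal string (the text between the parentheses).
// Assumption: the bytes are Windows-1252 (WinAnsi) encoded; real fonts may use other encodings.
function decodePdfString(string $raw): string {
    // Single-character escapes; a backslash before a line break means "line continues".
    $simple = ['n' => "\n", 'r' => "\r", 't' => "\t", 'b' => "\x08", 'f' => "\x0C", "\n" => ''];

    $bytes = preg_replace_callback(
        '/\\\\([0-7]{1,3}|.)/s',
        function (array $m) use ($simple) {
            $esc = $m[1];
            if (strspn($esc, '01234567') === strlen($esc)) {
                return chr(octdec($esc) & 0xFF);   // octal character code, e.g. \341
            }
            return $simple[$esc] ?? $esc;          // \( \) \\ and unknown escapes keep the character
        },
        $raw
    );

    return mb_convert_encoding($bytes, 'UTF-8', 'Windows-1252');
}
```

    With this, the accented letters are kept instead of being flattened to a, e, i, o.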

  3. #3
    SitePoint Member
    Join Date
    Oct 2006
    Posts
    3
    Looks interesting. Where did you find this? I'd be keen to hear about any progress you've made!

