  1. #1
    SitePoint Member
    Join Date
    Feb 2005
    Posts
    1

    Extract text from PDF

    Hi there,

    I am currently working on an application for the Semantic Web and I am doing all prototyping in PHP. I am trying to index PDF documents, but the only (free) way I have found to get the text out of a PDF is pdftohtml, which is a command-line application. This is rather awkward and not very efficient, and I would really appreciate it if somebody knew of a PHP class/package that could do the job. I don't need the layout or images, just the plain text.

    Thanks a lot,
    B.

  2. #2
    SitePoint Enthusiast lucius910's Avatar
    Join Date
    Jul 2004
    Location
    Providence, RI
    Posts
    34
    Try http://us4.php.net/pdf. In the comments at the bottom, the functions posted 04-Feb-2005 02:44 work pretty well. They extract the plain text rather nicely, but can be a bit buggy if you're trying to grab a PDF that uses a lot of images.

    Lucas
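
If the command-line route ends up being the fallback, the awkwardness can at least be hidden behind a small wrapper. Below is a minimal sketch, not a tested solution. It assumes the pdftotext binary (a sibling of pdftohtml in the Xpdf/Poppler tools) is installed and on the PATH, which the thread does not confirm. The function names buildPdfToTextCommand and extractPdfText are hypothetical.

```php
<?php
// Sketch only: assumes the `pdftotext` binary (Xpdf/Poppler) is on the PATH.
// Both function names are hypothetical helpers, not part of any library.

function buildPdfToTextCommand(string $pdfPath): string
{
    // escapeshellarg() quotes the path so spaces and shell metacharacters are safe.
    // A trailing "-" as the output file makes pdftotext write to stdout.
    return 'pdftotext -enc UTF-8 ' . escapeshellarg($pdfPath) . ' -';
}

function extractPdfText(string $pdfPath): string
{
    if (!is_readable($pdfPath)) {
        throw new InvalidArgumentException("Cannot read file: $pdfPath");
    }

    // exec() fills $outputLines with one entry per line of stdout
    // and sets $exitCode to the exit status of the command.
    exec(buildPdfToTextCommand($pdfPath), $outputLines, $exitCode);

    if ($exitCode !== 0) {
        throw new RuntimeException("pdftotext failed with exit code $exitCode");
    }

    return implode("\n", $outputLines);
}
```

Calling extractPdfText('paper.pdf') would then return the plain text as a single string ready for indexing. This still spawns one process per document, so it does not address the efficiency concern, only the awkwardness.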

