Thread: PDF -> XML

  1. #1
    SitePoint Wizard
    Join Date
    Jul 2003
    Location
    Corner seat
    Posts
    1,069

    PDF -> XML

    Hi,

    I want to extract a set of data from a PDF file. I've heard that there is a way to convert a PDF into XML and then parse it. If I understand correctly, Oracle provides a tool to do that. Is there any other way to convert a PDF to XML? What I want to do is something like the following: let's say that restaurants.PDF has a list of restaurants, each with an address, phone, email, fax, menu, etc. I want to extract that list, with all of those fields, in a format that I can dump straight into a SQL database.

  2. #2
    Yugo full of anvils bronze trophy hillsy's Avatar
    Join Date
    May 2001
    Location
    :noitacoL
    Posts
    1,859
    Hmmmm. I think you'd be very, very lucky to find a tool that will let you do that. XML data is inherently structured and PDF data is presentation-based (and notoriously unstructured).

    I won't go so far as to say there's no such tool, just in case there's one out there I haven't heard of. But if there is such a tool, I'd love to know about it.

  3. #3
    SitePoint Wizard
    Join Date
    Jan 2001
    Location
    Milton Keynes, UK
    Posts
    1,011
    I would have said the same as hillsy, but what d'ya know...

    http://www.google.com/search?hl=en&i...2pdf+to+xml%22

    You'd probably still have some work to do to extract the individual attributes, though. If the PDFs are in a consistent format, you could probably just write an XSL file to transform the converter's XML output into a more useful XML format.
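
To give a rough idea of that second step, here is a minimal sketch in Python rather than XSL (an XSL stylesheet could do the same reshaping). Everything about the input is an assumption: the `<pdf2xml>`/`<page>`/`<text>` layout, the "Label: value" lines, and the field names are hypothetical, since the actual output depends on which converter you use and how your PDF is laid out. It only uses the standard library (`xml.etree.ElementTree` and `sqlite3`).

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical converter output: one <text> element per line of the PDF.
# A real tool will produce its own element names and attributes.
SAMPLE_XML = """
<pdf2xml>
  <page number="1">
    <text>Name: Luigi's</text>
    <text>Address: 1 High Street</text>
    <text>Phone: 555-0100</text>
    <text>Name: Bombay Palace</text>
    <text>Address: 22 Mill Lane</text>
    <text>Phone: 555-0199</text>
  </page>
</pdf2xml>
""".strip()

# Assumed field labels; extend with email, fax, menu, etc. as needed.
FIELDS = ("name", "address", "phone")


def parse_restaurants(xml_text):
    """Group flat 'Label: value' lines into one dict per restaurant."""
    root = ET.fromstring(xml_text)
    records = []
    current = None
    for node in root.iter("text"):
        line = (node.text or "").strip()
        label, sep, value = line.partition(":")
        key = label.strip().lower()
        if not sep or key not in FIELDS:
            continue  # skip lines that don't look like a known field
        if key == "name":
            # A "Name:" line is assumed to start a new restaurant.
            current = dict.fromkeys(FIELDS, "")
            records.append(current)
        if current is not None:
            current[key] = value.strip()
    return records


def load_into_db(conn, records):
    """Insert the parsed records into a simple restaurants table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS restaurants "
        "(name TEXT, address TEXT, phone TEXT)"
    )
    conn.executemany(
        "INSERT INTO restaurants VALUES (:name, :address, :phone)",
        records,
    )
    conn.commit()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    load_into_db(conn, parse_restaurants(SAMPLE_XML))
    for row in conn.execute("SELECT name, address, phone FROM restaurants"):
        print(row)
```

This only works because the sample lines are consistently labelled. If the PDF has no labels, you would have to fall back on position or ordering, which is where most of the remaining work would be.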

