  1. #1

    Using PDFBox to Extract OCR Text from PDFs in .NET

    After a lot of fussing around, I finally got PDFBox working in .NET. My adventure is documented here.

    My goal was to extract text from the OCR layer of several PDFs. The extraction works, but for some reason the output is full of special characters rather than "normal" characters.
    Here's an example of the kind of text I got:

    ( ) Minor Change No. M~----
    ~ 1"0 egol/t.l>~ PE!G1Tt()~ ~~ DIAf'rt()f\6MS foR.. I IT ú'11/GINZEI..Z./)
    Dlt\Pit(lMM Vt\LV1:::S, D~I&IAIALL't SvPPZ..<~C> BY G R!NNúLL
    o rz: we:.> Tl N&tto u<7e.

    Could this be an encoding issue? Is there a PDFBox property that I'm forgetting? Maybe this is typical output from an OCR doc...
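
One rough way to put a number on "a huge number of special characters" is to measure what share of the visible characters are neither letters nor digits. The sketch below is only an illustration: `symbol_ratio` is a hypothetical helper, not part of PDFBox, and a high ratio only shows that the text layer is noisy. It cannot by itself tell an encoding problem apart from poor OCR.

```python
def symbol_ratio(text: str) -> float:
    """Return the fraction of non-whitespace characters that are
    neither letters nor digits (hypothetical noise measure)."""
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return 0.0
    symbols = sum(1 for ch in visible if not ch.isalnum())
    return symbols / len(visible)


# One line of the extracted output quoted above, copied verbatim.
ocr_sample = r"Dlt\Pit(lMM Vt\LV1:::S, D~I&IAIALL't SvPPZ..<~C> BY G R!NNúLL"

# An ordinary English sentence for comparison.
clean_sample = "My goal was to extract text from the OCR layer in several PDFs."

if __name__ == "__main__":
    print(f"OCR sample:   {symbol_ratio(ocr_sample):.2f}")
    print(f"Clean sample: {symbol_ratio(clean_sample):.2f}")
```

Running the same measure over each extracted page would show whether the noise is uniform or confined to a few badly scanned pages.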

  2. #2
    wwb_99
    About par for the OCR course...

