Using PDFBox to Extract OCR Text from PDFs in .NET

SharpBarb · December 14, 2011, 5:28am

After a lot of fussing around, I finally got PDFBox working in .NET. My adventure is documented here.

My goal was to extract text from the OCR layer of several PDFs. The extraction itself works, but for some reason the output is full of special characters rather than “normal” ones.
Here’s an example of the type of text I got:

( ) Minor Change No. M~----
~ 1"0 egol/t.l>~ PE:D!G1Tt()~ ~~ DIAf’rt()f\6MS foR… I IT £’11/GINZEI…Z./)
Dlt\Pit(lMM Vt\LV1:::S, D~I&IAIALL’t SvPPZ…<~C> BY G R!NN£LL
o rz: we:.> Tl N&tto u<7e.

Could this be an encoding issue? Is there a PDFBox property that I’m forgetting to set? Or maybe this is just typical output from an OCR'd document…

wwb_99 · December 14, 2011, 1:26pm

About par for the OCR course . . .
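
One rough way to tell OCR noise from an encoding problem is to measure the garbage rather than eyeball it. The sketch below is in Java, since PDFBox is a Java library. `TextNoise`, `symbolRatio`, and `nonAsciiRatio` are hypothetical names for illustration, not part of PDFBox. The interpretation is only a heuristic assumption: poor OCR tends to show up as a high share of ASCII punctuation mixed with recognizable word fragments, whereas a wrong encoding more often shows up as a high share of non-ASCII characters.

```java
public final class TextNoise {
    private TextNoise() {}

    /** Fraction of non-whitespace characters that are neither letters nor digits. */
    public static double symbolRatio(String text) {
        int visible = 0;
        int symbols = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.isWhitespace(cp)) {
                continue; // whitespace is ignored in both counts
            }
            visible++;
            if (!Character.isLetterOrDigit(cp)) {
                symbols++;
            }
        }
        return visible == 0 ? 0.0 : (double) symbols / visible;
    }

    /** Fraction of non-whitespace characters outside 7-bit ASCII. */
    public static double nonAsciiRatio(String text) {
        int visible = 0;
        int nonAscii = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.isWhitespace(cp)) {
                continue;
            }
            visible++;
            if (cp > 127) {
                nonAscii++;
            }
        }
        return visible == 0 ? 0.0 : (double) nonAscii / visible;
    }

    public static void main(String[] args) {
        // One line taken from the sample output posted above.
        String sample = "o rz: we:.> Tl N&tto u<7e.";
        System.out.println("symbol ratio:    " + symbolRatio(sample));
        System.out.println("non-ASCII ratio: " + nonAsciiRatio(sample));
    }
}
```

Pass it the string your text extraction returns. No particular cutoff is implied here, so compare the numbers against a page you know extracted cleanly.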