After a lot of fussing around, I finally got PDFBox working in .NET. My adventure is documented here.
My goal was to extract text from the OCR layer of several PDFs. The extraction runs, but for some reason the output is full of special characters rather than “normal” characters.
Here’s an example of the type of text I got:
( ) Minor Change No. M~----
~ 1"0 egol/t.l>~ PE:D!G1Tt()~ ~~ DIAf’rt()f\6MS foR… I IT £’11/GINZEI…Z./)
Dlt\Pit(lMM Vt\LV1:::S, D~I&IAIALL’t SvPPZ…<~C> BY G R!NN£LL
o rz: we:.> Tl N&tto u<7e.
Could this be an encoding issue? Is there a PDFBox property that I’m forgetting to set? Or maybe this is just typical output from an OCR’d document…
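
To put a rough number on how noisy the extracted text is, something like the sketch below might help when comparing one PDF against another. It uses only the Java standard library and does not call PDFBox at all. The class name `NoiseCheck` and the method `specialCharRatio` are made up for illustration, and "special" here is simply assumed to mean any non-whitespace character that is neither a letter nor a digit.

```java
public class NoiseCheck {

    /**
     * Returns the fraction of non-whitespace characters in the text that are
     * neither letters nor digits. Returns 0.0 for empty or all-whitespace input.
     */
    public static double specialCharRatio(String text) {
        int total = 0;
        int special = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);      // read a full Unicode code point
            i += Character.charCount(cp);      // advance by 1 or 2 chars
            if (Character.isWhitespace(cp)) {
                continue;                      // whitespace is not counted at all
            }
            total++;
            if (!Character.isLetterOrDigit(cp)) {
                special++;
            }
        }
        return total == 0 ? 0.0 : (double) special / total;
    }

    public static void main(String[] args) {
        // One line of the extracted text from the example above.
        String sample = "( ) Minor Change No. M~----";
        System.out.printf("special-character ratio: %.2f%n", specialCharRatio(sample));
    }
}
```

A high ratio on its own does not say whether the cause is an encoding problem or a poor OCR pass, but a consistently high value across every page would at least suggest the problem is not limited to one bad scan.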