Text Recognition on PDF is Inaccurate

Hi all. I noticed recently that the text recognition on my resume, saved in PDF format, is inaccurate. I designed the resume in Apple Pages and exported to PDF. However—and possibly because of my font—the text recognition has replaced my t’s with P’s. For example, if you copy-paste some text from my resume, it will read:

“InvesPgate financial mismanagement and regulatory violaPons in the aviaPon, defense, environmental, and healthcare industries.”

Why might this be happening?

When a PDF is made up from rasterised images, rather than actual text, it will use OCR to convert the image to text. OCR can be prone to errors, depending on image quality, fonts or just how good the OCR software is.
But these types of PDF usually come from scanned documents. I would not expect an export from Pages to produce bitmap images, though I have not used it.

I agree, which is what confuses me. You can see for yourself by clicking here.

I don’t see any t’s appearing as p’s…

Did you copy the text and paste into a text editor?
I see the problem, very strange. The text does appear to be actual text.

• InvesPgate financial mismanagement and regulatory violaPons in the aviaPon, defense,
environmental, and healthcare industries. Report to Congress and the President.

No, I missed that. I believe that PDFs can hold both a scanned version and a text/OCR version of the same document. My guess would be that that is where the problem lies, although I don’t know why Quartz would need to create a scanned version.

The "t"s are not getting changed to "P"s

The "ti"s are.
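That observation can be checked mechanically. Below is a minimal Python sketch, assuming the intended sentence is the quoted one with "ti" restored (a reconstruction, not taken from the original résumé file). The helper name collapse_ligature is made up for illustration. One plausible explanation, not confirmed in this thread, is that the font draws "ti" as a single ligature glyph and the PDF maps that glyph to the wrong character when text is copied.

```python
# Assumed original sentence, reconstructed by putting "ti" back where "P" appears.
EXPECTED = (
    "Investigate financial mismanagement and regulatory violations in the "
    "aviation, defense, environmental, and healthcare industries."
)

# Text as it comes out of the PDF when copy-pasted (from the first post).
EXTRACTED = (
    "InvesPgate financial mismanagement and regulatory violaPons in the "
    "aviaPon, defense, environmental, and healthcare industries."
)


def collapse_ligature(text, pair="ti", placeholder="P"):
    """Simulate the faulty extraction: every `pair` becomes one `placeholder`."""
    return text.replace(pair, placeholder)


# True if the "ti -> P" hypothesis explains the pasted text exactly.
hypothesis_holds = collapse_ligature(EXPECTED) == EXTRACTED
print(hypothesis_holds)
```

Going the other way (replacing every "P" with "ti" in the pasted text) would not be safe, since a genuine capital P, as in "President" in the bullet quoted earlier, would be rewritten too.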

Good catch, thanks.
So, now what? :slight_smile:

To be honest, I don’t know. It might be worth trying to “train” the app to do better. Else I guess you could retype the résumé manually.

I’m not sure what you mean on either count. Could you explain?

Sorry, I saw the OCR and missed the Apple Pages. So I thought you had scanned a print copy. If that had been the case, then it might have been possible to improve the recognition or to type the text into a word processing program. Which of course is what Apple Pages is.

I just now looked at my Apple Pages to check the export to PDF but didn’t find it.

Might it be you crafted the résumé using a non-standard keyboard setting that didn’t get “translated” to standard?

Though it seems there would be “funny looking characters” if that was the case.

Try again using a different app?
Writing a good résumé is tough enough, let alone making sure it doesn’t have any words with “ti” in them.

If I import the PDF into Illustrator (as well as complaining about missing fonts) the text comes out as gobbledegook, as if it’s maybe a different character set or something.
Maybe that would explain why it is OCRing actual text. That’s the bit that did not make any sense to me, why would it OCR text?

That’s my position.

As @Mittineague suggests, perhaps try another tool rather than Quartz?

There’s nothing unusual about the keyboard. The font is Calibri Light. To make the PDF, I just exported as PDF.

What’s Quartz, and what do you mean by use a tool other than Quartz?

That’s what has been used to create the PDF - Mac OS X 10.12.3 Quartz PDFContext to give it its full name.

As well as Calibri (normal, bold, light and light-italic) you are also using FontAwesome and LucidaGrande (normal and bold) FWIW.

Ah, I have a strong feeling that is what’s causing issues.
Maybe a kind of “see any image-text character, treat all text as an image” thing.

This is what that looks like, in case it gives any clue to anyone:-

LucidaGrande must be hiding somewhere, because none of my text is actually LucidaGrande. Any idea how to delete that font if I can’t locate it in the text?

Should Illustrator be able to import this?

As a test, I highlighted all text and turned it all into Calibri. Then I copy-pasted from PDF again. This is what happened:
“InvesXgate financial mismanagement and regulatory violaXons in the aviaXon, defense,
environmental, and healthcare industries.”
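Since the stand-in character changed from "P" to "X" after switching fonts, it may help to find the substitution automatically rather than guess at it. Here is a rough Python sketch using difflib from the standard library. It assumes you have the intended sentence to compare against (reconstructed here with "ti" restored), and the function name find_substitutions is made up for illustration. The line break in the pasted text is written as a space.

```python
from difflib import SequenceMatcher

# Assumed original sentence (reconstruction, with "ti" restored).
EXPECTED = (
    "Investigate financial mismanagement and regulatory violations in the "
    "aviation, defense, environmental, and healthcare industries."
)

# Pasted text after changing everything to Calibri (from the post above).
EXTRACTED_CALIBRI = (
    "InvesXgate financial mismanagement and regulatory violaXons in the "
    "aviaXon, defense, environmental, and healthcare industries."
)


def find_substitutions(expected, extracted):
    """Return the set of (expected_fragment, extracted_fragment) pairs that differ."""
    matcher = SequenceMatcher(None, expected, extracted, autojunk=False)
    substitutions = set()
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Record what the text should say and what came out instead.
            substitutions.add((expected[i1:i2], extracted[j1:j2]))
    return substitutions


print(find_substitutions(EXPECTED, EXTRACTED_CALIBRI))
```

If the only pair reported is "ti" against a single character, that would suggest the text itself is intact and only the character mapping for that one glyph is off, which fits the fact that the stand-in character changes with the font.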