How can I forensically identify if an OCR process has been applied to a PDF?

Aug 25, 2022

Hi Community,

I am trying to identify if a PDF file has undergone an OCR process.

Scenario:

The questioned PDF document is a certificate made up of image parts such as the signature block, crest, and portions of a border. The text in the document is editable but contains garbled words, similar to what you see when an OCR process doesn't identify the characters properly. It seems too obvious to be fraud, but in the cases I receive it is plausible.

Usually, if the document is an image, the image undergoes an OCR process. This is easy to identify because the base document is an image: you can see it in the "Content" tool, or select the image and download it, etc.

Two questions I need to answer:

1. Is it possible that an OCR process applied to a scanned-image PDF segments the image into portions (signature block, crest, etc.), recognises the text, and then discards most of the segmented images, leaving only the signature block, the crest, and garbled text where recognition failed?

2. Is there a way of examining the internal structure or internal code to identify whether an OCR process has occurred?

Aug 25, 2022

Hello @Ben_FDE_2022,

There are a couple of places you can look to see if a scanned document has been edited.

1. File > Properties > Description > Additional Metadata > Advanced > XMP Media Management Properties > xmpMM:History...

2. Edit > Preflight > Options > Browse Internal PDF Structure...
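
If you would rather script a first pass over both places than click through the menus, here is a rough Python sketch using only the standard library. It scans the raw file for XMP packets and pulls out the `stEvt:softwareAgent` and `stEvt:action` values that history entries carry, then inflates Flate-compressed streams and counts occurrences of `3 Tr` (text rendering mode 3, invisible text), which is a common way OCR tools place a hidden text layer over a scanned image. The function names `xmp_values` and `ocr_indicators` are made up for illustration. Treat the result as a hint only: history entries are optional and can be stripped, metadata may sit inside compressed or encrypted objects that this raw scan will not see, and visible OCR text (as in your certificate) would not use mode 3 at all.

```python
import re
import zlib

# Heuristic patterns over the raw bytes of a PDF file (not a real PDF parser).
XMP_PACKET = re.compile(rb"<x:xmpmeta.*?</x:xmpmeta>", re.DOTALL)
STREAM = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)
INVISIBLE_TEXT = re.compile(rb"(?<![\d.])3\s+Tr\b")  # text rendering mode 3


def xmp_values(packet, name):
    """Return values of an XMP property written as an attribute or an element."""
    tag = re.escape(name)
    as_attribute = re.findall(tag + r'\s*=\s*"([^"]*)"', packet)
    as_element = re.findall("<" + tag + r">(.*?)</" + tag + ">", packet, re.DOTALL)
    return as_attribute + as_element


def ocr_indicators(pdf_bytes):
    """Collect XMP history hints and count invisible-text operators (hypothetical helper)."""
    agents, actions = [], []
    for match in XMP_PACKET.finditer(pdf_bytes):
        packet = match.group(0).decode("utf-8", errors="replace")
        agents += xmp_values(packet, "stEvt:softwareAgent")
        actions += xmp_values(packet, "stEvt:action")

    invisible = 0
    for match in STREAM.finditer(pdf_bytes):
        raw = match.group(1)
        try:
            # decompressobj tolerates trailing bytes after the compressed data
            data = zlib.decompressobj().decompress(raw)
        except zlib.error:
            data = raw  # not Flate-compressed (or damaged): scan it as-is
        invisible += len(INVISIBLE_TEXT.findall(data))

    return {
        "software_agents": agents,
        "history_actions": actions,
        "invisible_text_ops": invisible,
    }


if __name__ == "__main__":
    import sys

    with open(sys.argv[1], "rb") as handle:
        print(ocr_indicators(handle.read()))
```

Run it as `python script.py questioned.pdf` (the script and file names are placeholders). A software agent naming an OCR or scanning product, or a non-zero invisible-text count, is worth following up in the Preflight structure browser; an empty result does not rule OCR out.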

Hope this helps!

Regards,

Mike