Scanned PDFs

Report · Jun 16, 2021

Hi,

I am trying to extract the text from a scanned PDF, but the output is not as expected. Can someone please help me figure out why? The PDF is attached.

Report · Jun 16, 2021

What do you get? What do you expect?

Report · Jun 16, 2021

I am getting the attached output. No output is produced for pages 6 to 23. We need clean output for all pages in JSON format.

Report · Jun 16, 2021

There is no text on these pages, only images.
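
One way to confirm this from a script is to check which pages have no extractable text layer. The following is a minimal sketch, assuming the third-party pypdf library is installed; the function names are hypothetical and not part of any Adobe API.

```python
def pages_without_text(page_texts):
    """Return 1-based numbers of pages whose extracted text is empty or whitespace."""
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if not (text or "").strip()
    ]


def extract_page_texts(pdf_path):
    """Read the text layer of every page (assumes pypdf is installed)."""
    from pypdf import PdfReader  # third-party dependency, imported lazily

    reader = PdfReader(pdf_path)
    return [page.extract_text() for page in reader.pages]
```

Calling pages_without_text(extract_page_texts("input.pdf")) on a file like the one described here would be expected to list the image-only pages, which are the ones that need OCR. The path "input.pdf" is a placeholder.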

Report · Jun 16, 2021

There must be some workaround to get the text out of this kind of PDF. I would highly appreciate it if you could suggest how I can get the text.

Report · Jun 16, 2021

You can try OCR in Adobe Acrobat.

Report · Jun 16, 2021

https://aws.amazon.com/marketplace/pp/prodview-g2ikxe6zxsi64

The Adobe PDF Services API also works the same way, if I am not wrong. The JSON output I shared with you earlier came from the Adobe PDF Services API itself.

OCR in Adobe Acrobat would be a manual process. How do I integrate it with my Python script? I am really sorry for bothering you, but I really need a solution for this.

Report · Jun 16, 2021

You can perform OCR on the document:

https://opensource.adobe.com/pdftools-sdk-docs/release/latest/howtos.html#text-recognition-ocr

Report · Jun 16, 2021

Try splitting the PDF into two files: one containing only the scanned pages (e.g. 1-5 of that document) and a second containing the non-scanned pages (e.g. 6-24).
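
The split suggested above could be scripted roughly as follows. This is only a sketch, assuming the third-party pypdf library is installed and that the list of scanned page numbers is already known; the function names and file paths are hypothetical.

```python
def contiguous_ranges(page_numbers):
    """Collapse page numbers into sorted (first, last) ranges for reporting."""
    ranges = []
    for number in sorted(set(page_numbers)):
        if ranges and number == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], number)  # extend the current range
        else:
            ranges.append((number, number))  # start a new range
    return ranges


def split_pdf(src_path, scanned_pages, scanned_out, text_out):
    """Write scanned pages to one PDF and the remaining pages to another."""
    from pypdf import PdfReader, PdfWriter  # third-party, imported lazily

    scanned = set(scanned_pages)  # 1-based page numbers
    scanned_writer, text_writer = PdfWriter(), PdfWriter()
    for number, page in enumerate(PdfReader(src_path).pages, start=1):
        target = scanned_writer if number in scanned else text_writer
        target.add_page(page)
    with open(scanned_out, "wb") as handle:
        scanned_writer.write(handle)
    with open(text_out, "wb") as handle:
        text_writer.write(handle)
```

The scanned-only file can then be sent through OCR before text extraction, while the other file can be extracted directly. contiguous_ranges is just a convenience for logging which page ranges went into each file.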