Scanned PDFs

New Here, Jun 16, 2021

Hi,

I am trying to extract the text from a scanned PDF, but the output is not as expected. Can someone please help me figure out why? The PDF is attached.

Community Expert, Jun 16, 2021

What do you get? What do you expect?

New Here, Jun 16, 2021

I am getting the attached output. No output is produced for pages 6 to 23. We need clear output for all pages in JSON format.

Community Expert, Jun 16, 2021

There is no text on these pages, only images.
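
To confirm which pages came back without text, the extraction output can be checked programmatically. The following is only a rough sketch: the thread does not show the JSON, so the key names "Page" and "Text", the zero-based page index, and the helper name pages_without_text are all assumptions to adapt to the real output.

```python
from typing import Dict, Iterable, List


def pages_without_text(elements: Iterable[Dict], page_count: int) -> List[int]:
    """Return 1-based page numbers that have no non-empty text element.

    Assumes each element is a dict with a zero-based "Page" index and an
    optional "Text" string. Both key names are assumptions, not a confirmed
    schema.
    """
    pages_with_text = set()
    for element in elements:
        page = element.get("Page")
        text = element.get("Text", "")
        # Count a page as having text only if the text is not just whitespace.
        if isinstance(page, int) and isinstance(text, str) and text.strip():
            pages_with_text.add(page + 1)  # assumed 0-based -> 1-based
    return [p for p in range(1, page_count + 1) if p not in pages_with_text]
```

Pages reported by a check like this are the likely image-only pages that would need OCR before any text can be extracted.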

New Here, Jun 16, 2021

There must be some workaround to get the text out of this kind of PDF. I would highly appreciate it if you could suggest how I can get the text.

Community Expert, Jun 16, 2021

You can try OCR in Adobe Acrobat.

New Here, Jun 16, 2021

https://aws.amazon.com/marketplace/pp/prodview-g2ikxe6zxsi64

If I am not mistaken, the Adobe PDF Services API works the same way. The JSON output I shared with you earlier came from the Adobe PDF Services API.

OCR in Adobe Acrobat would be a manual process. How do I integrate it with my Python script? I am really sorry for bothering you, but I really need a solution for this.

Community Expert, Jun 16, 2021
Adobe Employee, Jun 16, 2021

Try splitting the PDF into two: one containing only the scanned pages (e.g. pages 1-5 of that document) and a second PDF that has the non-scanned pages (e.g. pages 6-24).
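
The split suggested above could be planned from a per-page flag list, as in the minimal sketch below. It assumes you already know, for each page, whether text was found; the function name split_ranges and the flag list are hypothetical, and the actual splitting would still be done with whatever PDF tool you use.

```python
from itertools import groupby
from typing import List, Sequence, Tuple


def split_ranges(has_text: Sequence[bool]) -> List[Tuple[bool, int, int]]:
    """Group consecutive pages by whether text was found on them.

    has_text[i] describes page i + 1. Returns tuples of
    (has_text, first_page, last_page) with 1-based inclusive page numbers.
    """
    ranges = []
    page = 1
    for flag, run in groupby(has_text):
        length = sum(1 for _ in run)  # number of pages in this run
        ranges.append((flag, page, page + length - 1))
        page += length
    return ranges
```

Each returned range can then be written out as its own PDF, so that only the image-only ranges are sent through OCR.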
