PDF Extract - Tables have poor OCR results

July 24, 2023

Hi,

Quite a number of tables seem to be extracted using OCR. Unfortunately, this results in extraction errors, even when the document itself contains text rather than scanned pages. The documents are in German.

Common Mistakes

- Missing spaces, e.g. 7bis 14 Tage, eskannunter

- Characters converted without their accents, e.g. Ü or Ö might become U or O.

- Superscripts are often used in tables to refer to a key; these may be a list of numbers, letters or special symbols, and may be separated by commas.

- Numbers with special characters are often converted to all digits, e.g. >169 becomes 4169.
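Errors like the glued-together "7bis" can at least be flagged automatically after extraction. The following is a minimal sketch of such a check, not a feature of the Extract API; the function name and the three-letter threshold are assumptions chosen for illustration, and it will not catch cases like eskannunter or dropped umlauts.

```python
import re

# Heuristic (an assumption, not part of any Adobe tooling): a run of digits
# immediately followed by three or more letters usually means a space was lost,
# e.g. "7bis". Two-letter suffixes such as "mg" are deliberately ignored.
GLUED_TOKEN = re.compile(r"\d+[A-Za-zÄÖÜäöüß]{3,}")


def find_glued_tokens(cell_text: str) -> list[str]:
    """Return substrings of a table cell that look like 'digit+word' with no space."""
    return GLUED_TOKEN.findall(cell_text)
```

Running this over every extracted table cell would give a rough count of suspect cells per document, which may help decide which files need manual review.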

This all adds up to a lack of confidence in table extraction. Are there any options to set the language? How is the OCR done? Can settings be applied?

Kind Regards,
Jono


Jonathan24251823bakt (Author, Participant)

Please see the sample document and extracted results. I just noticed that if I cut and paste from Acrobat Reader, I see the same problems. Hmm, maybe something is missing in the PDF, perhaps the fonts. Very odd.

Ciprofloxacin.pdf

structuredData.zip

Jonathan24251823bakt (Author, Participant)

So, contrary to my original post, I am guessing this is not an OCR issue.

This is the font in which the > symbol is being replaced, both on cut and paste and in the text extracted by the API.

{'alt_family_name': 'Adv Pi', 'embedded': True, 'encoding': 'WinAnsiEncoding', 'family_name': 'Adv Pi 1', 'font_type': 'Type1', 'italic': False, 'monospaced': False, 'name': 'HPCHMK+AdvPi1', 'subset': True, 'weight': 400}

The PDF appears to have the fonts embedded.

Font List
['/HPCHGJ+DigiHolsatia-Halbfett',
'/HPCHGK+DigiHolsatia-Mager',
'/HPCHHL+DigiHolsatia-Normal',
'/HPCHMK+AdvPi1',
'/HPCIFI+Symbol']

Unembedded Fonts
[]
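To see which extracted text runs actually use the suspect font, the extraction output can be scanned directly. This is a rough sketch, assuming the JSON inside structuredData.zip has a top-level "elements" list whose entries carry "Font", "Text" and "Path" keys (the font dictionary above has that shape); the function name is made up for illustration.

```python
def runs_in_fonts(structured: dict, font_names) -> list[tuple[str, str]]:
    """List (path, text) pairs for elements whose font name is in font_names.

    `structured` is the parsed extraction JSON. The key names used here
    ("elements", "Font", "Text", "Path") are assumptions about its layout.
    """
    wanted = set(font_names)
    hits = []
    for element in structured.get("elements", []):
        font = element.get("Font") or {}
        # Only keep elements that both use a wanted font and carry text.
        if font.get("name") in wanted and "Text" in element:
            hits.append((element.get("Path", ""), element["Text"]))
    return hits
```

For example, after loading the file with json.load, calling runs_in_fonts(data, ["HPCHMK+AdvPi1"]) would list the runs set in that font, so they can be compared against what the PDF shows on screen.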

Does anyone have any ideas?

Thanks!

Jonathan24251823bakt (Author, Participant)

Does anyone have any insights? It would be much appreciated, as I am stuck.
