PDF Extract - Tables have poor OCR results

July 24, 2023

Hi,

Quite a number of tables seem to be extracted using OCR. Unfortunately, this results in extraction errors, even when the document itself contains text rather than scanned pages. The documents are in German.

Common Mistakes

- Missing spaces, e.g. 7bis 14 Tage, eskannunter

- Accented characters lose their diacritics, e.g. Ü or Ö might become U or O.

- Superscripts are often used in tables to refer to a key. These may be a list of numbers, letters or special symbols, and may be separated by commas.

- Numbers with special characters are often converted to all digits, e.g. >169 becomes 4169.

This all adds up to a lack of confidence in table extraction. Are there any options to set the language? How is the OCR done? Can settings be applied?

Kind Regards,
Jono

This topic has been closed for replies.

Jonathan24251823bakt (Author)

Participant

Please see the sample document and extracted results. I just noticed that if I cut and paste from Acrobat Reader, I see the same problems. Hmm, maybe something is missing in the PDF, fonts perhaps... very odd.

Ciprofloxacin.pdf

structuredData.zip

Jonathan24251823bakt (Author)

Participant

So, contrary to my original post, I am guessing this is not an OCR issue.

This is the font where the > symbol is being replaced, both on cut and paste and in the extracted text from the API.

{'alt_family_name': 'Adv Pi', 'embedded': True, 'encoding': 'WinAnsiEncoding', 'family_name': 'Adv Pi 1', 'font_type': 'Type1', 'italic': False, 'monospaced': False, 'name': 'HPCHMK+AdvPi1', 'subset': True, 'weight': 400}
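To see how widespread this is, the extracted JSON could be scanned for every text element set in that font. The following is only a rough sketch: it assumes structuredData.json has a top-level "elements" list whose items may carry "Text", "Path" and a "Font" record like the one above. The function name `find_suspect_runs` and all sample values are made up for illustration. It also only catches elements whose reported font is the suspect one, so a single glyph inside a longer run could be missed.

```python
# Hypothetical sample in the shape assumed for structuredData.json.
# The Path, Text and family_name values are invented for illustration only.
SAMPLE = {
    "elements": [
        {
            "Path": "//Document/Table/TR/TD/P",
            "Text": "4",
            "Font": {"family_name": "Adv Pi 1", "name": "HPCHMK+AdvPi1"},
        },
        {
            "Path": "//Document/Table/TR/TD[2]/P",
            "Text": "7 bis 14 Tage",
            "Font": {"family_name": "DigiHolsatia", "name": "HPCHHL+DigiHolsatia-Normal"},
        },
        {"Path": "//Document/Figure"},  # elements without Text or Font are skipped
    ]
}


def find_suspect_runs(structured_data, suspect_families=("Adv Pi 1",)):
    """Return (path, font name, text) for text elements set in a suspect font."""
    hits = []
    for element in structured_data.get("elements", []):
        font = element.get("Font") or {}
        text = element.get("Text")
        if text is not None and font.get("family_name") in suspect_families:
            hits.append((element.get("Path"), font.get("name"), text))
    return hits


for path, font_name, text in find_suspect_runs(SAMPLE):
    print(f"{path}: {text!r} uses {font_name}")
```

For a real file, the dictionary would come from `json.load` on the extracted structuredData.json instead of the inline sample.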

The PDF appears to have the fonts embedded.

Font List
['/HPCHGJ+DigiHolsatia-Halbfett',
'/HPCHGK+DigiHolsatia-Mager',
'/HPCHHL+DigiHolsatia-Normal',
'/HPCHMK+AdvPi1',
'/HPCIFI+Symbol']

Unembedded Fonts
[]
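The six uppercase letters and the plus sign in front of each name are the PDF convention for an embedded font subset. Embedding on its own does not guarantee that glyphs map back to the intended characters, so symbol-style fonts are worth a closer look. Below is a small sketch that splits the subset tag from the base name and flags names that look like pi or symbol fonts. The helper names and the name-matching pattern are my own guesses, not anything from the API, and the pattern will miss or over-match some fonts.

```python
import re

# Subset fonts are named "<six uppercase letters>+<base name>".
SUBSET_TAG = re.compile(r"^/?([A-Z]{6})\+(.+)$")

# Assumed naming hints for symbol-style fonts; purely a heuristic.
SYMBOL_HINT = re.compile(r"Pi\d*$|[Ss]ymbol|[Dd]ingbat")

FONTS = [
    "/HPCHGJ+DigiHolsatia-Halbfett",
    "/HPCHGK+DigiHolsatia-Mager",
    "/HPCHHL+DigiHolsatia-Normal",
    "/HPCHMK+AdvPi1",
    "/HPCIFI+Symbol",
]


def split_subset_name(font_name):
    """Return (subset_tag, base_name); subset_tag is None for non-subset fonts."""
    match = SUBSET_TAG.match(font_name)
    if match:
        return match.group(1), match.group(2)
    return None, font_name.lstrip("/")


def likely_symbol_fonts(font_names):
    """Return base names whose spelling suggests a pi or symbol font."""
    flagged = []
    for name in font_names:
        _, base = split_subset_name(name)
        if SYMBOL_HINT.search(base):
            flagged.append(base)
    return flagged


print(likely_symbol_fonts(FONTS))
```

On the font list above this flags AdvPi1 and Symbol, which matches the font where the > symbol goes wrong.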

Does anyone have any ideas?

Thanks!

Jonathan24251823bakt (Author)

Participant

Does anyone have any insights? It would be much appreciated, as I am stuck.
