
PDF Extract - Tables has poor OCR results

Community Beginner, Jul 24, 2023

Hi,

 

Quite a number of tables seem to be extracted using OCR. Unfortunately, this results in extraction errors, even when the document itself contains text and not scanned pages. The documents are in German.


Common Mistakes

- Missing spaces, e.g. "7bis 14 Tage", "eskannunter".

- Accented characters lose their accents, e.g. Ü or Ö might become U or O.

- Superscripts are often used in tables to refer to a key. These may be a list of numbers, letters or special symbols and may be separated by commas.

- Numbers with special characters are often converted to all digits, e.g. >169 is extracted as 4169.
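The first kind of mistake can be caught automatically after extraction. Below is a minimal sketch in plain Python, assuming the extracted cell text is already available as a string. The function name `flag_glued_tokens` is made up for illustration and is not part of any PDF library. It only flags tokens where a digit is directly followed by a letter, so it will also flag legitimate tokens such as "5mg" and will not catch glued words like "eskannunter".

```python
import re

# A digit directly followed by a letter (any Unicode letter, so umlauts count).
# [^\W\d_] means "a word character that is neither a digit nor an underscore".
DIGIT_THEN_LETTER = re.compile(r"\d[^\W\d_]")


def flag_glued_tokens(text):
    """Return whitespace-separated tokens that look like a number glued to a word."""
    return [tok for tok in text.split() if DIGIT_THEN_LETTER.search(tok)]


if __name__ == "__main__":
    print(flag_glued_tokens("7bis 14 Tage"))
```

This is only a review aid for building a list of suspicious cells; it does not repair the text.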


This all adds up to a lack of confidence in table extraction. Are there any options to set the language? How is the OCR done? Can any settings be applied?

Kind Regards,
Jono

Community guidelines
Be kind and respectful, give credit to the original source of content, and search for duplicates before posting. Learn more
Community Beginner, Jul 24, 2023

Please see the sample document and extracted results. I just noticed that if I cut and paste from Acrobat Reader, I see the same problems. Hmm, maybe something is missing in the PDF, such as fonts. Very odd.

Community Beginner, Jul 24, 2023

So, contrary to my original post, I am guessing this is not an OCR issue.

This is the font in which the > symbol is being replaced, both on cut and paste and in the text extracted by the API.

{'alt_family_name': 'Adv Pi', 'embedded': True, 'encoding': 'WinAnsiEncoding', 'family_name': 'Adv Pi 1', 'font_type': 'Type1', 'italic': False, 'monospaced': False, 'name': 'HPCHMK+AdvPi1', 'subset': True, 'weight': 400}


The PDF appears to have the fonts embedded.

Font List
['/HPCHGJ+DigiHolsatia-Halbfett',
'/HPCHGK+DigiHolsatia-Mager',
'/HPCHHL+DigiHolsatia-Normal',
'/HPCHMK+AdvPi1',
'/HPCIFI+Symbol']

Unembedded Fonts
[]
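One possible workaround, sketched below in plain Python, is to repair the text per font after extraction. It rests on several assumptions: that the extraction output can be split into runs of (font name, text), that the symbol font is the only one affected, and that in AdvPi1 the character extracted as "4" is really ">" (inferred only from the >169 / 4169 example above, not from any font documentation). `PI_FONT_FIXES`, `strip_subset_prefix` and `fix_runs` are hypothetical names, not part of any API.

```python
# Hypothetical per-font substitution table.
# ASSUMPTION: in AdvPi1 the extracted "4" stands for ">" (guessed from ">169" -> "4169").
PI_FONT_FIXES = {
    "AdvPi1": {"4": ">"},
}


def strip_subset_prefix(font_name):
    """Drop a leading '/' and the six-letter subset tag: '/HPCHMK+AdvPi1' -> 'AdvPi1'."""
    name = font_name.lstrip("/")
    prefix, sep, rest = name.partition("+")
    if sep and len(prefix) == 6 and prefix.isalpha() and prefix.isupper():
        return rest
    return name


def fix_runs(runs):
    """Join (font_name, text) runs, remapping characters for fonts in PI_FONT_FIXES."""
    parts = []
    for font_name, text in runs:
        table = PI_FONT_FIXES.get(strip_subset_prefix(font_name), {})
        parts.append(text.translate(str.maketrans(table)))
    return "".join(parts)


if __name__ == "__main__":
    runs = [("/HPCHMK+AdvPi1", "4"), ("/HPCHHL+DigiHolsatia-Normal", "169")]
    print(fix_runs(runs))
```

The table has to be built by hand from known-bad samples, and it only helps if the font of each piece of text can be recovered from the extraction output.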

 

Does anyone have any ideas?

Thanks!

Community Beginner, Jul 27, 2023

Does anyone have any insights? It would be much appreciated, as I am stuck.

Explorer, Jan 22, 2024

Hi @Jonathan24251823bakt,

As far as I have read, the OCR was tuned on English, so there is a chance that German doesn't work as well.
