
PDF Extract - Tables has poor OCR results

Community Beginner, Jul 24, 2023


Hi,

Quite a number of tables seem to be extracted using OCR. Unfortunately, this results in extraction errors, even when the document itself contains text and not scanned pages. These documents are in German.


Common Mistakes

- Missing spaces, e.g. 7bis 14 Tage, eskannunter

- Characters converted without their accents, e.g. Ü or Ö might become U or O.

- Superscripts are often used in tables to refer to a key. These may be a list of numbers, letters, or special symbols, and may be separated by commas.

- Numbers with special characters, e.g. >169 extracted as 4169, are often converted to all digits.


This all adds up to a lack of confidence in table extraction. Are there any options to set the language? How is the OCR done? Can settings be applied?

Kind Regards,
Jono
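
A rough way to surface the first two symptoms from the post above is to scan the extracted table cells after extraction. The sketch below is only a heuristic under stated assumptions: the function names, the regular expression, and the 200-letter threshold are illustrative and are not part of the PDF Extract API. It will also flag legitimate strings such as "5mg".

```python
import re

# Assumption: a digit directly followed by two or more letters ("7bis")
# hints at a lost space. Legitimate tokens like "5mg" match as well.
GLUED = re.compile(r"\d[A-Za-zÄÖÜäöüß]{2,}")
UMLAUTS = set("ÄÖÜäöüß")


def glued_tokens(cell: str) -> list[str]:
    """Return substrings where a digit runs straight into a word."""
    return GLUED.findall(cell)


def has_no_umlauts(cells: list[str], min_letters: int = 200) -> bool:
    """True if a sizable amount of German text contains no umlaut or ß at all."""
    text = "".join(cells)
    letters = sum(ch.isalpha() for ch in text)
    return letters >= min_letters and not any(ch in UMLAUTS for ch in text)
```

Running `glued_tokens` over each cell and `has_no_umlauts` over a whole table gives a short list of cells worth checking by hand. It does not repair anything.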

Community Beginner, Jul 24, 2023


Please see the sample document and extracted results. I just noticed that if I copy and paste from Acrobat Reader, I see the same problems. Hmm, maybe something is missing in the PDF, such as fonts. Very odd.

Community Beginner, Jul 24, 2023


So, contrary to my original post, I am guessing this is not an OCR issue.

This is the font in which the > symbol is being replaced, both on copy and paste and in the text extracted by the API.

{'alt_family_name': 'Adv Pi', 'embedded': True, 'encoding': 'WinAnsiEncoding', 'family_name': 'Adv Pi 1', 'font_type': 'Type1', 'italic': False, 'monospaced': False, 'name': 'HPCHMK+AdvPi1', 'subset': True, 'weight': 400}


The PDF appears to have the fonts embedded.

Font List
['/HPCHGJ+DigiHolsatia-Halbfett',
'/HPCHGK+DigiHolsatia-Mager',
'/HPCHHL+DigiHolsatia-Normal',
'/HPCHMK+AdvPi1',
'/HPCIFI+Symbol']

Unembedded Fonts
[]

 

Does anyone have any ideas?

Thanks!
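
The font record quoted above pairs a pi (symbol) font with `WinAnsiEncoding`. One possible explanation, not confirmed anywhere in this thread, is that the glyph drawn as > sits at a character code that the encoding maps to a different character, so both copy and paste and the API return that other character. Below is a minimal sketch for spotting such fonts in records shaped like the dictionary above. The hint words and the function name are assumptions for illustration, and a name-based guess is no substitute for inspecting the font's actual character mapping.

```python
# Hypothetical hint words: family names containing one of these are
# treated as symbol fonts. Adjust the set for your own documents.
SYMBOL_HINTS = {"pi", "symbol", "dingbats", "wingdings"}


def looks_like_symbol_font(font: dict) -> bool:
    """Guess whether a font is a symbol/pi font declared with a text encoding."""
    words = font.get("family_name", "").lower().replace("-", " ").split()
    is_symbolic = any(word in SYMBOL_HINTS for word in words)
    return is_symbolic and font.get("encoding") == "WinAnsiEncoding"


# Record copied from the post above (only the fields the check reads).
adv_pi = {"family_name": "Adv Pi 1", "encoding": "WinAnsiEncoding"}
suspect = looks_like_symbol_font(adv_pi)
```

Text runs set in a flagged font could then be treated as unreliable and compared against the rendered page.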

Community Beginner, Jul 27, 2023


Does anyone have any insights? It would be much appreciated, as I am stuck.

Explorer, Jan 22, 2024


Hi @Jonathan24251823bakt,

As far as I have read, the OCR was tuned on English, so there is a chance that German doesn't work as well.
