Participant
July 24, 2023
Question

PDF Extract - Tables has poor OCR results


Hi,

Quite a number of tables seem to be extracted using OCR. Unfortunately, this results in extraction errors, even when the document itself contains text rather than scanned pages. These documents are in German.


Common Mistakes

- Missing spaces, e.g. 7bis 14 Tage, eskannunter

- Accented characters lose their accents, e.g. Ü or Ö might become U or O.

- Superscripts are often used in tables to refer to a key. These may be a list of numbers, letters or special symbols, and may be separated by commas.

- Numbers with special characters are often converted to all digits, e.g. >169 becomes 4169.
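
The missing-space symptom (e.g. 7bis) can be flagged automatically in extracted cell text. Below is a minimal Python sketch, not part of the Extract API. The function name `find_suspect_tokens` and the regular expression are assumptions for illustration. The heuristic only catches digits fused to letters, so it will miss cases like eskannunter and may give false positives on things like product codes.

```python
import re
from typing import List

# Heuristic (an assumption, not an API feature): a digit run directly
# followed by letters, such as "7bis", often indicates a dropped space.
# German umlauts and ß are included in the letter class.
FUSED_DIGIT_LETTER = re.compile(r"\b\d+[A-Za-zÄÖÜäöüß]+\b")


def find_suspect_tokens(text: str) -> List[str]:
    """Return tokens in which digits run straight into letters."""
    return FUSED_DIGIT_LETTER.findall(text)


if __name__ == "__main__":
    # Example cell text taken from the list above.
    print(find_suspect_tokens("7bis 14 Tage"))
```

Cells that return a non-empty list could then be queued for manual review rather than trusted as-is.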


This all adds up to a lack of confidence in table extraction. Are there any options to set the language? How is the OCR done? Can settings be applied?

Kind Regards,
Jono

    1 reply

    Participant
    July 24, 2023

    Please see the sample document and extracted results. I just noticed that if I cut and paste from Acrobat Reader, I see the same problems. Hmm, maybe something is missing in the PDF, such as fonts. Very odd.

    Participant
    July 24, 2023

    So, contrary to my original post, I am guessing this is not an OCR issue.

    This is the font where the > symbol is being replaced on cut and paste and in the extracted text from the API.

    {'alt_family_name': 'Adv Pi', 'embedded': True, 'encoding': 'WinAnsiEncoding', 'family_name': 'Adv Pi 1', 'font_type': 'Type1', 'italic': False, 'monospaced': False, 'name': 'HPCHMK+AdvPi1', 'subset': True, 'weight': 400}
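
One possible explanation, offered as a guess rather than a confirmed diagnosis: a symbol ("Pi") font that declares a Latin text encoding such as WinAnsiEncoding stores each symbol under an ordinary character code. Unless the PDF also carries a ToUnicode mapping for that font, text extraction and copy and paste return the character for that code (here apparently 4) instead of the glyph drawn on the page (>). The Python sketch below flags such fonts from font dictionaries shaped like the one above. The helper name `looks_like_symbol_font`, the keyword set and the encoding set are assumptions for illustration.

```python
from typing import Any, Dict

# Assumed keyword and encoding sets; extend them for your own documents.
SYMBOL_WORDS = {"pi", "symbol", "dingbats", "wingdings"}
LATIN_ENCODINGS = {"WinAnsiEncoding", "MacRomanEncoding", "StandardEncoding"}


def looks_like_symbol_font(font: Dict[str, Any]) -> bool:
    """True if the family name suggests symbols but the encoding is Latin text."""
    family = str(font.get("family_name", "")).lower()
    words = family.replace("-", " ").split()
    has_symbol_word = any(word in SYMBOL_WORDS for word in words)
    return has_symbol_word and font.get("encoding") in LATIN_ENCODINGS


# Font dictionary copied from the post above.
font = {
    "alt_family_name": "Adv Pi", "embedded": True,
    "encoding": "WinAnsiEncoding", "family_name": "Adv Pi 1",
    "font_type": "Type1", "italic": False, "monospaced": False,
    "name": "HPCHMK+AdvPi1", "subset": True, "weight": 400,
}

if __name__ == "__main__":
    print(looks_like_symbol_font(font))
```

Text that uses a flagged font is a candidate for manual checking or for a per-font character remapping after extraction.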


    The PDF appears to have the fonts embedded.

    Font List
    ['/HPCHGJ+DigiHolsatia-Halbfett',
    '/HPCHGK+DigiHolsatia-Mager',
    '/HPCHHL+DigiHolsatia-Normal',
    '/HPCHMK+AdvPi1',
    '/HPCIFI+Symbol']

    Unembedded Fonts
    []
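
Each name in the font list above starts with a six-letter tag and a plus sign, which is how PDF marks an embedded font subset: only the glyphs the document uses are included. A short Python sketch for splitting the tag from the base name follows; the function name `split_subset_tag` is an assumption for illustration.

```python
import re
from typing import Optional, Tuple

# A subset font name is six uppercase letters, "+", then the base font name.
# The optional leading "/" matches names printed as PDF name objects.
SUBSET_NAME = re.compile(r"^/?([A-Z]{6})\+(.+)$")


def split_subset_tag(font_name: str) -> Tuple[Optional[str], str]:
    """Return (subset_tag, base_name); subset_tag is None if there is no tag."""
    match = SUBSET_NAME.match(font_name)
    if match:
        return match.group(1), match.group(2)
    return None, font_name.lstrip("/")


if __name__ == "__main__":
    for name in ["/HPCHGJ+DigiHolsatia-Halbfett", "/HPCHMK+AdvPi1", "/HPCIFI+Symbol"]:
        print(split_subset_tag(name))
```

Grouping by base name makes it easier to see that two of the five embedded fonts (AdvPi1 and Symbol) are symbol fonts rather than text fonts.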

     

    Does anyone have any ideas?

    Thanks!

    Participant
    July 27, 2023

    Does anyone have any insights? It would be much appreciated, as I am stuck.