Participant
July 24, 2023
Question

PDF Extract - Tables has poor OCR results

Hi,

 

Quite a number of tables seem to be extracted using OCR. Unfortunately, this results in extraction errors, even when the document itself contains text rather than scanned pages. The documents are in German.


Common Mistakes

- Missing spaces e.g. 7bis 14 Tage , eskannunter

- Accented characters lose their accents, e.g. Ü or Ö might become U or O.

- Superscripts are often used in tables to refer to a key. These may be a list of numbers, letters or special symbols and may be separated by commas.

- Numbers with special characters, e.g. >169 extracted as 4169, are often converted to all digits.
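
As a rough way to quantify the first kind of mistake, a minimal sketch in Python that flags tokens where a digit runs straight into a letter (as in "7bis") might look like the following. The function name and the regular expression are illustrative assumptions, not part of the Extract API. The check will also flag legitimate tokens such as "10mg", and it cannot catch merged words like "eskannunter", so it is only a first-pass filter.

```python
import re
from typing import List

# A digit immediately followed by a letter, e.g. "7bis".
# German umlauts and ß are included because the documents are German.
DIGIT_THEN_LETTER = re.compile(r"\d(?=[A-Za-zÄÖÜäöüß])")


def flag_missing_spaces(cell_text: str) -> List[str]:
    """Return whitespace-separated tokens in which a digit touches a letter."""
    return [tok for tok in cell_text.split() if DIGIT_THEN_LETTER.search(tok)]


print(flag_missing_spaces("7bis 14 Tage"))
```

Running this over every extracted table cell and counting the hits would give a crude per-document error estimate.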


This all adds up to a lack of confidence in table extraction. Are there any options to set the language? How is the OCR done? Can settings be applied?

Kind Regards,
Jono

    This topic has been closed for replies.

    1 comment

    Participant
    July 24, 2023

    Please see the sample document and extracted results. I just noticed that if I cut and paste from Acrobat Reader, I see the same problems. Hmm, maybe something is missing in the PDF, fonts perhaps. Very odd.

    Participant
    July 24, 2023

    So, contrary to the original post, I am guessing this is not an OCR issue.

    This is the font in which the > symbol is being replaced, both on cut and paste and in the text extracted by the API.

    {'alt_family_name': 'Adv Pi', 'embedded': True, 'encoding': 'WinAnsiEncoding', 'family_name': 'Adv Pi 1', 'font_type': 'Type1', 'italic': False, 'monospaced': False, 'name': 'HPCHMK+AdvPi1', 'subset': True, 'weight': 400}
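
One way to act on a font entry like this is to flag text runs set in a symbol ("pi") font and apply a per-font correction afterwards. The sketch below works on a plain dict shaped like the entry above. The hint list, the `CORRECTIONS` table and both function names are hypothetical, and the single mapping of "4" to ">" is only inferred from the ">169" versus "4169" example in this thread, so it would need checking against the actual PDF. It also assumes the symbol comes back as its own text run, separate from the digits that follow it.

```python
import re

# Words that often indicate a symbol ("pi") font. This list is an
# assumption for illustration, not something defined by the API.
SYMBOL_HINTS = {"pi", "symbol", "dingbats", "wingdings"}

# Hypothetical per-font correction table, keyed by family name.
# The one entry is inferred from the ">169" vs "4169" example.
CORRECTIONS = {"Adv Pi 1": {"4": ">"}}


def looks_like_symbol_font(font: dict) -> bool:
    """Heuristic: does any word in the font's family names hint at a symbol font?"""
    names = f"{font.get('family_name', '')} {font.get('alt_family_name', '')}"
    words = {word.lower() for word in re.findall(r"[A-Za-z]+", names)}
    return bool(words & SYMBOL_HINTS)


def correct_run(text: str, font: dict) -> str:
    """Apply the correction table for this font, if one exists."""
    table = CORRECTIONS.get(font.get("family_name", ""), {})
    return text.translate(str.maketrans(table))


font = {"alt_family_name": "Adv Pi", "family_name": "Adv Pi 1",
        "name": "HPCHMK+AdvPi1", "encoding": "WinAnsiEncoding"}
print(looks_like_symbol_font(font))
print(correct_run("4", font))
```

Runs in fonts without an entry in the table pass through unchanged, so the correction can be applied to every run without a separate check.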


    The PDF appears to have the fonts embedded.

    Font List
    ['/HPCHGJ+DigiHolsatia-Halbfett',
    '/HPCHGK+DigiHolsatia-Mager',
    '/HPCHHL+DigiHolsatia-Normal',
    '/HPCHMK+AdvPi1',
    '/HPCIFI+Symbol']

    Unembedded Fonts
    []
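
All five names in the font list start with six uppercase letters and a `+`, which is the PDF convention for an embedded font subset and matches 'subset': True in the entry above. A small sketch for separating that tag from the base font name follows; the function name is an assumption. Note that an embedded font only guarantees the glyph shapes are present. It does not by itself guarantee that the character codes map to the right Unicode text on extraction.

```python
import re
from typing import Optional, Tuple

# Subset tag: six uppercase letters followed by "+", e.g. "HPCHMK+AdvPi1".
# An optional leading "/" is allowed because the list shows raw PDF names.
SUBSET_TAG = re.compile(r"^/?([A-Z]{6})\+(.+)$")


def split_subset_tag(font_name: str) -> Tuple[Optional[str], str]:
    """Return (subset_tag, base_name); subset_tag is None for non-subset fonts."""
    match = SUBSET_TAG.match(font_name)
    if match:
        return match.group(1), match.group(2)
    return None, font_name.lstrip("/")


for name in ["/HPCHGJ+DigiHolsatia-Halbfett", "/HPCHMK+AdvPi1", "/HPCIFI+Symbol"]:
    print(split_subset_tag(name))
```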

     

    Does anyone have any ideas?

    Thanks!

    Participant
    July 27, 2023

    Does anyone have any insights? It would be much appreciated, as I am stuck.