Hi,
Quite a number of tables seem to be extracted using OCR. Unfortunately, this results in extraction errors, even when the document itself contains text rather than scanned pages. The documents are in German.
Common Mistakes
- Missing spaces, e.g. 7bis 14 Tage, eskannunter
- Accented characters lose their diacritics, e.g. Ü or Ö might become U or O.
- Superscripts, which tables often use to refer to a key, are affected too: they may be a list of numbers, letters or special symbols and may be separated by commas.
- Numbers with special characters are often converted to digits only, e.g. >169 becomes 4169.
This all adds up to a lack of confidence in table extraction. Are there any options to set the language? How is the OCR done? Can settings be applied?
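Until the cause is known, the first and last patterns in the list above could at least be flagged automatically. The following is only a rough sketch: `VOCAB`, `splits_into_words` and `flag_cell` are hypothetical names, the tiny vocabulary stands in for a real German word list, and legitimate German compounds or unit strings such as "10mg" would also be flagged.

```python
import re

# Hypothetical mini vocabulary; a real check would load a German word list.
VOCAB = {"es", "kann", "unter", "bis", "tage"}

# A digit immediately followed by two or more letters, e.g. "7bis".
DIGIT_THEN_LETTERS = re.compile(r"\d[A-Za-zÄÖÜäöüß]{2,}")


def splits_into_words(token, vocab):
    """Return True if token is a concatenation of two or more vocab words."""
    token = token.lower()
    n = len(token)
    # best[i] = fewest vocab words that exactly cover token[:i], or None.
    best = [None] * (n + 1)
    best[0] = 0
    for end in range(1, n + 1):
        for start in range(end):
            if best[start] is not None and token[start:end] in vocab:
                count = best[start] + 1
                if best[end] is None or count < best[end]:
                    best[end] = count
    return best[n] is not None and best[n] >= 2


def flag_cell(text, vocab=VOCAB):
    """Return a list of heuristic warnings for one extracted table cell."""
    issues = []
    if DIGIT_THEN_LETTERS.search(text):
        issues.append("digit-letter")
    for token in re.findall(r"[A-Za-zÄÖÜäöüß]+", text):
        if splits_into_words(token, vocab):
            issues.append(f"fused:{token}")
    return issues
```

Cells that come back with a non-empty list could then be routed to manual review rather than trusted as extracted.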
Kind Regards,
Jono
So, contrary to my original post, I am guessing this is not an OCR issue.
This is the font in which the > symbol is being replaced, both on cut and paste and in the text extracted from the API.
{'alt_family_name': 'Adv Pi', 'embedded': True, 'encoding': 'WinAnsiEncoding', 'family_name': 'Adv Pi 1', 'font_type': 'Type1', 'italic': False, 'monospaced': False, 'name': 'HPCHMK+AdvPi1', 'subset': True, 'weight': 400}
The PDF appears to have the fonts embedded.
Font List
['/HPCHGJ+DigiHolsatia-Halbfett',
'/HPCHGK+DigiHolsatia-Mager',
'/HPCHHL+DigiHolsatia-Normal',
'/HPCHMK+AdvPi1',
'/HPCIFI+Symbol']
Unembedded Fonts
[]
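One workaround I can think of is to post-process the extracted text and map the substituted characters back, but only for text reported in the affected font. The sketch below rests on several assumptions: the element layout (a `text` key plus a `font` dict shaped like the one above) is hypothetical, `PI_FONT_FIXES` and `repair_text` are made-up names, and the single `4` to `>` entry is only inferred from the `>169` / `4169` example. It also only helps if the symbol is reported as its own span; if the digits share the span, a genuine 4 would be changed as well.

```python
# Hypothetical per-font character fixes, keyed by the font name without its
# subset prefix (e.g. "HPCHMK+AdvPi1" -> "AdvPi1"). The mapping is an
# assumption based on one observed substitution, not a known font table.
PI_FONT_FIXES = {"AdvPi1": {"4": ">"}}


def base_font_name(name):
    """Strip a subset prefix such as 'HPCHMK+' from a font name."""
    return name.split("+", 1)[-1]


def repair_text(element, fixes=PI_FONT_FIXES):
    """Return the element's text with known substitutions reversed."""
    font_name = base_font_name(element.get("font", {}).get("name", ""))
    table = fixes.get(font_name)
    if not table:
        # Unknown font: leave the text untouched.
        return element["text"]
    return element["text"].translate(str.maketrans(table))
```

For example, `repair_text({"text": "4", "font": {"name": "HPCHMK+AdvPi1"}})` gives back `>`, while text in any other font passes through unchanged.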
Anyone any ideas??
Thanks!
Does anyone have any insights? They would be much appreciated, as I am stuck.
As far as I have read, the OCR was tuned on English, so there is a chance that German doesn't work as well.