Participant
July 9, 2022
Question

Unreliable text extraction with Python SDK


Hello,

I have been working with the API for text extraction from Python. I am using it both on poorly structured tables and on very well-defined documents, such as public administration forms, which are quite similar to tables. I decided to work with the JSON files generated by the API. Even with the very well-defined documents I encounter two main problems:

- the API fails to detect 100% of the 'cells', occasionally joining two or more 'cells' into a single one ('Text' field of the JSON doc)

- the errors are not consistent: the output of processing the same document several times is not exactly the same. Even worse, extracting text from the same document several times seems to increase both the number and the severity of the errors in the JSON output, to the point of joining the contents of a whole page into a single text cell of the JSON doc.

I need to understand the reason for the increase in errors and how to avoid the most serious ones. I need reliability: I have learned to deal with the fusion of two cells, but I need to be sure that the errors will not go further than that.
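
To make the inconsistency between runs measurable, here is a minimal sketch of the kind of check I have in mind. It only assumes that the JSON output contains string values under a 'Text' key somewhere in a nested structure; the function names (load_run, collect_texts, compare_runs, suspicious_cells) and the max_chars threshold are hypothetical and not part of any API.

```python
import json
from collections import Counter


def load_run(path):
    """Load one JSON output file saved from an extraction run."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def collect_texts(node):
    """Recursively gather every string stored under a 'Text' key."""
    texts = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "Text" and isinstance(value, str):
                texts.append(value)
            else:
                texts.extend(collect_texts(value))
    elif isinstance(node, list):
        for item in node:
            texts.extend(collect_texts(item))
    return texts


def compare_runs(run_a, run_b):
    """Return the 'Text' values that appear in only one of the two runs."""
    count_a = Counter(collect_texts(run_a))
    count_b = Counter(collect_texts(run_b))
    only_a = sorted((count_a - count_b).elements())
    only_b = sorted((count_b - count_a).elements())
    return only_a, only_b


def suspicious_cells(run, max_chars=200):
    """Flag 'Text' values longer than max_chars (an assumed threshold),
    which may indicate that several cells were merged into one."""
    return [text for text in collect_texts(run) if len(text) > max_chars]
```

With two outputs of the same document loaded through load_run, compare_runs shows which cells differ between runs, and suspicious_cells gives a rough alarm for the case where a whole page ends up in a single text cell. The threshold would need tuning per document type.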

Thanks,

Pablo

This topic has been closed for replies.

1 reply

Legend
July 9, 2022

The Acrobat SDK doesn't have a Python API. Which API are you using?