Unreliable text extraction with Python SDK
Hello,
I have been using the API for text extraction from Python. I use it both on loosely structured tables and on very well-defined documents, such as public administration forms, which are laid out much like tables. I decided to work with the JSON files generated by the API. Even with the well-defined documents I run into two main problems:
- The API does not detect 100% of the 'cells'; it occasionally joins two or more 'cells' into a single one (the 'Text' field of the JSON document).
- The errors are not consistent: processing the same document several times does not produce exactly the same output. Even worse, extracting text from the same document repeatedly seems to increase both the number of errors in the JSON output and their size, to the point of merging the contents of a whole page into a single text cell of the JSON document.
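To make the second point measurable, here is a minimal sketch of how two runs over the same document could be compared. It assumes only that each run's JSON has been loaded into Python objects and that the extracted strings sit under a 'Text' key somewhere in the structure; the function names and the "Blocks" key in the usage example are hypothetical, not part of the API.

```python
from collections import Counter


def collect_texts(node):
    """Recursively collect every string stored under a 'Text' key."""
    texts = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "Text" and isinstance(value, str):
                texts.append(value)
            else:
                texts.extend(collect_texts(value))
    elif isinstance(node, list):
        for item in node:
            texts.extend(collect_texts(item))
    return texts


def diff_runs(run_a, run_b):
    """Return (texts only in run_a, texts only in run_b), each sorted."""
    a = Counter(collect_texts(run_a))
    b = Counter(collect_texts(run_b))
    return sorted((a - b).elements()), sorted((b - a).elements())


# Hypothetical example: the second run merged two cells into one.
run_1 = {"Blocks": [{"Text": "Name"}, {"Text": "Surname"}]}
run_2 = {"Blocks": [{"Text": "Name Surname"}]}
only_in_1, only_in_2 = diff_runs(run_1, run_2)
print(only_in_1, only_in_2)
```

If both lists come back empty, the two runs produced the same set of text cells; anything else shows exactly which cells changed between runs.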
I need to understand why the errors increase and how to avoid the most serious ones. I need reliability: I have learned to deal with the fusion of two cells, but I need to be sure that the errors will not go beyond that.
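As a stopgap, the kind of guard I have in mind could look like the sketch below. It flags any cell whose text is suspiciously long, which would catch a whole page collapsed into one cell. The function name and the word-count threshold are assumptions chosen for illustration, and the input is simply a list of the 'Text' strings taken from the JSON output.

```python
def flag_oversized_cells(texts, max_words=20):
    """Return (index, text) for every cell with more than max_words words.

    max_words is an arbitrary illustrative threshold; it would need tuning
    to the longest legitimate cell in the forms being processed.
    """
    return [
        (index, text)
        for index, text in enumerate(texts)
        if len(text.split()) > max_words
    ]


# Hypothetical example with a low threshold to keep it short.
cells = ["Name", "Date of birth", "Name Surname Date of birth Address City"]
suspicious = flag_oversized_cells(cells, max_words=4)
print(suspicious)
```

A document with any flagged cell could then be resubmitted or set aside for manual review instead of being processed further.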
Thanks,
Pablo
