PDF Extraction never the same format

Jul 14, 2022

Hi, I just started trying your PDF extraction API for a client. It seems to be working fine, but I am having trouble getting the data I need: even though all the PDFs are visually exactly the same, the paths, row counts and indexes are never the same for the same elements.

Here is an example:

For the first invoice PDF, if I am trying to get the first "Cat. No." :

Path : //Document/Sect/Table/TR[7]/TD[3]/P

Index : 68

Row count : 6

For the second invoice, first "Cat. No." :

Path : //Document/Sect/Table[2]/TR[2]/TD[3]/P

Index : 60

Row count : 5

Am I doing something wrong or is the API not precise enough?

Thank you for your help!

Jul 14, 2022

The paths are calculated on a document-by-document basis. There are probably differences that the AI sees that we don't. When you Extract with tables, what do the tables look like? The extracted tables should be consistent.
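
Since the paths shift from document to document, one possible workaround is to locate a value relative to its column header text instead of hard-coding a path or index. The sketch below is only an illustration under some assumptions: that the extraction output has been loaded as a list of elements, each a dict with `Path` and `Text` keys, and that table cells appear as `.../Table[n]/TR[n]/TD[n]/...` (or `TH`) as in the paths quoted above. The helper names `parse_cell` and `column_values` and the sample values are made up for this example and are not part of the API.

```python
import re

# Matches a cell path such as "//Document/Sect/Table[2]/TR[2]/TD[3]/P".
# A missing index (e.g. "Table" or "TR") is treated as 1.
CELL_RE = re.compile(
    r"^(?P<table>.*/Table(?:\[\d+\])?)"
    r"/TR(?:\[(?P<row>\d+)\])?"
    r"/T[DH](?:\[(?P<col>\d+)\])?(?:/|$)"
)


def parse_cell(path):
    """Return (table_path, row, column) for a table-cell path, else None."""
    m = CELL_RE.match(path)
    if not m:
        return None
    return m.group("table"), int(m.group("row") or 1), int(m.group("col") or 1)


def column_values(elements, header_text):
    """Collect the texts found below the first cell whose text equals header_text."""
    header = None
    for el in elements:
        if el.get("Text", "").strip() == header_text:
            header = parse_cell(el.get("Path", ""))
            if header:
                break
    if header is None:
        return []

    table, header_row, col = header
    values = []
    for el in elements:
        cell = parse_cell(el.get("Path", ""))
        # Same table, same column, any row after the header row.
        if cell and cell[0] == table and cell[2] == col and cell[1] > header_row:
            if "Text" in el:
                values.append(el["Text"].strip())
    return values


# Hypothetical sample data: the same column lives at different paths.
invoice_a = [
    {"Path": "//Document/Sect/Table/TR[6]/TD[3]/P", "Text": "Cat. No. "},
    {"Path": "//Document/Sect/Table/TR[7]/TD[3]/P", "Text": "A-100 "},
    {"Path": "//Document/Sect/Table/TR[7]/TD[4]/P", "Text": "2 "},
]
invoice_b = [
    {"Path": "//Document/Sect/Table[2]/TR/TD[3]/P", "Text": "Cat. No. "},
    {"Path": "//Document/Sect/Table[2]/TR[2]/TD[3]/P", "Text": "B-200 "},
    {"Path": "//Document/Sect/Table/TR[2]/TD[3]/P", "Text": "unrelated "},
]

first_a = column_values(invoice_a, "Cat. No.")[:1]
first_b = column_values(invoice_b, "Cat. No.")[:1]
```

With this approach the lookup depends only on the header label and the cell's position inside its own table, so it should tolerate a table moving from `Table` to `Table[2]` or a header landing on a different row. It will still break if the header text itself is extracted differently between documents.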

Adobe Community