Adobe Extract API Problem with Structure
- June 30, 2021
We are using the Adobe Extract API. We send a simple single-page PDF containing 3 articles; each article has an image, and the articles are separated by a horizontal rule. The JSON returned is then converted to HTML.
Problem: The correct image is not being displayed with the relevant text. Also, the elements of the PDF page are not listed in a workable way, i.e. there is no indication of where each article starts or stops.
I read that Adobe uses machine learning to analyse PDFs, so we assumed this basic level of functionality would be present. We are probably missing something obvious, but please can you advise?
I attach the single-page PDF that was processed by the Adobe Extract API, the JSON returned, and a PDF of the HTML we generated from the JSON (sadly you can't attach HTML files here). The PDF of the HTML shows that the first article is paired with the wrong image: the JSON has associated the wrong image with the text. It's not a complex page.
If the Adobe system can't place the right image with its relevant text, then I can't see how it is of any use. Please tell me we're doing something wrong.
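For reference, here is a minimal sketch of the kind of regrouping we are considering on our side: ignoring the order in which the JSON lists the elements and rebuilding the articles from page position instead. It rests on several assumptions that may not hold for every document: each element carries `Path`, `Page` and `Bounds` keys, `Bounds` is `[left, bottom, right, top]` with the origin at the bottom-left of the page, figures expose their image files under `filePaths`, heading paths end in `/H`, `/H1`, `/H2` and so on, and the page is a single column where every article begins with a heading. The function name `group_into_articles` is our own and not part of any Adobe SDK.

```python
import re

# Assumption: a heading's Path ends in /H, /H1, /H2, ... with an optional index such as [2].
HEADING_PATH = re.compile(r"/H\d*(\[\d+\])?$")


def group_into_articles(elements, page=0):
    """Group Extract elements into articles by vertical position (hypothetical helper).

    Elements are sorted from the top of the page downwards; each heading starts
    a new article, and every following text or figure element is attached to the
    most recent heading above it.
    """
    # Keep only elements on the requested page that have a bounding box.
    positioned = [e for e in elements if e.get("Page") == page and "Bounds" in e]
    # Bounds[3] is assumed to be the top edge; a larger value means higher on the page.
    positioned.sort(key=lambda e: e["Bounds"][3], reverse=True)

    articles = []
    for el in positioned:
        is_heading = bool(HEADING_PATH.search(el.get("Path", "")))
        # Start a new article at each heading (or for content above the first heading).
        if is_heading or not articles:
            articles.append({"heading": None, "text": [], "images": []})
        current = articles[-1]
        if is_heading:
            current["heading"] = el.get("Text", "").strip()
        elif "filePaths" in el:
            current["images"].extend(el["filePaths"])
        elif "Text" in el:
            current["text"].append(el["Text"].strip())
    return articles
```

Calling `group_into_articles(data["elements"])` on the parsed JSON would give one dictionary per article, each with its own `heading`, `text` and `images`, which we could then render to HTML. This would not cope with multi-column pages or with an image placed above its heading, so we would still like to know whether the API is meant to provide this grouping itself.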
