Adobe Extract API Problem with Structure

Question

We are using the Adobe Extract API. We send a simple PDF page with three articles; each article has an image, and the articles are separated by a ruled line. The JSON returned is then converted to HTML.

Problem: The correct image is not being displayed with its relevant text. Also, the elements of the PDF page are not listed in a workable way, i.e. there is no identification of where articles start and stop.

I read that Adobe use Machine Learning to analyse PDFs, so this basic level of functionality was assumed to be present. We are probably missing something obvious, but please can you advise?

I attach the single-page PDF which was processed by the Adobe Extract API, the JSON returned, and a PDF of the HTML we generated from the JSON (sadly, you can't attach HTML files here). The PDF of the HTML shows that the first article is paired with the wrong image: the JSON has referenced the wrong image with the text. It's not a complex page.

If the Adobe system can't place the right image with its relevant text, then I can't see how it is of any use. Please tell me we're doing something wrong.

Joel Geraci · Accepted Answer

Would a CodePen be helpful?

Joel Geraci · Answer

Extract API is interpreting the page layout as two-column and, from its perspective, the order of the elements is accurate. The layout isn't complex to humans... but this is an AI.

I'm very used to looking at the Extract output, so I'm a bit biased here, but by looking at the Path property of each element it's fairly simple to see where one article starts and stops. Each article's H1 element starts a new Sect, and that Sect continues until the next H1. That's an article. I can also compare the Bounds of each figure with the Bounds of each H1, matching the upper-left coordinates of the H1 elements to those of the figure elements, to associate each image with its article.
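
To make that concrete, here is a rough Python sketch of the idea, not a definitive implementation. It assumes the Extract JSON has a top-level `elements` list whose items carry `Path`, `Bounds`, `Text`, and (for figures) `filePaths`, and that `Bounds` is `[left, bottom, right, top]` in PDF points; check both against your own output. The helper names (`tag_of`, `upper_left`, `group_articles`) and the sample data are made up for illustration, and "nearest upper-left corner" is just one possible matching rule that may need adjusting for your layout.

```python
import math
import re


def tag_of(path):
    """Last Path segment without its index, e.g. '//Document/Sect[2]/H1' -> 'H1'."""
    last = path.rsplit("/", 1)[-1]
    return re.sub(r"\[\d+\]$", "", last)


def upper_left(bounds):
    # Assumption: Bounds is [left, bottom, right, top] with the origin at bottom-left.
    left, _bottom, _right, top = bounds
    return (left, top)


def group_articles(elements):
    """Start a new article at every H1, then attach figures by geometry."""
    articles, figures = [], []
    for el in elements:
        tag = tag_of(el["Path"])
        if tag == "H1":
            articles.append({"heading": el, "body": [], "figures": []})
        elif tag == "Figure":
            figures.append(el)  # placed later, ignoring reading order
        elif articles and "Text" in el:
            articles[-1]["body"].append(el)  # text before the first H1 is skipped

    for fig in figures:
        if not articles:
            break
        corner = upper_left(fig["Bounds"])
        nearest = min(
            articles,
            key=lambda a: math.dist(corner, upper_left(a["heading"]["Bounds"])),
        )
        nearest["figures"].append(fig)
    return articles


# Hypothetical sample: the first two figures arrive in the "wrong" order.
SAMPLE = [
    {"Path": "//Document/Sect/H1", "Text": "Article one", "Bounds": [36, 700, 300, 720]},
    {"Path": "//Document/Sect/P", "Text": "Body one", "Bounds": [210, 560, 576, 690]},
    {"Path": "//Document/Sect/Figure", "Bounds": [36, 310, 200, 440], "filePaths": ["figures/b.png"]},
    {"Path": "//Document/Sect[2]/H1", "Text": "Article two", "Bounds": [36, 450, 300, 470]},
    {"Path": "//Document/Sect[2]/P", "Text": "Body two", "Bounds": [210, 310, 576, 440]},
    {"Path": "//Document/Sect[2]/Figure", "Bounds": [36, 560, 200, 690], "filePaths": ["figures/a.png"]},
    {"Path": "//Document/Sect[3]/H1", "Text": "Article three", "Bounds": [36, 200, 300, 220]},
    {"Path": "//Document/Sect[3]/P", "Text": "Body three", "Bounds": [210, 60, 576, 190]},
    {"Path": "//Document/Sect[3]/Figure", "Bounds": [36, 60, 200, 190], "filePaths": ["figures/c.png"]},
]

articles = group_articles(SAMPLE)
for article in articles:
    images = [f["filePaths"][0] for f in article["figures"]]
    print(article["heading"]["Text"], images)
```

Text is grouped by sequence (everything after an H1 belongs to it), while figures are placed purely by position, so an image that shows up out of order in the JSON can still land with the right article.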

Extract API isn't a PDF-to-HTML converter, but it does give you all of the information required to produce a reasonable re-representation of the PDF in HTML. You just have to parse the JSON a bit more than simply outputting HTML elements in the order they occur in the JSON. I've done this multiple times, converting Extract output to HTML tables and lists.
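
As a small illustration of that extra step, the sketch below renders articles that have already been grouped (heading, image paths, paragraphs) instead of emitting elements in raw JSON order. The dictionary shape, the function names `render_article` and `render_page`, and the sample content are all hypothetical; this is Python standard library only, not anything provided by the Extract API.

```python
from html import escape


def render_article(article):
    """Render one grouped article as an HTML fragment."""
    parts = ["<article>", f"  <h1>{escape(article['heading'])}</h1>"]
    for src in article["figures"]:
        # escape() also escapes quotes by default, so it is safe inside an attribute.
        parts.append(f'  <img src="{escape(src)}" alt="">')
    for text in article["paragraphs"]:
        parts.append(f"  <p>{escape(text)}</p>")
    parts.append("</article>")
    return "\n".join(parts)


def render_page(articles):
    """Join articles with <hr>, standing in for the ruled line between them."""
    return "\n<hr>\n".join(render_article(a) for a in articles)


# Hypothetical grouped input; in practice this comes from parsing the Extract JSON.
GROUPED = [
    {"heading": "Article one", "figures": ["figures/a.png"], "paragraphs": ["Body one"]},
    {"heading": "Q&A <two>", "figures": ["figures/b.png"], "paragraphs": ["Body two"]},
]

print(render_page(GROUPED))
```

Escaping the text matters because headings and paragraphs taken from a PDF can contain characters such as `<` or `&` that would otherwise break the generated HTML.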
