We are using the Adobe Extract API. We send a simple one-page PDF containing 3 articles; each article has an image, and the articles are separated from each other by a horizontal rule. The JSON returned is then converted to HTML.
Problem: The correct image is not being displayed with the relevant text. Also, the elements of the PDF page are not listed in a workable way, i.e. there is no identification of where articles start and stop.
I read that Adobe use Machine Learning to analyse PDFs, so this basic level of functionality was assumed to be present. We are probably missing something obvious, but please can you advise?
I attach the single-page PDF which was processed by the Adobe Extract API, the JSON returned, and a PDF of the HTML we generated from the JSON (sadly you can't attach HTML files here). The PDF of the HTML shows that the first article is paired with the wrong image. The JSON has referenced the wrong image with the text. It's not a complex page.
If the Adobe system can't place the right image with its relevant text then I can't see how it is of any use? Please tell me we're doing something wrong.
Extract API is interpreting the page layout as two-column, and from its perspective the order of the elements is accurate. The layout isn't complex to humans, but this is an AI. I'm very used to looking at the Extract output, so I'm a bit biased here, but by looking at the Path property of each element it's fairly simple to see where one article starts and stops. Each article's H1 element starts a new Sect, and that Sect continues until the next H1. That's an article. I can also look at the Bounds of each figure and the Bounds of each H1, and match up the upper-left coordinates of the H1 elements with those of the figure elements, which gives the association of each image to its article.
Extract API isn't a PDF-to-HTML converter, but it does give you all of the information required to produce a reasonable re-representation of the PDF in HTML. You just have to parse the JSON a bit more than simply outputting HTML elements in the order they occur in the JSON. I've done this multiple times, converting Extract output to HTML tables and lists.
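As a rough illustration of the approach described above, here is a minimal Python sketch. It assumes each element carries Path, Bounds, Text, Page and (for figures) filePaths keys, and that Bounds is ordered [left, bottom, right, top] in PDF points; check both against your own JSON. The helper names (tag_of, upper_left, group_articles) and the sample data are made up for this example and are not part of the Extract API.

```python
import math
import re


def tag_of(path):
    """Return the last Path segment without its index, e.g. 'H1' or 'Figure'."""
    return re.sub(r"\[\d+\]$", "", path.rsplit("/", 1)[-1])


def upper_left(bounds):
    # Assumption: Bounds is [left, bottom, right, top].
    left, _bottom, _right, top = bounds
    return (left, top)


def group_articles(elements):
    """Start a new article at every H1, then attach each figure to the
    H1 on the same page whose upper-left corner is closest."""
    articles, figures = [], []
    for el in elements:
        tag = tag_of(el.get("Path", ""))
        if tag == "H1" and "Bounds" in el:
            articles.append({
                "title": el.get("Text", "").strip(),
                "corner": upper_left(el["Bounds"]),
                "page": el.get("Page"),
                "body": [],
                "images": [],
            })
        elif tag == "Figure":
            figures.append(el)
        elif articles and "Text" in el:
            # Text that follows an H1 belongs to that article.
            articles[-1]["body"].append(el["Text"].strip())

    for fig in figures:
        same_page = [a for a in articles if a["page"] == fig.get("Page")]
        if not same_page or "Bounds" not in fig:
            continue
        corner = upper_left(fig["Bounds"])
        nearest = min(same_page, key=lambda a: math.dist(a["corner"], corner))
        nearest["images"].extend(fig.get("filePaths", []))
    return articles


# Invented sample: the figures come last in reading order (second column),
# so sequence alone would attach all of them to the last article.
sample = [
    {"Path": "//Document/Sect/H1", "Text": "Article A", "Page": 0, "Bounds": [36, 700, 300, 720]},
    {"Path": "//Document/Sect/P", "Text": "Text of article A.", "Page": 0, "Bounds": [36, 520, 300, 690]},
    {"Path": "//Document/Sect[2]/H1", "Text": "Article B", "Page": 0, "Bounds": [36, 450, 300, 470]},
    {"Path": "//Document/Sect[2]/P", "Text": "Text of article B.", "Page": 0, "Bounds": [36, 270, 300, 440]},
    {"Path": "//Document/Sect[3]/H1", "Text": "Article C", "Page": 0, "Bounds": [36, 200, 300, 220]},
    {"Path": "//Document/Sect[3]/P", "Text": "Text of article C.", "Page": 0, "Bounds": [36, 40, 300, 190]},
    {"Path": "//Document/Sect[3]/Figure", "Page": 0, "Bounds": [320, 565, 570, 715], "filePaths": ["figures/fileoutpart0.png"]},
    {"Path": "//Document/Sect[3]/Figure[2]", "Page": 0, "Bounds": [320, 315, 570, 465], "filePaths": ["figures/fileoutpart1.png"]},
    {"Path": "//Document/Sect[3]/Figure[3]", "Page": 0, "Bounds": [320, 65, 570, 215], "filePaths": ["figures/fileoutpart2.png"]},
]

articles = group_articles(sample)
for article in articles:
    print(article["title"], article["images"])
```

Nearest-corner matching is only one possible heuristic. If your images sit below or beside the text rather than level with the heading, comparing vertical overlap between each figure and the span from one H1 to the next may work better.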
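To give an idea of what that extra parsing leads to, here is a minimal Python sketch of the final step. It assumes the elements have already been grouped into one dictionary per article; the dictionary shape (title, images, body) and the function names are hypothetical and chosen for this example only. Coordinates are used only to decide which image belongs to which article, never for positioning, so the markup contains no absolute layout.

```python
from html import escape


def render_article(article):
    """Emit one <article> block: heading, then images, then paragraphs."""
    parts = ["<article>", f"  <h1>{escape(article['title'])}</h1>"]
    for src in article.get("images", []):
        # No width/height or absolute positions: the image scales with its container.
        parts.append(
            f'  <img src="{escape(src)}" alt="" style="max-width: 100%; height: auto;">'
        )
    for text in article.get("body", []):
        parts.append(f"  <p>{escape(text)}</p>")
    parts.append("</article>")
    return "\n".join(parts)


def render_page(articles):
    # An <hr> between articles stands in for the rule line in the PDF.
    return "\n<hr>\n".join(render_article(a) for a in articles)


# Invented sample data in the assumed shape.
grouped = [
    {"title": "Article A", "images": ["figures/a.png"], "body": ["First paragraph."]},
    {"title": "B & C", "images": [], "body": ["Uses <angle> brackets."]},
]

page_html = render_page(grouped)
print(page_html)
```

Escaping the extracted text with html.escape matters here because PDF text can contain characters such as & and < that would otherwise break the generated markup.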
Thank you, Joel_Geraci.
Your reply is very helpful to my project.
But I am still confused.
- First of all, I can see Sect numbers in the extracted JSON file that do not match the articles.
Because of that, I don't think I can rely on them to group the elements by article.
- Next, I have no idea how to apply the Bounds information to the HTML.
Because our HTML is responsive, I cannot use the absolute boundary information provided in the extracted JSON.
Could you share your experience with this part, please? Thank you.
Would a CodePen be helpful?
Yes. I would be grateful if you could provide some example code. Thank you.