Issue with parsing large PDFs and converting JSON to consistently applied HTML

December 5, 2022

I'm having an issue converting a large PDF (100+ pages) with images and complex tables (PDF and converted HTML attached).

Some paragraphs begin with numbers, and those numbers are being placed in separate divs that overlap the paragraph text they belong to.

There are also large white spaces where the headers, footers, and page breaks are.

Any help appreciated
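
For the overlapping numbers, one option is to fold each number into the text it introduces before any HTML is generated, so both end up in the same element. Below is a minimal Python sketch of that idea. It assumes each entry in the JSON's `elements` array carries `Path` and `Text` keys, and that a leading number arrives as its own element whose `Path` ends in `Lbl`; that tagging is an assumption about this particular output, not something confirmed in the thread. The helper names `last_segment` and `merge_labels` are hypothetical, and the same logic ports directly to PHP.

```python
import re


def last_segment(path):
    """Return the final Path segment without a trailing index, e.g. 'P[3]' -> 'P'."""
    return re.sub(r"\[\d+\]$", "", path.rsplit("/", 1)[-1])


def merge_labels(elements):
    """Fold each label element into the next element that carries text.

    Assumption: a leading number is emitted as a separate element whose
    Path ends in 'Lbl', immediately before the text it belongs to.
    """
    merged = []
    pending_label = None
    for el in elements:
        text = el.get("Text", "")
        if last_segment(el.get("Path", "")) == "Lbl":
            # Hold the label back instead of emitting its own div.
            pending_label = text.strip()
            continue
        if pending_label is not None and text:
            # Copy the element so the parsed JSON is left untouched.
            el = dict(el, Text=f"{pending_label} {text.lstrip()}")
            pending_label = None
        merged.append(el)
    if pending_label is not None:
        # A label with no following text is kept as a plain paragraph.
        merged.append({"Path": "//Document/P", "Text": pending_label})
    return merged
```

Running the renderer over `merge_labels(data["elements"])` instead of the raw list would then produce one element per numbered paragraph rather than two overlapping ones.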

PDF Extract API



marko30856491 (Author)

Participating Frequently

I've attached the PDF and the output JSON.

IE_SPC_WS-2187_OPDIVO_EN_PI_clean.pdf

structuredData.zip

Joel Geraci

Community Expert

I don't think the structuredData.json belongs with that PDF. The JSON contains a document title "SUMMARY OF PRODUCT CHARACTERISTICS" that does not appear in the document.


marko30856491 (Author)

Participating Frequently

Extract doesn't pull metadata from the document. The text "SUMMARY OF PRODUCT CHARACTERISTICS" does not occur on any page in the PDF you uploaded to this thread. That means that the JSON does not belong to that file. I ran your document through Extract and have attached the proper JSON to this response. I've also attached a PDF containing annotations that "visualize" what Extract found. It looks fine to me. How exactly are you interpreting the JSON to create HTML? That might be where the issue is.

Hi Joel,

You are right, my problem is in how I interpret the JSON. In your example I see that you were able to render page 108 correctly, but in my case it looks like this:

I'm using PHP to generate the HTML:

Can you guide me a bit on the coordinates, and on whether I should create any elements other than the "div"?
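
On the coordinates question, here is a rough Python sketch of two approaches; it should translate to PHP almost line for line. It assumes each element's `Bounds` is `[left, bottom, right, top]` in PDF points with the origin at the bottom-left of the page, which is my understanding of the Extract output but worth checking against your own JSON. The `TAG_FOR_ROLE` mapping and the function names are hypothetical, and tables, lists, and figures are left out because they need nested markup.

```python
import re
from html import escape

# Assumed mapping from the last Path segment to an HTML tag; anything
# not listed here falls back to <p>.
TAG_FOR_ROLE = {"Title": "h1", "H1": "h1", "H2": "h2", "H3": "h3", "P": "p"}


def role_of(path):
    """Return the last Path segment without its index, e.g. 'P[2]' -> 'P'."""
    return re.sub(r"\[\d+\]$", "", path.rsplit("/", 1)[-1])


def bounds_to_css(bounds, page_height):
    """Convert PDF bounds (origin bottom-left) to CSS box values (origin top-left).

    Assumes bounds is [left, bottom, right, top]; results are in points.
    """
    left, bottom, right, top = bounds
    return {
        "left": left,
        "top": page_height - top,  # flip the vertical axis
        "width": right - left,
        "height": top - bottom,
    }


def render_flow(elements):
    """Emit semantic tags in reading order and ignore Bounds entirely."""
    parts = []
    for el in elements:
        text = el.get("Text")
        if not text:
            continue  # skip containers and elements without text
        tag = TAG_FOR_ROLE.get(role_of(el.get("Path", "")), "p")
        parts.append(f"<{tag}>{escape(text.strip())}</{tag}>")
    return "\n".join(parts)
```

`bounds_to_css` is only needed if you want a pixel-faithful, absolutely positioned page. If the goal is readable HTML, something like `render_flow` may suit better: with no absolute positions, text cannot overlap and page margins do not turn into empty gaps, at the cost of losing the original layout.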
