Issue with parsing large PDFs and converting JSON to consistently applied HTML

December 5, 2022

I'm having an issue converting a large PDF (100+ pages) with images and complex tables (PDF and converted HTML attached).

Some paragraphs lead with numbers, and those numbers are being placed in separate divs that overlap the paragraph text they belong to.

There are also large white spaces where the headers, footers, and page breaks are.
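For illustration, here is a minimal sketch of one possible post-processing step for the two symptoms described above. It assumes that the output JSON has a top-level `elements` list whose entries carry `Path` and `Text` fields (the sample data below is invented, not taken from the attached file). The helper names `merge_leading_numbers` and `to_flow_html` are hypothetical and are not part of the PDF Extract API. The idea is to fold a number-only element into the paragraph that follows it, then emit the HTML in normal document flow rather than at absolute page coordinates, so nothing overlaps and page-break gaps are not reproduced.

```python
import html
import re

# A bare section number such as "4", "4.1" or "4.1." with nothing else.
NUMBER_ONLY = re.compile(r"^\d+(\.\d+)*\.?$")


def merge_leading_numbers(elements):
    """Fold a number-only element into the text element that follows it."""
    merged = []
    pending = None  # number-only element waiting for its paragraph
    for el in elements:
        text = (el.get("Text") or "").strip()
        if not text:
            continue  # this sketch ignores figures and other text-less elements
        if pending is None and NUMBER_ONLY.match(text):
            pending = {**el, "Text": text}
            continue
        if pending is not None:
            text = f"{pending['Text']} {text}"
            pending = None
        merged.append({**el, "Text": text})
    if pending is not None:
        merged.append(pending)  # trailing number with nothing to attach to
    return merged


def to_flow_html(elements):
    """Render elements as flowing <h1>..<h6>/<p> tags, with no absolute positioning."""
    lines = []
    for el in elements:
        # Last path segment without its index, e.g. "//Document/H1[2]" -> "H1".
        kind = re.sub(r"\[\d+\]$", "", el.get("Path", "").rsplit("/", 1)[-1])
        tag = kind.lower() if re.fullmatch(r"H[1-6]", kind) else "p"
        lines.append(f"<{tag}>{html.escape(el['Text'])}</{tag}>")
    return "\n".join(lines)


if __name__ == "__main__":
    # Invented sample in the assumed shape of the "elements" list.
    sample = [
        {"Path": "//Document/P", "Text": "4.1"},
        {"Path": "//Document/P[2]", "Text": "Therapeutic indications"},
    ]
    print(to_flow_html(merge_leading_numbers(sample)))
```

This is a heuristic, not a guarantee of accuracy: a page number extracted as its own element would also match the number-only pattern and be glued onto the next paragraph, so a real pipeline would need a stricter check and manual review.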

Any help appreciated

PDF Extract API


marko30856491 (Author)

Participating Frequently

I've attached the PDF and the output JSON.

IE_SPC_WS-2187_OPDIVO_EN_PI_clean.pdf

structuredData.zip

Joel Geraci

Community Expert

I don't think the structuredData.json belongs with that PDF. The JSON contains a document title "SUMMARY OF PRODUCT CHARACTERISTICS" that does not appear in the document.

marko30856491 (Author)

Participating Frequently

Hi Joel,

No, it is the correct JSON. The "Summary of Product Characteristics" is just the document's name and document type, not necessarily what's contained in the doc itself.

Any help appreciated. It is a very complex document, but it is our best test case. We would be converting 2000+ documents per month, and we need 100% accuracy, as it is a regulated space.

Thanks,

Mark
