Participating Frequently
December 5, 2022
Question

Issue with parsing large PDFs and converting JSON to consistently applied HTML

I'm having an issue converting a large PDF (100+ pages) with images and complex tables (PDF and converted HTML attached).

Some paragraphs lead with numbers, and those numbers are being placed in separate divs that overlap the paragraph text they belong to.

There are also large areas of white space where the headers, footers, and page breaks are.

Any help appreciated.

1 reply

Joel Geraci
Community Expert
December 5, 2022

I don't think the structuredData.json belongs with that PDF. The JSON contains a document title "SUMMARY OF PRODUCT CHARACTERISTICS" that does not appear in the document.

Participating Frequently
December 6, 2022

Hi Joel,

No, it is the correct JSON. 'Summary of Product Characteristics' is just the document name and document type, not necessarily text that appears in the document itself.

Any help appreciated. It is a very complex document, but it is our best test case. We would be converting 2000+ documents per month, and we need 100% accuracy, as this is a regulated space.

Thanks,

Mark