Issue with parsing large PDFs and converting JSON to consistently applied HTML

Community Beginner, Dec 05, 2022

I'm having issues converting a large PDF (100+ pages, with images and complex tables) to HTML. The PDF and the converted HTML are attached.

Some paragraphs begin with numbers. Those numbers are being placed in separate divs, which overlap the paragraph text they belong to.

There are also large white spaces where headers, footers, and page breaks occur.

Any help appreciated.

TOPICS
PDF Extract API
Translate
Report
Community guidelines
Be kind and respectful, give credit to the original source of content, and search for duplicates before posting. Learn more
community guidelines
Community Beginner, Dec 05, 2022

I've attached the PDF and the output JSON.

Community Expert, Dec 05, 2022

I don't think the structuredData.json belongs with that PDF. The JSON contains a document title "SUMMARY OF PRODUCT CHARACTERISTICS" that does not appear in the document.

Community Beginner, Dec 06, 2022

Hi Joel,

No, it is the correct JSON. 'Summary of Product Characteristics' is just the document name and document type, not necessarily what's contained in the doc itself.

[Image: marko30856491_0-1670320973310.png]

Any help appreciated. It is a very complex document, but it is our best test case. We would be converting 2000+ documents per month, and we need 100% accuracy, as it is a regulated space.

Thanks,

Mark

Community Expert, Dec 06, 2022

Extract doesn't pull metadata from the document. The text "SUMMARY OF PRODUCT CHARACTERISTICS" does not occur on any page in the PDF you uploaded to this thread. That means that the JSON does not belong to that file. I ran your document through Extract and have attached the proper JSON to this response. I've also attached a PDF containing annotations that "visualize" what Extract found. It looks fine to me. How exactly are you interpreting the JSON to create HTML? That might be where the issue is.
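
The check described above (does a given phrase occur anywhere in the extracted output?) can be sketched in Python. This is only an illustration, assuming the JSON has a top-level "elements" list whose entries may carry "Text", "Page" and "Path" keys; the function names and the sample data are hypothetical.

```python
import json


def load_structured(path):
    """Load a structuredData.json file from disk."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def find_text(structured, phrase):
    """Return (page, path) for every element whose Text contains phrase.

    The comparison is case-insensitive. Elements without a Text key
    (figures, table containers) are skipped.
    """
    needle = phrase.casefold()
    hits = []
    for el in structured.get("elements", []):
        text = el.get("Text", "")
        if needle in text.casefold():
            hits.append((el.get("Page"), el.get("Path")))
    return hits


# Hypothetical sample standing in for a real structuredData.json.
sample = {
    "elements": [
        {"Path": "//Document/Title", "Page": 0,
         "Text": "SUMMARY OF PRODUCT CHARACTERISTICS "},
        {"Path": "//Document/P", "Page": 0, "Text": "1. Name of the product "},
        {"Path": "//Document/Figure", "Page": 1},
    ]
}
```

If `find_text` returns hits for a phrase that appears on no page of the PDF, the JSON most likely came from a different file.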

Community Beginner, Dec 06, 2022

Hi Joel,

You are right, my problem is in the interpretation of the JSON. In your example I see that you were able to render page 108 correctly, but in my case it looks like this:

[Image: Screenshot 2022-12-06 at 3.05.14 PM.png]

I'm using PHP to generate the HTML:

[Image: Screenshot 2022-12-06 at 3.03.36 PM.png]

Can you give me some guidance on the coordinates, and on whether I should create any elements other than "div"?
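
On the coordinates and element question, here is a minimal sketch in Python (not the PHP from the screenshot, and not Adobe's own method). It assumes each element has a "Path" such as `//Document/L/LI[2]/Lbl`, an optional "Text", and "Bounds" given as [left, bottom, right, top] in PDF points measured from the bottom-left corner of the page. Those field meanings are assumptions to verify against the Extract output documentation, and every helper name is hypothetical. It shows two ideas: flipping the y axis for CSS, and choosing the HTML tag from the last Path segment while merging a list label into the text that follows, so leading numbers do not become separate boxes.

```python
import html
import re

# Hypothetical mapping from the last Path segment to an HTML tag.
TAG_FOR_LEAF = {
    "Title": "h1", "H1": "h1", "H2": "h2", "H3": "h3",
    "P": "p", "LBody": "p",
}


def bounds_to_css(bounds, page_height):
    """Convert [left, bottom, right, top] (origin bottom-left) to a CSS box
    with its origin at the top-left. Values stay in PDF points."""
    left, bottom, right, top = bounds
    return {
        "left": left,
        "top": page_height - top,
        "width": right - left,
        "height": top - bottom,
    }


def leaf_name(path):
    """Return the last Path segment without its index, e.g. 'LI[2]' -> 'LI'."""
    return re.sub(r"\[\d+\]$", "", path.rsplit("/", 1)[-1])


def render_flow(elements):
    """Emit semantic tags in reading order instead of positioned divs.

    A list label (Lbl) is held back and prepended to the next text
    element so the number and its paragraph share one block.
    """
    parts = []
    pending_label = ""
    for el in elements:
        text = el.get("Text")
        if not text:
            continue  # figures and containers carry no Text
        name = leaf_name(el["Path"])
        if name == "Lbl":
            pending_label = text.strip() + " "
            continue
        tag = TAG_FOR_LEAF.get(name, "div")
        body = html.escape(pending_label + text.strip())
        parts.append(f"<{tag}>{body}</{tag}>")
        pending_label = ""
    return "\n".join(parts)
```

Letting the browser flow the text this way avoids both the overlap and the blank bands that absolute positioning leaves at page boundaries. `bounds_to_css` is only needed if a fixed layout is required. Tables are not handled in this sketch.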

Community Beginner, Dec 07, 2022

Hi @Joel Geraci, have you seen the above? Just wondering if you have any suggestions. Thanks for the help.

Mark

Community Beginner, Dec 12, 2022

Hey @Joel Geraci, just checking on the comments you made above. Can you provide guidance, or is there another user in the community who could if you are too busy?

Thanks

Mark
