Seeking Solutions: Preserving Table Structure in JSON Output with Adobe PDF Extract API for RAG App

Question

Hello all! I used the Adobe PDF Extract API to parse a PDF. The PDF and the output JSON are attached to this message. I believe the JSON output doesn't preserve the table structure. If I pass this data to an LLM, it cannot answer questions about the data because the table structure is lost. How should I go about this? I want to use the Adobe API to build a RAG application. Is there a way to preserve the table structure within the JSON file? For example, I need output like this:

{
  "Input (DC)": {
    "MVPS 4000-S2": null,
    "MVPS 4200-S2": null
  },
  "Available inverters": {
    "MVPS 4000-S2": "1 x SCS 3450 UP or 1 x SCS 3450 UP-XT",
    "MVPS 4200-S2": "1 x SCS 3600 UP or 1 x SCS 3600 UP-XT"
  },
  "Max. input voltage": {
    "MVPS 4000-S2": "1500 V",
    "MVPS 4200-S2": "1500 V"
  },
  "Number of DC inputs": {
    "MVPS 4000-S2": "dependent on the selected inverters",
    "MVPS 4200-S2": null
  },
  "Integrated zone monitoring": {
    "MVPS 4000-S2": "○",
    "MVPS 4200-S2": null
  },
  "Available DC fuse sizes (per input)": {
    "MVPS 4000-S2": "200 A, 250 A, 315 A, 350 A, 400 A, 450 A, 500 A",
    "MVPS 4200-S2": null
  }
}

I know it can also generate CSV, but the CSV doesn't contain any of the other information that might be present in the PDF.

Joel Geraci · Accepted Answer

I generally post-process the JSON from Extract to create a Markdown file. When I hit a table, I read past it, read in the .csv as a Markdown table, then continue with the JSON. It works great. I have some Node.js code I can share if you like.
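
For readers who want to try this, here is a minimal Python sketch of the approach described above. It is not the Node.js code mentioned in the answer. It assumes the Extract JSON has a top-level "elements" list whose items carry a "Path" (for example "//Document/Sect/Table"), an optional "Text", and, for table elements, a "filePaths" list pointing at the exported CSV; check these names against your own output. The functions csv_to_markdown and elements_to_markdown and the load_csv callback are hypothetical names introduced for illustration.

```python
import csv
import io
import re


def csv_to_markdown(csv_text):
    """Render CSV text as a Markdown pipe table (first row becomes the header)."""
    rows = [r for r in csv.reader(io.StringIO(csv_text)) if r]
    if not rows:
        return ""
    width = max(len(r) for r in rows)

    def fmt(row):
        # Escape pipes and flatten line breaks so each row stays on one line.
        cells = [c.strip().replace("|", "\\|").replace("\n", " ") for c in row]
        cells += [""] * (width - len(cells))
        return "| " + " | ".join(cells) + " |"

    lines = [fmt(rows[0]), "| " + " | ".join(["---"] * width) + " |"]
    lines += [fmt(r) for r in rows[1:]]
    return "\n".join(lines)


def elements_to_markdown(elements, load_csv):
    """Walk Extract elements in order; swap each table for its CSV as Markdown.

    load_csv is a caller-supplied function (an assumption of this sketch) that
    takes a relative path from "filePaths" and returns the CSV file's text.
    """
    parts = []
    for el in elements:
        path = el.get("Path", "")
        segs = [s for s in path.split("/") if s]
        # Index of the first Table segment in the path, if any.
        idx = next((i for i, s in enumerate(segs)
                    if re.fullmatch(r"Table(\[\d+\])?", s)), None)
        if idx is not None:
            # Only the table root references the CSV; rows and cells under it
            # are "read past" because the CSV already contains their text.
            if idx == len(segs) - 1:
                csv_files = [p for p in el.get("filePaths", [])
                             if p.lower().endswith(".csv")]
                if csv_files:
                    parts.append(csv_to_markdown(load_csv(csv_files[0])))
            continue
        text = el.get("Text", "").strip()
        if not text:
            continue
        heading = re.search(r"/H(\d)(\[\d+\])?$", path)
        parts.append("#" * int(heading.group(1)) + " " + text if heading else text)
    return "\n\n".join(parts)
```

A typical call would load the JSON with json.load, pass data["elements"], and give a load_csv that opens the file relative to the unzipped Extract output folder. The resulting Markdown keeps headings, paragraphs, and tables in reading order, which is usually easier to chunk for a RAG pipeline than the raw JSON.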

k_1671 · Answer

I tried converting the PDF directly to JSON via the Adobe API. I just wanted to test it because of this mad pricing policy. However, the result wasn't accurate at all: I had complex tables with merged cells. The solution that worked for me was:

1. Convert the PDF to DOCX via, e.g., CloudConvert (they offer much better pricing with credits).

2. Then convert the DOCX to JSON.

That worked perfectly!
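
The answer does not say which tool did the DOCX-to-JSON step, so the following is only a sketch of one way it could be done, using the Python standard library. A DOCX file is a ZIP archive whose word/document.xml stores tables as w:tbl / w:tr / w:tc elements; horizontally merged cells carry w:gridSpan and vertically merged ones w:vMerge. The function names docx_tables and table_to_records are hypothetical, and the sketch assumes the first row holds the column names and the first column holds the parameter names, as in the example output in the question.

```python
import json
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_tables(docx):
    """Return each table in a DOCX (path or file object) as rows of cell strings.

    Merged cells are expanded: a cell spanning N columns is repeated N times,
    and a vertical-merge continuation cell copies the value from the row above.
    Nested tables also appear as their own entries.
    """
    with zipfile.ZipFile(docx) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    tables = []
    for tbl in root.iter(W + "tbl"):
        rows = []
        for tr in tbl.findall(W + "tr"):
            row = []
            for tc in tr.findall(W + "tc"):
                paragraphs = ("".join(t.text or "" for t in p.iter(W + "t"))
                              for p in tc.iter(W + "p"))
                text = " ".join(paragraphs).strip()
                props = tc.find(W + "tcPr")
                span_el = props.find(W + "gridSpan") if props is not None else None
                span = int(span_el.get(W + "val", "1")) if span_el is not None else 1
                vmerge = props.find(W + "vMerge") if props is not None else None
                is_continuation = vmerge is not None and vmerge.get(W + "val") != "restart"
                if is_continuation and rows and len(row) < len(rows[-1]):
                    text = rows[-1][len(row)]
                row.extend([text] * span)
            rows.append(row)
        tables.append(rows)
    return tables


def table_to_records(rows):
    """Map rows to {parameter: {column name: value or None}}."""
    header, *body = rows
    records = {}
    for r in body:
        if not r:
            continue
        records[r[0]] = {
            name: (r[i] if i < len(r) and r[i] else None)
            for i, name in enumerate(header) if i > 0
        }
    return records


if __name__ == "__main__":
    import sys
    for table in docx_tables(sys.argv[1]):
        if table:
            print(json.dumps(table_to_records(table), ensure_ascii=False, indent=2))
```

Expanding merged cells into repeated values is a design choice for RAG use: every row then answers a question on its own, at the cost of no longer recording that the cell was merged in the source.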
