Hello all! I used the Adobe PDF Extract API service to parse a PDF. The PDF and the output JSON are attached to this message. I believe the JSON output doesn't preserve the table structure. If I pass this data to an LLM, it cannot answer relevant questions about the data because the table structure is lost. How should I go about this? I want to use the Adobe API to build a RAG application. Is there a way to preserve the table structure within the JSON file? For example, I need output like this:
{
  "Input (DC)": {
    "MVPS 4000-S2": null,
    "MVPS 4200-S2": null
  },
  "Available inverters": {
    "MVPS 4000-S2": "1 x SCS 3450 UP or 1 x SCS 3450 UP-XT",
    "MVPS 4200-S2": "1 x SCS 3600 UP or 1 x SCS 3600 UP-XT"
  },
  "Max. input voltage": {
    "MVPS 4000-S2": "1500 V",
    "MVPS 4200-S2": "1500 V"
  },
  "Number of DC inputs": {
    "MVPS 4000-S2": "dependent on the selected inverters",
    "MVPS 4200-S2": null
  },
  "Integrated zone monitoring": {
    "MVPS 4000-S2": "○",
    "MVPS 4200-S2": null
  },
  "Available DC fuse sizes (per input)": {
    "MVPS 4000-S2": "200 A, 250 A, 315 A, 350 A, 400 A, 450 A, 500 A",
    "MVPS 4200-S2": null
  },
I know it can also generate CSV, but the CSV doesn't contain any of the other information that might be present in the PDF.
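One way to get the nested shape shown above is to build it from the CSV that the API exports for each table. The following is only a minimal Python sketch, assuming the first CSV row holds the column names and the first cell of each later row is the row label. The function name `table_csv_to_nested`, the corner label "Technical data", and the sample rows are illustrative assumptions, not actual API output.

```python
import csv
import io
import json


def table_csv_to_nested(csv_text):
    """Turn a table CSV into {row label: {column name: value or None}}.

    Assumes (hypothetically) that row 0 is the header and column 0 holds
    the row labels. Empty cells become None, i.e. JSON null.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return {}
    columns = [name.strip() for name in rows[0][1:]]
    nested = {}
    for row in rows[1:]:
        if not row or not row[0].strip():
            continue  # skip blank lines and rows without a label
        cells = [cell.strip() for cell in row[1:]]
        cells += [""] * (len(columns) - len(cells))  # pad short rows
        nested[row[0].strip()] = {
            col: (cell or None) for col, cell in zip(columns, cells)
        }
    return nested


# Made-up sample that mirrors the table in the question.
sample = (
    "Technical data,MVPS 4000-S2,MVPS 4200-S2\n"
    "Input (DC),,\n"
    "Max. input voltage,1500 V,1500 V\n"
    "Number of DC inputs,dependent on the selected inverters,\n"
)
result = table_csv_to_nested(sample)
print(json.dumps(result, indent=2, ensure_ascii=False))
```

Merged cells show up as empty cells in this approach, so a value that spans both columns appears under the first column only, as in the desired output above.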
I generally post-process the JSON from Extract to create a Markdown file. When I hit a table, I read past it, read in the .csv file as a Markdown table, then continue with the JSON. It works great. I have some Node.js code I can share if you like.
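The approach described above might look roughly like the following Python sketch. It is not the Node.js code mentioned in this thread. It assumes the extracted JSON carries an `elements` list whose items have a `Path`, an optional `Text`, and, for tables, a `filePaths` entry pointing at the exported CSV. The helper names `csv_to_markdown` and `elements_to_markdown`, the `load_csv` callback, and the sample data are all hypothetical.

```python
import csv
import io


def csv_to_markdown(csv_text):
    """Render CSV text as a Markdown pipe table (first row is the header)."""
    rows = [[c.strip() for c in r] for r in csv.reader(io.StringIO(csv_text)) if r]
    if not rows:
        return ""
    width = max(len(r) for r in rows)
    rows = [r + [""] * (width - len(r)) for r in rows]  # pad ragged rows

    def line(cells):
        return "| " + " | ".join(c.replace("|", "\\|") for c in cells) + " |"

    return "\n".join([line(rows[0]), line(["---"] * width)] + [line(r) for r in rows[1:]])


def elements_to_markdown(elements, load_csv):
    """Walk the elements in order; swap each table for its CSV rendered as Markdown."""
    parts = []
    table_path = None  # Path of the table whose cell elements are being skipped
    for el in elements:
        path = el.get("Path", "")
        if table_path and path.startswith(table_path + "/"):
            continue  # "read past" the cells of a table already emitted
        table_path = None
        csv_files = [f for f in el.get("filePaths", []) if f.lower().endswith(".csv")]
        if csv_files:
            parts.append(csv_to_markdown(load_csv(csv_files[0])))
            table_path = path
            continue
        text = el.get("Text", "").strip()
        if not text:
            continue
        tag = path.rsplit("/", 1)[-1]  # e.g. "H1", "P", "P[3]"
        if tag.startswith("H") and tag[1:2].isdigit():
            parts.append("#" * int(tag[1]) + " " + text)
        else:
            parts.append(text)
    return "\n\n".join(parts)


# Made-up sample data; real output would come from the extracted JSON and CSV files.
elements = [
    {"Path": "//Document/H1", "Text": "Technical Data "},
    {"Path": "//Document/Table", "filePaths": ["tables/fileoutpart0.csv"]},
    {"Path": "//Document/Table/TR/TD/P", "Text": "Max. input voltage "},
    {"Path": "//Document/P", "Text": "○ Optional features "},
]
files = {"tables/fileoutpart0.csv": "Parameter,MVPS 4000-S2\nMax. input voltage,1500 V\n"}
markdown = elements_to_markdown(elements, files.__getitem__)
print(markdown)
```

Passing `load_csv` as a callback keeps the sketch independent of where the CSV files live; with files on disk it could simply open and read the given relative path.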
Hi Joel,
Thanks a lot for the reply! Yes, it would really help if you could share your Node.js code.
It's in a private Git repo. If you are comfortable doing so, send me a private message with your GitHub ID and I'll add you as a collaborator. I plan to make it open source once it is past the work-in-progress stage.
Thanks, Joel! I just sent you a private message with my GitHub ID.
Were you able to get the table structure into the JSON? If you have the code, please let me know. Thanks.
Hi Joel,
Were you able to make it open source? If so, can I also get access to the code?
Thanks
I tried converting PDF directly to JSON via the Adobe API. I just wanted to test it because of this mad pricing policy. However, the result wasn't accurate at all. I had complex tables with merged cells. The solution that worked for me was:
1. Convert the PDF to DOCX via, e.g., CloudConvert (they offer much better pricing with credits).
2. Then convert the DOCX to JSON.
That worked perfectly!
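The post does not say which tool handled the DOCX-to-JSON step. As one possible illustration, the Python sketch below reads tables straight from a DOCX file's `word/document.xml` using only the standard library, copying the text of horizontally merged cells (`w:gridSpan`) and vertically merged cells (`w:vMerge`) into every grid position they cover. The function name `docx_tables` and the in-memory sample document are assumptions made for this example.

```python
import io
import json
import xml.etree.ElementTree as ET
import zipfile

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def docx_tables(docx_file):
    """Return every table in a DOCX as a list of rows (lists of cell strings)."""
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    tables = []
    for tbl in root.iter(f"{{{W}}}tbl"):
        grid = []
        for tr in tbl.findall("w:tr", NS):
            row = []
            for tc in tr.findall("w:tc", NS):
                paragraphs = [
                    "".join(t.text or "" for t in p.iter(f"{{{W}}}t"))
                    for p in tc.findall("w:p", NS)
                ]
                text = " ".join(s for s in paragraphs if s).strip()
                span = tc.find("w:tcPr/w:gridSpan", NS)
                width = int(span.get(f"{{{W}}}val")) if span is not None else 1
                vmerge = tc.find("w:tcPr/w:vMerge", NS)
                continued = vmerge is not None and vmerge.get(f"{{{W}}}val") != "restart"
                if continued and grid and len(row) < len(grid[-1]):
                    text = grid[-1][len(row)]  # reuse the value from the cell above
                row.extend([text] * width)  # repeat across merged columns
            grid.append(row)
        tables.append(grid)
    return tables


def cell(text, span=1):
    """Build a minimal table-cell XML fragment for the sample document."""
    props = f'<w:tcPr><w:gridSpan w:val="{span}"/></w:tcPr>' if span > 1 else ""
    return f"<w:tc>{props}<w:p><w:r><w:t>{text}</w:t></w:r></w:p></w:tc>"


# Made-up, stripped-down DOCX held in memory; a real file has more parts.
document_xml = (
    f'<w:document xmlns:w="{W}"><w:body><w:tbl>'
    f"<w:tr>{cell('Input (DC)', span=2)}</w:tr>"
    f"<w:tr>{cell('Max. input voltage')}{cell('1500 V')}</w:tr>"
    "</w:tbl></w:body></w:document>"
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("word/document.xml", document_xml)
buffer.seek(0)

tables = docx_tables(buffer)
print(json.dumps(tables, indent=2))
```

Nested tables are returned as separate entries here, and their text also appears in the enclosing cell, which may or may not be what a given document needs.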