Help Needed: Identifying and Processing Main Table Elements in JSON

Report · Jun 13, 2024

Hello everyone,

I'm currently working on a project where I need to process a JSON file that represents the structure of a PDF document. The file contains various elements, some of which are tables. However, in addition to each table element, the JSON contains a separate element for every piece of text inside the table, which makes it hard to tell the main table structures apart from their contents.

Here's an example snippet from the JSON data:

{
  "elements": [
    {
      "Bounds": [56.69189453125, 40.66029357910156, 551.7269134521484, 673.7546997070312],
      "ObjectID": 109,
      "Page": 1,
      "Path": "//Document/Sect[4]/Table",
      "attributes": {
        "BBox": [42.47829999999885, 40.922599999999875, 553.0569999999716, 679.8969999999972],
        "NumCol": 3,
        "NumRow": 58,
        "Placement": "Block",
        "SpaceAfter": 18
      },
      "filePaths": ["tables/fileoutpart0.csv", "tables/fileoutpart1.png"]
    },
    {
      "Bounds": [56.692901611328125, 661.3946990966797, 104.927001953125, 673.7546997070312],
      "Font": {
        "alt_family_name": "SMA Futura Global",
        "embedded": true,
        "encoding": "WinAnsiEncoding",
        "family_name": "SMA Futura Global",
        "font_type": "TrueType",
        "italic": false,
        "monospaced": false,
        "name": "GSXDMC+SMAFuturaGlobal-DemiBold",
        "subset": true,
        "weight": 600
      },
      "Lang": "en",
      "ObjectID": 1572,
      "Page": 1,
      "Path": "//Document/Sect[4]/Table/TR/TH/P",
      "Text": "Technical Data",
      "TextSize": 7.5,
      "attributes": {"LineHeight": 9}
    }
    // More elements...
  ]
}

As you can see, the JSON includes both main table elements (e.g., //Document/Sect[4]/Table) and individual text elements within the table (e.g., //Document/Sect[4]/Table/TR/TH/P).
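To make the distinction concrete, here is a minimal Python sketch. It assumes that a main table's Path always ends in "Table" or "Table[n]" and that anything deeper (TR, TH, P, and so on) is table content; that rule is inferred only from the snippet above. The names MAIN_TABLE_RE, is_main_table, and main_tables are hypothetical helpers, not part of any library.

```python
import re

# Assumption: a main table's Path ends in "Table" or "Table[n]".
# Paths that continue past that segment (".../Table/TR/TH/P") are contents.
MAIN_TABLE_RE = re.compile(r"/Table(\[\d+\])?$")


def is_main_table(element: dict) -> bool:
    """True if the element's Path points at a table itself, not at its contents."""
    return bool(MAIN_TABLE_RE.search(element.get("Path", "")))


def main_tables(elements: list) -> list:
    """Keep only the elements that look like main table structures."""
    return [el for el in elements if is_main_table(el)]


# Two elements shaped like the ones in the snippet above (fields trimmed).
sample = [
    {"ObjectID": 109, "Path": "//Document/Sect[4]/Table"},
    {"ObjectID": 1572, "Path": "//Document/Sect[4]/Table/TR/TH/P"},
]
table_ids = [el["ObjectID"] for el in main_tables(sample)]
```

Note that a table nested inside another table's cell would also match this pattern, so it may need an extra check if nested tables should be treated as content.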

Objective: I need to identify and process only the main table elements to replace them with corresponding data from Excel files. The goal is to skip the individual text elements within the tables and focus on the main table structures.

Request: I would appreciate any advice or help on:

Refining the approach to accurately identify main table elements.
Best practices for processing these main table elements.

Thank you in advance for your help!

Best regards,

Amith

Report · Jun 13, 2024

I don't use the JSON Path to understand tables. I read the JSON until I reach a table element, then I switch over to reading the .xlsx file. I export tables as .xlsx because, unlike .csv, it retains merged cells. I process the .xlsx and then skip over the remaining table elements until I'm back to regular paragraphs.
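A rough Python sketch of that read-then-skip loop follows. It assumes the elements appear in document order, that a table's contents share the table's Path as a prefix, and that the export lists an .xlsx file in filePaths. The helper names (TABLE_RE, xlsx_for, iter_top_level) and the sample .xlsx path are made up for illustration, and actually reading the workbook is left out.

```python
import re
from typing import Iterator, Optional, Tuple

# Assumption: a table element's Path ends in "Table" or "Table[n]".
TABLE_RE = re.compile(r"/Table(\[\d+\])?$")


def xlsx_for(table: dict) -> Optional[str]:
    """Return the first .xlsx entry in filePaths, or None if there is none."""
    for file_path in table.get("filePaths", []):
        if file_path.lower().endswith(".xlsx"):
            return file_path
    return None


def iter_top_level(elements: list) -> Iterator[Tuple[str, dict]]:
    """Yield (kind, element) in order, skipping everything inside a table."""
    table_path = None
    for el in elements:
        path = el.get("Path", "")
        # Still inside the current table: its contents extend the table's Path.
        if table_path is not None and path.startswith(table_path + "/"):
            continue
        table_path = None
        if TABLE_RE.search(path):
            table_path = path
            yield "table", el  # caller switches to the .xlsx here
        else:
            yield "other", el


# Hypothetical input; the .xlsx file name is an assumption, not from the post.
sample = [
    {"Path": "//Document/Sect[4]/P", "Text": "Intro"},
    {"Path": "//Document/Sect[4]/Table", "filePaths": ["tables/fileoutpart0.xlsx"]},
    {"Path": "//Document/Sect[4]/Table/TR/TH/P", "Text": "Technical Data"},
    {"Path": "//Document/Sect[4]/P[2]", "Text": "After the table"},
]
kinds = [kind for kind, _ in iter_top_level(sample)]
```

Because the skip is based on the Path prefix, a table nested inside another table's cell is skipped along with the rest of the outer table's contents.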
