I got some attributes in a response object that appear to be undocumented. I have a 100-page PDF, and the majority of pages have their text extracted. The page in question, though, is instead extracted as an image. The JSON in question looks like:
{
  "Bounds": [
    79.85000610351563,
    108.1837158203125,
    546.1023712158203,
    714.25
  ],
  "Page": 80,
  "Path": "//Document/Figure[3]",
  "attributes": {
    "BBox": [
      80.03669999999693,
      108.38799999999901,
      530.0179999999818,
      714.002999999997
    ],
    "Placement": "Block",
    "Suspicion": "{ \"suspicious\" : true, \"reason\" : \"complexTable\" }",
    "SuspicionFBName": "region-complexTable",
    "Suspicious": true
  },
  "filePaths": [
    "figures/fileoutpart18.png"
  ]
}
Can anyone help me understand what the "Suspicion", "SuspicionFBName", and "Suspicious" attributes mean?
The data is formatted in a rough table for this section, but the table spans several pages and this is the only page to be extracted as an image. If I open the PDF in Reader I can select the text on that page just fine, and it does not present any obvious difference from the pages around it.
Can you share the PDF in question?
Unfortunately I cannot share the exact PDF generating this output.
Would sharing it privately be an option?
FYI, we are looking into this.
Hey, sorry about the delayed response.
I unfortunately do not have the option to share the source PDF. I know that makes the diagnosis difficult and makes it fall mostly on my head. I'm also not actually certain how the PDF itself was generated.
Was hoping that knowing what the above properties mean would help me diagnose what is up w/ the file. At the very least I know I can add a check on the returned JSON to watch for those properties and flag a potential mis-extraction.
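A minimal sketch of that kind of check, in Python. It assumes the extraction output has already been loaded into a list of element dicts shaped like the one posted above. The function name `find_suspicious_elements` and the keys of the returned dicts are made up for illustration. Since these attributes are undocumented, treat this as a heuristic rather than a supported contract.

```python
import json


def find_suspicious_elements(elements):
    """Collect elements that carry the undocumented suspicion attributes.

    `elements` is assumed to be a list of dicts shaped like the element
    in the original post (keys such as "Page", "Path", "attributes").
    """
    flagged = []
    for element in elements:
        attrs = element.get("attributes") or {}
        # Skip elements that show neither the flag nor the detail string.
        if not attrs.get("Suspicious") and "Suspicion" not in attrs:
            continue

        # "Suspicion" arrives as a JSON document encoded inside a string,
        # so it needs a second parse. Be tolerant if the shape differs.
        reason = None
        raw = attrs.get("Suspicion")
        if isinstance(raw, str):
            try:
                reason = json.loads(raw).get("reason")
            except (ValueError, AttributeError):
                reason = None

        flagged.append({
            "page": element.get("Page"),
            "path": element.get("Path"),
            "reason": reason,
            "file_paths": element.get("filePaths", []),
        })
    return flagged


if __name__ == "__main__":
    # Trimmed copy of the element from the post, plus a hypothetical
    # ordinary text element for contrast.
    sample = [
        {
            "Page": 80,
            "Path": "//Document/Figure[3]",
            "attributes": {
                "Placement": "Block",
                "Suspicion": "{ \"suspicious\" : true, \"reason\" : \"complexTable\" }",
                "SuspicionFBName": "region-complexTable",
                "Suspicious": True,
            },
            "filePaths": ["figures/fileoutpart18.png"],
        },
        {"Page": 81, "Path": "//Document/P[12]", "attributes": {}},
    ]
    for hit in find_suspicious_elements(sample):
        print(hit)
```

Each hit carries the page number and the image path, so the flagged pages can be routed to a manual review or a fallback extraction step.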
As an FYI, we are still digging into this ourselves. It is our goal to ensure each and every change is documented properly. Not only is this not documented, it's not in the JSON schema. So this is a high priority thing for us.
Ok, as I suspected, this is a bug, and the fields you saw should be removed when the bug is fixed. Basically, nothing to see here, move along, etc etc. 😉 Thank you for bringing this up though!
Alrighty, sounds good. So once things are patched up we can expect that the page would extract correctly and not get detected as an image? Or just that these attributes would not be present in the returned JSON?
Ok, I was focused on the "unknown attributes" part. That's logged. I have another thread on the forum here about "page gets extracted as image, not text"; that's also a known bug. That would be a _separate_ issue.