
Undocumented Attributes

Community Beginner, Apr 27, 2023

I got some attributes in a response object that appear to be undocumented. I have a 100-page PDF, and the majority of pages have their text extracted. The page in question, though, is instead extracted as an image. The JSON in question looks like:

{
			"Bounds": [
				79.85000610351563,
				108.1837158203125,
				546.1023712158203,
				714.25
			],
			"Page": 80,
			"Path": "//Document/Figure[3]",
			"attributes": {
				"BBox": [
					80.03669999999693,
					108.38799999999901,
					530.0179999999818,
					714.002999999997
				],
				"Placement": "Block",
				"Suspicion": "{ \"suspicious\" : true, \"reason\" : \"complexTable\" }",
				"SuspicionFBName": "region-complexTable",
				"Suspicious": true
			},
			"filePaths": [
				"figures/fileoutpart18.png"
			]
		}

Can anyone help me understand what the "Suspicion", "SuspicionFBName", and "Suspicious" attributes mean?

 

The data is formatted in a rough table for this section, but the table spans several pages and this is the only page to be extracted as an image. If I open the PDF in Reader I can select the text on that page just fine, and it does not present any obvious difference from the pages around it.

TOPICS
PDF Extract API , REST APIs

Community Expert, Apr 27, 2023

Can you share the PDF in question?

Community Beginner, Apr 27, 2023

Unfortunately I cannot share the exact PDF generating this output.

Adobe Employee, Apr 28, 2023

Would sharing it privately be an option?

 

FYI, we are looking into this.

Community Beginner, May 04, 2023

Hey, sorry about the delayed response.

Unfortunately I do not have the option to share the source PDF. I know that makes the diagnosis difficult and puts most of it on me. I'm also not certain how the PDF itself was generated.

 

I was hoping that knowing what the above properties mean would help me diagnose what is up with the file. At the very least, I know I can add a check on the returned JSON to watch for those properties and flag a potential mis-extraction.
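
A minimal sketch of such a check, in Python. It assumes the extraction result has already been loaded into a list of element dicts shaped like the one posted above; the helper name `find_flagged_elements` and the second sample element are hypothetical. It only reports the presence of the three attributes and decodes the JSON string stored in "Suspicion"; it does not interpret what they mean, since that is undocumented.

```python
import json

# The three undocumented attribute names seen in the response above.
SUSPICION_KEYS = ("Suspicion", "SuspicionFBName", "Suspicious")


def find_flagged_elements(elements):
    """Return (page, path, reason) for every element carrying any of the attributes."""
    flagged = []
    for element in elements:
        attributes = element.get("attributes") or {}
        if not any(key in attributes for key in SUSPICION_KEYS):
            continue
        reason = None
        raw = attributes.get("Suspicion")
        # In the posted output, "Suspicion" is a JSON document stored as a string,
        # so it needs a second decode. Tolerate anything unexpected.
        if isinstance(raw, str):
            try:
                decoded = json.loads(raw)
            except json.JSONDecodeError:
                decoded = None
            if isinstance(decoded, dict):
                reason = decoded.get("reason")
        flagged.append((element.get("Page"), element.get("Path"), reason))
    return flagged


# First element is abridged from the post; the second is a made-up unflagged one.
sample = [
    {
        "Page": 80,
        "Path": "//Document/Figure[3]",
        "attributes": {
            "Placement": "Block",
            "Suspicion": '{ "suspicious" : true, "reason" : "complexTable" }',
            "SuspicionFBName": "region-complexTable",
            "Suspicious": True,
        },
        "filePaths": ["figures/fileoutpart18.png"],
    },
    {"Page": 80, "Path": "//Document/P[12]", "attributes": {}},
]

for page, path, reason in find_flagged_elements(sample):
    print(f"Page {page}: {path} carries suspicion attributes (reason: {reason})")
```

Matching on the attribute names rather than on `"Suspicious": true` keeps the check working if only one of the three fields is present.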

Adobe Employee, May 04, 2023

As an FYI, we are still digging into this ourselves. It is our goal to ensure each and every change is documented properly. Not only is this not documented, it's not in the JSON schema. So this is a high priority thing for us. 

Adobe Employee, May 05, 2023

Ok, as I suspected, this is a bug, and the fields you saw should be removed when the bug is fixed. Basically, nothing to see here, move along, etc etc. 😉 Thank you for bringing this up though!

Community Beginner, May 05, 2023

Alrighty, sounds good. So once things are patched up we can expect that the page would extract correctly and not get detected as an image? Or just that these attributes would not be present in the returned JSON?

Adobe Employee, May 05, 2023
LATEST

Ok, I was focused on the "unknown attributes" part. That's logged. I have another thread on the forum here about "page gets extracted as image, not text", and that's also a known bug. That would be a _separate_ issue.
