Hi,
After extraction, we sometimes get elements like the ones below: a parent with Text (here "NO") and a child that contains part of the text with different font properties (here "2"; combined, the result should be "N2O"). My question: how can one merge such an inline element into the text flow at the right position?
In the docs (Extract API | How Tos | PDF Extract API | Adobe PDF Services) I see:
Text: Text for the element in UTF-8 format, only reported for text elements. When inline elements are reported separately from parent block element, then this value has references to those inline elements.
How can I see such a reference to the inline element? Or is this meant differently?
...
{
  "Bounds": [
    211.8,
    333.0,
    222.9,
    398.5
  ],
  "Font": {
    "alt_family_name": "Times",
    "embedded": true,
    "encoding": "MacRomanEncoding",
    "family_name": "Times",
    "font_type": "Type1",
    "italic": false,
    "monospaced": false,
    "name": "...+Times-Roman",
    "subset": true,
    "weight": 400
  },
  "HasClip": true,
  "ObjectID": 93,
  "Page": 0,
  "Path": "//Document/Sect/P[15]/Sub[3]",
  "Rotation": 90.0,
  "Text": "NO ",
  "TextSize": 9.2,
  "attributes": {
    "Placement": "Block"
  }
},
{
  "Bounds": [
    215.7,
    339.7,
    222.9,
    344.5
  ],
  "Font": {
    "alt_family_name": "Times",
    "embedded": true,
    "encoding": "MacRomanEncoding",
    "family_name": "Times",
    "font_type": "Type1",
    "italic": false,
    "monospaced": false,
    "name": "...+Times-Roman",
    "subset": true,
    "weight": 400
  },
  "HasClip": true,
  "ObjectID": 94,
  "Page": 0,
  "Path": "//Document/Sect/P[15]/Sub[3]/StyleSpan",
  "Rotation": 90.0,
  "Text": "2 ",
  "TextSize": 6.4,
  "attributes": {
    "TextPosition": "Sup"
  }
}
...
Can you share the input PDF? You might have to turn on Character Boundaries (CharBounds) to get the order right.
Hi, I can only share the PDF confidentially. But CharBounds sounds interesting. How do I enable it?
Oh, wait, that will just give you the boundaries of individual characters? So it's on the user's side to merge these spans? If so, that sounds error-prone and like a lot of work.
In the REST API (https://developer.adobe.com/document-services/docs/apis/#tag/Extract-PDF/operation/pdfoperations.ext...), the option is called getCharBounds.
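As a rough illustration of where that flag would go, here is a minimal sketch that only builds a JSON request body in Python. Only `getCharBounds` is named in this thread; the other field names (`assetID`, `elementsToExtract`) and the helper `build_extract_params` are assumptions made for illustration and should be checked against the REST API reference before use.

```python
import json


def build_extract_params(asset_id: str, char_bounds: bool = True) -> dict:
    """Build a request body for an extract job (hypothetical helper).

    "getCharBounds" is the option mentioned in this thread. "assetID" and
    "elementsToExtract" are assumed field names, not confirmed here.
    """
    return {
        "assetID": asset_id,
        "elementsToExtract": ["text"],
        "getCharBounds": char_bounds,
    }


if __name__ == "__main__":
    # "<your-asset-id>" is a placeholder, not a real asset identifier.
    print(json.dumps(build_extract_params("<your-asset-id>"), indent=2))
```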
You can send it as a direct message to me. I'll just "look" at it. I never actually read these things. I just look at the words and their positions on the page.
Okay, so it will take custom heuristics to merge the text. I'm not sure that's worth it for us, but thanks for the response! If there is a simpler option where I don't have to write my own heuristics, I'd be interested.
I'm afraid that's the only way right now. I noticed that there are superscripts and subscripts in the area you posted. We read the text from top to bottom, so depending on exactly how far the text is raised or lowered, we can get the order wrong unless you "look" at it the way the AI would, where superscripts come first, then the normal text, then the subscripts. We are currently working on better "formula" recognition for exactly this reason.
Until then, if superscripts and subscripts are the only occurrences of this issue, you won't have to examine every bit of text, just the parts with these characteristics. It might be worth it at that point. If I can get even a page of the original content, I can probably write some sample code. I love those little puzzles. You can direct message the file to me.
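For readers who want a starting point, here is a minimal sketch of such a heuristic in Python. It is not the sample code offered above. It assumes that each text element carries a `CharBounds` list of per-character boxes `[x0, y0, x1, y1]` when character bounds are enabled, that coordinates increase along the reading direction, and that subscripts are tagged `"Sub"` in the same way the posted element is tagged `"Sup"`. The function names and the per-character boxes in the example are made up for illustration.

```python
from typing import Dict, List, Sequence, Tuple


def flow_axis(rotation: float) -> int:
    """Return 1 (y axis) for text rotated by 90 or 270 degrees, else 0 (x axis)."""
    return 1 if round(rotation) % 180 == 90 else 0


def merge_inline(
    parent_text: str,
    parent_char_bounds: Sequence[Sequence[float]],
    child_text: str,
    child_bounds: Sequence[float],
    rotation: float = 0.0,
) -> str:
    """Insert child_text into parent_text at the position implied by geometry.

    Assumption: coordinates grow along the reading direction, so the insert
    index is the number of parent characters that start before the child.
    """
    axis = flow_axis(rotation)
    child_start = child_bounds[axis]
    index = sum(1 for box in parent_char_bounds if box[axis] < child_start)
    return parent_text[:index] + child_text.strip() + parent_text[index:]


def inline_pairs(elements: List[Dict]) -> List[Tuple[Dict, Dict]]:
    """Pair each super/subscript element with the element at its parent Path."""
    by_path = {e["Path"]: e for e in elements}
    pairs = []
    for e in elements:
        position = e.get("attributes", {}).get("TextPosition")
        parent_path = e["Path"].rsplit("/", 1)[0]
        if position in ("Sup", "Sub") and parent_path in by_path:
            pairs.append((by_path[parent_path], e))
    return pairs


# Example using the two elements from the question. The CharBounds values
# are hypothetical; the real output was not posted in this thread.
elements = [
    {
        "Path": "//Document/Sect/P[15]/Sub[3]",
        "Rotation": 90.0,
        "Text": "NO ",
        "CharBounds": [
            [211.8, 333.0, 222.9, 339.7],  # "N"
            [211.8, 344.5, 222.9, 351.1],  # "O"
            [211.8, 351.1, 222.9, 353.4],  # " "
        ],
    },
    {
        "Path": "//Document/Sect/P[15]/Sub[3]/StyleSpan",
        "Rotation": 90.0,
        "Text": "2 ",
        "Bounds": [215.7, 339.7, 222.9, 344.5],
        "attributes": {"TextPosition": "Sup"},
    },
]

merged = [
    merge_inline(p["Text"], p["CharBounds"], c["Text"], c["Bounds"], p["Rotation"])
    for p, c in inline_pairs(elements)
]
```

Filtering on `TextPosition` keeps the check limited to the elements with these characteristics, so the rest of the text is left untouched. Text that runs in the opposite direction (for example rotated by 180 or 270 degrees) would need the comparison reversed.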