Extraction: How to position span elements?

Report · Jul 17, 2024

Hi,
after extraction, we sometimes get elements like the ones below: a parent with Text (here "NO ") and a child that holds part of the text with different font properties (here "2 "; combined, the result should be "N2O"). My question is: how can one insert such an inline element at the right position in the text flow?

In the docs (Extract API | How Tos | PDF Extract API | Adobe PDF Services) I see:

Text: Text for the element in UTF-8 format, only reported for text elements. When inline elements are reported separately from parent block element, then this value has references to those inline elements.

How can I see such a reference to the inline element? Or is this meant differently?

...
        {
          "Bounds": [
            211.8,
            333.0,
            222.9,
            398.5
          ],
          "Font": {
            "alt_family_name": "Times",
            "embedded": true,
            "encoding": "MacRomanEncoding",
            "family_name": "Times",
            "font_type": "Type1",
            "italic": false,
            "monospaced": false,
            "name": "...+Times-Roman",
            "subset": true,
            "weight": 400
          },
          "HasClip": true,
          "ObjectID": 93,
          "Page": 0,
          "Path": "//Document/Sect/P[15]/Sub[3]",
          "Rotation": 90.0,
          "Text": "NO ",
          "TextSize": 9.2,
          "attributes": {
            "Placement": "Block"
          }
        },
        {
          "Bounds": [
            215.7,
            339.7,
            222.9,
            344.5
          ],
          "Font": {
            "alt_family_name": "Times",
            "embedded": true,
            "encoding": "MacRomanEncoding",
            "family_name": "Times",
            "font_type": "Type1",
            "italic": false,
            "monospaced": false,
            "name": "...+Times-Roman",
            "subset": true,
            "weight": 400
          },
          "HasClip": true,
          "ObjectID": 94,
          "Page": 0,
          "Path": "//Document/Sect/P[15]/Sub[3]/StyleSpan",
          "Rotation": 90.0,
          "Text": "2 ",
          "TextSize": 6.4,
          "attributes": {
            "TextPosition": "Sup"
          }
        }
...
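For orientation, the only link between the two elements above that is visible in the output is the Path: the child's Path is the parent's Path plus "/StyleSpan". A minimal Python sketch of pairing them on that basis follows. It assumes the extracted elements are available as a list of dicts; the function name group_inline_children is made up for illustration, and this is not a documented reference mechanism.

```python
from collections import defaultdict

def group_inline_children(elements):
    """Map each parent Path to the StyleSpan elements directly below it."""
    by_path = {el["Path"]: el for el in elements}
    children = defaultdict(list)
    for el in elements:
        # Split ".../Sub[3]/StyleSpan" into the parent Path and the last segment.
        parent_path, _, leaf = el["Path"].rpartition("/")
        if leaf.startswith("StyleSpan") and parent_path in by_path:
            children[parent_path].append(el)
    return dict(children)

# The two elements from the excerpt, reduced to the fields used here.
elements = [
    {"Path": "//Document/Sect/P[15]/Sub[3]", "Text": "NO "},
    {"Path": "//Document/Sect/P[15]/Sub[3]/StyleSpan", "Text": "2 "},
]
pairs = group_inline_children(elements)
print(pairs)
```

This only says which child belongs to which parent, not where in the parent's text it goes.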

Report · Jul 18, 2024

Can you share the input PDF? You might have to turn on Character Boundaries (CharBounds) to get the order right.

Report · Jul 19, 2024

Hi, I can only share the PDF confidentially. But CharBounds sounds interesting. How do I enable that?

Report · Jul 19, 2024

Oh, wait, that will just give you the boundaries of individual characters? So it's on the user side to merge these spans? If so, that sounds error-prone and like a lot of work.

Report · Jul 19, 2024

In the REST API (https://developer.adobe.com/document-services/docs/apis/#tag/Extract-PDF/operation/pdfoperations.ext...), the option is getCharBounds.
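As a rough illustration, the request body for an extract job with that option switched on might be built like this in Python. Only getCharBounds comes from this thread; the other field names (assetID, elementsToExtract) and the placeholder asset ID are assumptions that should be checked against the API reference before use.

```python
import json

def build_extract_request(asset_id, char_bounds=True):
    """Build a JSON-serializable body for an extract job (field names partly assumed)."""
    return {
        "assetID": asset_id,             # assumed field name for the uploaded PDF
        "getCharBounds": char_bounds,    # the option named in this thread
        "elementsToExtract": ["text"],   # assumed field name and value
    }

body = build_extract_request("urn:example:placeholder-asset-id")
print(json.dumps(body, indent=2))
```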

Report · Jul 19, 2024

You can send it as a direct message to me. I'll just "look" at it. I never actually read these things. I just look at the words and their positions on the page.

Report · Jul 22, 2024

Okay, so, it'll be custom heuristics then to merge the text. Not sure if it's worth it for us then, but thanks for the response! If there is any other simple option, where I don't have to write my own heuristics, I'd be interested.

Report · Jul 22, 2024

I'm afraid that's the only way right now. I noticed that there are superscripts and subscripts in the area you posted. We read the text from top to bottom, so depending on exactly how far the text is raised or lowered, we can get the order wrong. To get it right, you have to "look" at it the way the AI would: superscripts come first, then the normal text, then the subscripts. We are currently working on better "formula" recognition for exactly this reason.

Until then, if superscripts and subscripts are the only occurrences of this issue, you won't have to examine every bit of text, just the parts with these characteristics. It might be worth it at that point. If I can get even a page of the original content, I can probably write some sample code. I love those little puzzles. You can direct message the file to me.
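In the meantime, here is a minimal Python sketch of what such a merge heuristic could look like. It rests on several assumptions: that with character boundaries enabled the parent element carries a "CharBounds" list with one [x0, y0, x1, y1] box per character of its Text, that a Rotation of 90 or 270 means the text advances along the y axis, and that the child's trailing space can be dropped. The Bounds values are the ones posted above; the CharBounds values are invented for illustration, and merge_inline is a made-up name.

```python
def axis_for(rotation):
    # Assumption: at 90 or 270 degrees the text advances along y, otherwise along x.
    return 1 if round(rotation) % 180 == 90 else 0

def center(bounds, axis):
    # Bounds are [x0, y0, x1, y1]; take the midpoint along the chosen axis.
    return (bounds[axis] + bounds[axis + 2]) / 2.0

def merge_inline(parent, child):
    """Insert the child's text into the parent's text at its geometric position."""
    axis = axis_for(parent.get("Rotation", 0.0))
    centers = [center(cb, axis) for cb in parent["CharBounds"]]
    # Infer the advance direction from the first and last character of the parent.
    direction = -1.0 if len(centers) > 1 and centers[-1] < centers[0] else 1.0
    child_pos = center(child["Bounds"], axis)
    # Count the parent characters that lie before the child along the text axis.
    index = sum(1 for c in centers if (c - child_pos) * direction < 0)
    text = parent["Text"]
    return text[:index] + child["Text"].strip() + text[index:]

parent = {
    "Text": "NO ",
    "Rotation": 90.0,
    "Bounds": [211.8, 333.0, 222.9, 398.5],
    # Hypothetical per-character boxes for "N", "O" and " " (not from the real file).
    "CharBounds": [
        [211.8, 333.0, 222.9, 339.6],
        [211.8, 344.6, 222.9, 351.2],
        [211.8, 351.2, 222.9, 353.5],
    ],
}
child = {"Text": "2 ", "Rotation": 90.0, "Bounds": [215.7, 339.7, 222.9, 344.5]}

merged = merge_inline(parent, child)
print(repr(merged))  # 'N2O '
```

Comparing character centers rather than edges keeps the result stable when the boxes overlap slightly. Only elements that have a "TextPosition" attribute and their parents would need to go through this.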