Participant
July 17, 2024
Question

Extraction: How to position span elements?

Hi,
after extraction, we sometimes get elements like the ones below: a parent with Text (here "NO") and a child that contains part of the text with different font properties (here "2"; combined, the result should be "N2O"). My question is: how can one insert such an inline element at the right position in the text flow?

In the docs (Extract API | How Tos | PDF Extract API | Adobe PDF Services) I see:

  • Text: Text for the element in UTF-8 format, only reported for text elements. When inline elements are reported separately from parent block element, then this value has references to those inline elements.

How can I see such a reference to the inline element? Or is this meant differently?

...
        {
          "Bounds": [
            211.8,
            333.0,
            222.9,
            398.5
          ],
          "Font": {
            "alt_family_name": "Times",
            "embedded": true,
            "encoding": "MacRomanEncoding",
            "family_name": "Times",
            "font_type": "Type1",
            "italic": false,
            "monospaced": false,
            "name": "...+Times-Roman",
            "subset": true,
            "weight": 400
          },
          "HasClip": true,
          "ObjectID": 93,
          "Page": 0,
          "Path": "//Document/Sect/P[15]/Sub[3]",
          "Rotation": 90.0,
          "Text": "NO ",
          "TextSize": 9.2,
          "attributes": {
            "Placement": "Block"
          }
        },
        {
          "Bounds": [
            215.7,
            339.7,
            222.9,
            344.5
          ],
          "Font": {
            "alt_family_name": "Times",
            "embedded": true,
            "encoding": "MacRomanEncoding",
            "family_name": "Times",
            "font_type": "Type1",
            "italic": false,
            "monospaced": false,
            "name": "...+Times-Roman",
            "subset": true,
            "weight": 400
          },
          "HasClip": true,
          "ObjectID": 94,
          "Page": 0,
          "Path": "//Document/Sect/P[15]/Sub[3]/StyleSpan",
          "Rotation": 90.0,
          "Text": "2 ",
          "TextSize": 6.4,
          "attributes": {
            "TextPosition": "Sup"
          }
        }
...
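
A possible first step, sketched in Python: pair each inline child with its parent by comparing Path values. This is not an official Adobe method. It assumes the extracted elements are available as a list of dicts shaped like the excerpt above, and that an inline child's Path is the parent's Path plus a trailing "/StyleSpan" segment (possibly with an index). The function name group_inline_children is made up for illustration.

```python
from collections import defaultdict


def group_inline_children(elements):
    """Map each parent Path to the StyleSpan elements nested directly under it.

    Assumption: an inline child has the Path "<parent Path>/StyleSpan" or
    "<parent Path>/StyleSpan[n]", as in the excerpt above.
    """
    known_paths = {el["Path"] for el in elements if "Path" in el}
    children = defaultdict(list)
    for el in elements:
        # Split off the last path segment, e.g. ".../Sub[3]" and "StyleSpan".
        parent_path, _, tag = el.get("Path", "").rpartition("/")
        if tag.startswith("StyleSpan") and parent_path in known_paths:
            children[parent_path].append(el)
    return dict(children)
```

This only tells you which span belongs to which parent. Where the span goes inside the parent's Text still has to be worked out from the geometry.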

This topic has been closed for replies.

1 reply

Joel Geraci
Community Expert
July 18, 2024

Can you share the input PDF? You might have to turn on Character Boundaries (CharBounds) to get the order right.

Participant
July 19, 2024

Hi, I can only share the PDF confidentially. But CharBounds sounds interesting; how do I enable that?

Participant
July 19, 2024

Oh, wait, that will just give you the boundaries of individual characters? So it's up to the user to merge these spans? If so, that sounds error-prone and like a lot of work.