
Extraction: How to position span elements?

New Here, Jul 17, 2024

Hi,
after extraction, we sometimes get elements like the ones below: a parent with text (here "NO") and a child that contains part of the text with different font properties (here "2"; combined, the text should read "N2O"). My question is: how can such an inline element be merged back at the right position in the text flow?

In the docs (Extract API | How Tos | PDF Extract API | Adobe PDF Services) I see:

  • Text: Text for the element in UTF-8 format, only reported for text elements. When inline elements are reported separately from parent block element, then this value has references to those inline elements.

How can I see such a reference to the inline element? Or is this meant differently?

 

 

...
        {
          "Bounds": [
            211.8,
            333.0,
            222.9,
            398.5
          ],
          "Font": {
            "alt_family_name": "Times",
            "embedded": true,
            "encoding": "MacRomanEncoding",
            "family_name": "Times",
            "font_type": "Type1",
            "italic": false,
            "monospaced": false,
            "name": "...+Times-Roman",
            "subset": true,
            "weight": 400
          },
          "HasClip": true,
          "ObjectID": 93,
          "Page": 0,
          "Path": "//Document/Sect/P[15]/Sub[3]",
          "Rotation": 90.0,
          "Text": "NO ",
          "TextSize": 9.2,
          "attributes": {
            "Placement": "Block"
          }
        },
        {
          "Bounds": [
            215.7,
            339.7,
            222.9,
            344.5
          ],
          "Font": {
            "alt_family_name": "Times",
            "embedded": true,
            "encoding": "MacRomanEncoding",
            "family_name": "Times",
            "font_type": "Type1",
            "italic": false,
            "monospaced": false,
            "name": "...+Times-Roman",
            "subset": true,
            "weight": 400
          },
          "HasClip": true,
          "ObjectID": 94,
          "Page": 0,
          "Path": "//Document/Sect/P[15]/Sub[3]/StyleSpan",
          "Rotation": 90.0,
          "Text": "2 ",
          "TextSize": 6.4,
          "attributes": {
            "TextPosition": "Sup"
          }
        }
...
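
Absent an explicit reference in `Text`, one way to pair such a child with its parent is to rely on the `Path` values: in the excerpt above, the child's path is the parent's path plus `/StyleSpan`. The following is a minimal sketch under that assumption only. `group_inline_children` is a hypothetical helper, not part of the Extract API, and the elements are trimmed to the fields it uses.

```python
from collections import defaultdict


def group_inline_children(elements):
    """Map each parent Path to the elements whose Path extends it with /StyleSpan.

    Assumption: an inline child is identified by a Path of the form
    "<parent path>/StyleSpan" (possibly followed by an index such as "[2]").
    """
    paths = {e["Path"] for e in elements}
    children = defaultdict(list)
    for element in elements:
        parent_path, sep, tail = element["Path"].rpartition("/")
        if sep and tail.startswith("StyleSpan") and parent_path in paths:
            children[parent_path].append(element)
    return dict(children)


# Trimmed copies of the two elements from the excerpt above.
elements = [
    {"ObjectID": 93, "Path": "//Document/Sect/P[15]/Sub[3]", "Text": "NO "},
    {"ObjectID": 94, "Path": "//Document/Sect/P[15]/Sub[3]/StyleSpan", "Text": "2 "},
]

grouped = group_inline_children(elements)
for parent_path, spans in grouped.items():
    print(parent_path, [s["Text"] for s in spans])
```

This only tells you which span belongs to which parent; it says nothing yet about where in the parent's text the span belongs.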

 

 

TOPICS
PDF Extract API

Community Expert, Jul 18, 2024

Can you share the input PDF? You might have to turn on Character Boundaries (CharBounds) to get the order right.

New Here, Jul 19, 2024

Hi, I can only share the PDF confidentially. But CharBounds sounds interesting. How do I enable that?

New Here, Jul 19, 2024

Oh, wait, that will just give you the boundaries of individual characters? So it's on the user side to merge these spans? If so, that sounds error-prone and like a lot of work.
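
To make the merging step concrete, here is a rough sketch of what such a user-side heuristic could look like. It assumes the parent element carries a per-character list of boxes under a `CharBounds` key, aligned one-to-one with the characters of `Text`, and that coordinates increase along the reading direction (x for unrotated text, y for text rotated by 90). None of that is confirmed in this thread. The `CharBounds` values below are invented for illustration; only the child's `Bounds` come from the posted excerpt.

```python
def reading_axis(rotation):
    """Index of the start coordinate along the text direction in [x0, y0, x1, y1]."""
    return 1 if round(rotation) % 180 == 90 else 0


def insertion_index(parent_char_bounds, child_bounds, rotation=0.0):
    """Count the parent characters whose midpoint lies before the child's start."""
    axis = reading_axis(rotation)
    child_start = child_bounds[axis]
    index = 0
    for box in parent_char_bounds:
        midpoint = (box[axis] + box[axis + 2]) / 2
        if midpoint < child_start:
            index += 1
    return index


def merge_inline(parent, child):
    """Insert the child's text into the parent's text at the estimated position."""
    index = insertion_index(
        parent["CharBounds"], child["Bounds"], parent.get("Rotation", 0.0)
    )
    text = parent["Text"]
    return text[:index] + child["Text"].strip() + text[index:]


parent = {
    "Text": "NO ",
    "Rotation": 90.0,
    # Hypothetical per-character boxes for "N", "O", " " (not real output).
    "CharBounds": [
        [211.8, 333.0, 222.9, 339.6],
        [211.8, 344.6, 222.9, 351.2],
        [211.8, 351.2, 222.9, 353.5],
    ],
}
child = {"Text": "2 ", "Bounds": [215.7, 339.7, 222.9, 344.5]}

print(merge_inline(parent, child).strip())  # N2O
```

Text rotated by 180 or 270 would run in the opposite direction along its axis, so a real implementation would need to reverse the comparison for those cases.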

Adobe Employee, Jul 19, 2024

Community Expert, Jul 19, 2024

You can send it as a direct message to me. I'll just "look" at it. I never actually read these things. I just look at the words and their positions on the page.

New Here, Jul 22, 2024

Okay, so it will take custom heuristics to merge the text. I'm not sure it's worth it for us then, but thanks for the response! If there is a simpler option where I don't have to write my own heuristics, I'd be interested.

Community Expert, Jul 22, 2024

I'm afraid that's the only way right now. I noticed that there are superscripts and subscripts in the area you posted. We read the text from top to bottom, so depending on exactly how far the text is raised or lowered, we can get the order wrong unless you "look" at it the way the AI would, where superscripts come first, then the normal text, then the subscripts. We are currently working on better "formula" recognition for exactly this reason.

 

Until then, if superscripts and subscripts are the only occurrences of this issue, you won't have to examine every bit of text, just the parts with these characteristics. It might be worth it at that point. If I can get even a page of the original content, I can probably write some sample code. I love those little puzzles. You can direct message the file to me.
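
As a rough illustration of narrowing the work down to "just the parts with these characteristics", the sketch below picks out only the elements flagged as superscript or subscript. It assumes the flag appears as `attributes.TextPosition` with the values `"Sup"` or `"Sub"`, as in the excerpt posted earlier in this thread; the helper names are hypothetical.

```python
SCRIPT_POSITIONS = {"Sup", "Sub"}


def is_script_span(element):
    """True if the element is flagged as a superscript or subscript."""
    position = element.get("attributes", {}).get("TextPosition")
    return position in SCRIPT_POSITIONS


def spans_to_review(elements):
    """Return only the elements that need a custom merge."""
    return [e for e in elements if is_script_span(e)]


# Trimmed copies of the two elements from the original post.
elements = [
    {"ObjectID": 93, "Text": "NO ", "attributes": {"Placement": "Block"}},
    {"ObjectID": 94, "Text": "2 ", "attributes": {"TextPosition": "Sup"}},
]

for span in spans_to_review(elements):
    print(span["ObjectID"], repr(span["Text"]))
```

Everything that does not pass this filter can be taken as extracted, so the custom ordering logic only has to run on the flagged spans and their parents.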
