Adobe Extract API: hyperlink is a different element than the string it is in

Oct 15, 2021

Hello,

I am using the Document Extract REST API to parse PDFs that include hyperlinks. From my testing, the API returns each hyperlink as a separate element from the sentence that contains it. As far as I can tell, nothing in the output indicates where the hyperlink was located in the string, so I cannot reassemble it: the extraction leaves no placeholder symbol and swallows the trailing space.

Below is the content analyzer request I have been using. The documentation has given me no clue as to how to either make it ignore hyperlinks or otherwise indicate where they were extracted from.

{
  "cpf:engine": {
    "repo:assetId": "urn:aaid:cpf:58af6e2c-1f0c-400d-9188-078000185695"
  },
  "cpf:inputs": {
    "documentIn": {
      "cpf:location": "InputFile0",
      "dc:format": "application/pdf"
    },
    "params": {
      "cpf:inline": {
        "elementsToExtract": [
          "text",
          "tables"
        ]
      }
    }
  },
  "cpf:outputs": {
    "elementsInfo": {
      "cpf:location": "jsonoutput",
      "dc:format": "application/json"
    },
    "elementsRenditions": {
      "cpf:location": "fileoutpart",
      "dc:format": "text/directory"
    }
  }
}

Any help would be appreciated!

Oct 16, 2021

That is the design of PDF files. There is text on a page, and there are also hyperlinks, which are identified only by rectangles on that page. The PDF format records no connection between the text and the hyperlink. Working out which text forms a link requires comparing the position of each character with the position of each link rectangle.
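
For anyone who wants to try that position comparison themselves, here is a rough Python sketch. It is not anything the Extract API provides. The `Rect` type, `overlap_fraction`, `match_links`, the tuple layouts, and the 0.5 threshold are all made up for illustration, and it assumes you have already obtained text bounds and link rectangles in the same page coordinate space, as (left, bottom, right, top).

```python
from typing import NamedTuple


class Rect(NamedTuple):
    """Axis-aligned box in page coordinates (assumed: left, bottom, right, top)."""
    left: float
    bottom: float
    right: float
    top: float


def overlap_fraction(inner: Rect, outer: Rect) -> float:
    """Return the fraction of inner's area that lies inside outer (0.0 to 1.0)."""
    width = min(inner.right, outer.right) - max(inner.left, outer.left)
    height = min(inner.top, outer.top) - max(inner.bottom, outer.bottom)
    if width <= 0 or height <= 0:
        return 0.0  # the boxes do not intersect
    area = (inner.right - inner.left) * (inner.top - inner.bottom)
    return (width * height) / area if area > 0 else 0.0


def match_links(text_runs, links, threshold=0.5):
    """Pair each text run with the first link rectangle that mostly covers it.

    text_runs: iterable of (text, page, Rect)   -- hypothetical layout
    links:     iterable of (page, Rect, uri)    -- hypothetical layout
    threshold: minimum covered fraction to count as a match (arbitrary choice)
    """
    matches = []
    for text, page, bounds in text_runs:
        for link_page, link_rect, uri in links:
            if page == link_page and overlap_fraction(bounds, link_rect) >= threshold:
                matches.append((text, uri))
                break
    return matches


# Made-up sample data: two runs on page 0, and the same box again on page 1.
runs = [
    ("Visit ", 0, Rect(72, 700, 100, 712)),
    ("our site", 0, Rect(100, 700, 140, 712)),
    ("our site", 1, Rect(100, 700, 140, 712)),
]
page_links = [(0, Rect(99, 698, 141, 714), "https://example.com")]
result = match_links(runs, page_links)
```

With the sample data, only the "our site" run on page 0 is paired with the link. The threshold is there because link rectangles rarely line up exactly with text boxes, so a strict containment test would miss many real matches.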

Oct 26, 2021

I'm reporting this as a bug. Hopefully they'll escalate it.

Adobe Community
