Adobe Extract API: hyperlink is a different element than the string it is in
Hello,
I am using the Document Extract REST API to parse PDFs that include hyperlinks. From my testing, it looks like the API parses a hyperlink as a separate element from the sentence that contains it. As far as I can tell, there is no indication of where the hyperlink was located in the string, so I cannot reassemble it: the extraction leaves no placeholder symbol and eats the trailing space.
Below is the content analyzer request I have been using. The documentation has given me no clue as to how to either get it to ignore hyperlinks or otherwise indicate where they were extracted from.
{
  "cpf:engine": {
    "repo:assetId": "urn:aaid:cpf:58af6e2c-1f0c-400d-9188-078000185695"
  },
  "cpf:inputs": {
    "documentIn": {
      "cpf:location": "InputFile0",
      "dc:format": "application/pdf"
    },
    "params": {
      "cpf:inline": {
        "elementsToExtract": [
          "text",
          "tables"
        ]
      }
    }
  },
  "cpf:outputs": {
    "elementsInfo": {
      "cpf:location": "jsonoutput",
      "dc:format": "application/json"
    },
    "elementsRenditions": {
      "cpf:location": "fileoutpart",
      "dc:format": "text/directory"
    }
  }
}
Any help would be appreciated!
That is the design of PDF files. There is text on a page, and there are also hyperlinks, which are identified by rectangles. Nothing in the PDF connects the text to the hyperlink. Working out which text forms a link requires comparing the position of each character with the position of each link rectangle.
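As a rough illustration of that comparison, here is a minimal Python sketch. It assumes you already have a bounding box for each piece of text and a rectangle for each link, all in the same page coordinate system (left, bottom, right, top). How you obtain those boxes is not shown, the coordinates are made up, and `Rect`, `overlap_area`, and `spans_in_link` are hypothetical names rather than part of any Adobe API.

```python
from typing import List, NamedTuple, Tuple


class Rect(NamedTuple):
    """Axis-aligned rectangle in page coordinates (assumed convention)."""
    left: float
    bottom: float
    right: float
    top: float


def area(r: Rect) -> float:
    """Area of a rectangle; zero if it is degenerate."""
    return max(0.0, r.right - r.left) * max(0.0, r.top - r.bottom)


def overlap_area(a: Rect, b: Rect) -> float:
    """Area of the intersection of two rectangles, or 0.0 if they are disjoint."""
    width = min(a.right, b.right) - max(a.left, b.left)
    height = min(a.top, b.top) - max(a.bottom, b.bottom)
    return width * height if width > 0 and height > 0 else 0.0


def spans_in_link(
    spans: List[Tuple[str, Rect]], link: Rect, min_fraction: float = 0.5
) -> List[str]:
    """Return the text of every span whose box lies mostly inside the link rectangle."""
    hits = []
    for text, box in spans:
        box_area = area(box)
        if box_area > 0 and overlap_area(box, link) / box_area >= min_fraction:
            hits.append(text)
    return hits


# Made-up example: three text spans on one line and one link rectangle.
spans = [
    ("See", Rect(72, 700, 90, 712)),
    ("the docs", Rect(93, 700, 130, 712)),
    ("for details.", Rect(133, 700, 190, 712)),
]
link = Rect(92, 699, 131, 713)

linked_text = spans_in_link(spans, link)
print(linked_text)
```

The `min_fraction` threshold is a judgment call: link rectangles are often slightly larger or smaller than the text they cover, so requiring exact containment would likely miss real matches.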
I'm reporting this as a bug. Hopefully they'll escalate it.

