I am using the document extract REST API to parse PDFs that include hyperlinks. From my testing, it looks like the API parses each hyperlink as a separate element from the sentence that contains it. As far as I can tell, there is no indication of where the hyperlink was located in the string that would let me reassemble it: the extraction leaves no placeholder symbol and eats the trailing space.
Below is the content analyzer request I have been using. The documentation has given me no clue as to how to either get it to ignore hyperlinks or otherwise indicate where they were extracted from.
That is the design of PDF files. There is text on a page, and there are also hyperlinks, which are identified only by rectangles on the page. The PDF format itself records no connection between the text and the hyperlink. Working out which text forms a link requires comparing the position of each character with the position of each link rectangle.
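To make that comparison concrete, here is a minimal sketch in Python. It assumes you can get a bounding box for every character and a rectangle plus URI for every link from whatever extractor you use; the extract API you are calling may or may not expose those coordinates. The `Rect`, `Char`, and `Link` types and the `link_spans` function are hypothetical names made up for this illustration, not part of any library, and the sketch also assumes the linked characters are contiguous in reading order.

```python
from dataclasses import dataclass


@dataclass
class Rect:
    """Axis-aligned rectangle in page coordinates (hypothetical type)."""
    x0: float
    y0: float
    x1: float
    y1: float

    def contains(self, x: float, y: float) -> bool:
        return self.x0 <= x <= self.x1 and self.y0 <= y <= self.y1


@dataclass
class Char:
    """One extracted character and its bounding box (hypothetical type)."""
    text: str
    box: Rect


@dataclass
class Link:
    """One link annotation: target URI and clickable rectangle (hypothetical type)."""
    uri: str
    rect: Rect


def link_spans(chars, links):
    """Return (start, end, uri) index ranges into `chars` for each link.

    A character counts as part of a link when the center of its box
    lies inside the link rectangle. Assumes the linked characters are
    contiguous in `chars`, so the span runs from first to last hit.
    """
    spans = []
    for link in links:
        hits = [
            i for i, c in enumerate(chars)
            if link.rect.contains((c.box.x0 + c.box.x1) / 2,
                                  (c.box.y0 + c.box.y1) / 2)
        ]
        if hits:
            spans.append((hits[0], hits[-1] + 1, link.uri))
    return spans


# Made-up demo data: one line of text, every character 10 units wide.
sentence = "See docs here."
chars = [Char(ch, Rect(10 * i, 0, 10 * i + 10, 12))
         for i, ch in enumerate(sentence)]
# A link rectangle placed over the word "docs" (characters 4 to 7).
links = [Link("https://example.com", Rect(40, 0, 80, 12))]

spans = link_spans(chars, links)
linked_text = ["".join(c.text for c in chars[s:e]) for s, e, _ in spans]
```

With the index ranges in hand you can put a marker back into the sentence at the right place, or drop the link elements and keep the plain text. Testing the center of each character rather than its whole box is a design choice that tolerates link rectangles that are slightly tighter or looser than the glyphs they cover.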