Participant
January 25, 2023
Question

How to remove hyperlinks from a PDF file with an API service?


Hello,

Is there a way to remove all hyperlinks from a PDF document using one of the API services?

Thanks and regards


    Legend
    January 26, 2023

    I have no specific answer for the API services, but one thing to bear in mind: if a URL appears as visible text on the page (like http://something), it will keep working no matter what you do. Such URLs work whether or not they are actual hyperlinks.

    Participant
    January 26, 2023

    That would be OK. I specifically want to remove any links/references to other pages in the document.
    The background is that the extract algorithm can't handle these links.


    Each ordinary word that carries a hyperlink to another page is created as a separate object in the generated JSON file, and the text of the surrounding element no longer contains this word.

    I think this is a bug, but a workaround would be to remove all links before extracting. That would be generally useful for the extract service, because neither the link itself nor the reference is output in the JSON file. So at the moment there is no real added value in leaving the hyperlinks in the document.

    Participant
    January 26, 2023

    Example:

    1. The text on the page reads: "b)  derived from Table 4."

    2. "Table 4" has a hyperlink to the page in the document that contains Table 4.

    The JSON looks like:


    {
    			"Bounds": [
    				36.85040283203125,
    				609.50439453125,
    				47.90544128417969,
    				670.8843994140625
    			],
    			"ClipBounds": [
    				36.85040283203125,
    				609.50439453125,
    				47.90544128417969,
    				670.8843994140625
    			],
    			"Font": {
    				"alt_family_name": "Cambria",
    				"embedded": true,
    				"encoding": "Custom",
    				"family_name": "Cambria",
    				"font_type": "TrueType",
    				"italic": false,
    				"monospaced": false,
    				"name": "AAAAAC+Cambria",
    				"subset": true,
    				"weight": 400
    			},
    			"HasClip": true,
    			"Page": 25,
    			"Path": "//Document/L[39]/LI/LBody/L/LI[2]/Lbl",
    			"Text": "b) ",
    			"TextSize": 11.0
    		},
    		{
    			"Bounds": [
    				121.38999938964844,
    				609.50439453125,
    				155.33999633789062,
    				670.8843994140625
    			],
    			"ClipBounds": [
    				121.38999938964844,
    				609.50439453125,
    				155.33999633789062,
    				670.8843994140625
    			],
    			"Font": {
    				"alt_family_name": "Cambria",
    				"embedded": true,
    				"encoding": "Custom",
    				"family_name": "Cambria",
    				"font_type": "TrueType",
    				"italic": false,
    				"monospaced": false,
    				"name": "AAAAAC+Cambria",
    				"subset": true,
    				"weight": 400
    			},
    			"HasClip": true,
    			"Page": 25,
    			"Path": "//Document/L[39]/LI/LBody/L/LI[2]/LBody/StyleSpan/Reference",
    			"Text": "Table 4",
    			"TextSize": 11.0
    		},
    		{
    			"Bounds": [
    				56.96940612792969,
    				609.50439453125,
    				160.34469604492188,
    				670.8843994140625
    			],
    			"ClipBounds": [
    				56.96940612792969,
    				609.50439453125,
    				160.34469604492188,
    				670.8843994140625
    			],
    			"Font": {
    				"alt_family_name": "Cambria",
    				"embedded": true,
    				"encoding": "Custom",
    				"family_name": "Cambria",
    				"font_type": "TrueType",
    				"italic": false,
    				"monospaced": false,
    				"name": "AAAAAC+Cambria",
    				"subset": true,
    				"weight": 400
    			},
    			"HasClip": true,
    			"Page": 25,
    			"Path": "//Document/L[39]/LI/LBody/L/LI[2]/LBody",
    			"Text": "derived from . ",
    			"TextSize": 11.0
    		},
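Until links can be removed before extraction, one possible workaround is to repair the JSON afterwards. The sketch below is only an illustration based on the three elements shown above, not part of any API: it assumes linked words always have a "Path" ending in "/Reference", that the surrounding text element is the one whose "Path" is the longest prefix of that path, and that "Bounds" is ordered left, bottom, right, top. The function names (`find_parent`, `insert_word`, `merge_references`) are made up for this example, and the insertion position is only estimated from the horizontal coordinates and snapped to the nearest word boundary, so it can be wrong for long lines or several links in one element.

```python
def find_parent(ref, elements):
    """Return the text element whose Path is the longest proper prefix of ref's Path."""
    best = None
    for el in elements:
        if el is ref or "Text" not in el or "Bounds" not in el:
            continue
        path = el.get("Path", "")
        if ref["Path"].startswith(path + "/"):
            if best is None or len(path) > len(best["Path"]):
                best = el
    return best


def insert_word(parent, ref):
    """Put ref's text back into parent's text at an estimated word boundary."""
    text, word = parent["Text"], ref["Text"]
    left, right = parent["Bounds"][0], parent["Bounds"][2]
    # Horizontal position of the linked word, as a fraction of the parent's width.
    frac = (ref["Bounds"][0] - left) / (right - left) if right > left else 1.0
    guess = frac * (len(text) + len(word))
    # Candidate positions: start, end, and every position right after whitespace.
    boundaries = [0, len(text)] + [
        i for i in range(1, len(text)) if text[i - 1].isspace()
    ]
    pos = min(boundaries, key=lambda i: abs(i - guess))
    return text[:pos] + word + text[pos:]


def merge_references(elements):
    """Return a new list with "/Reference" elements folded into their parents."""
    copies = [dict(el) for el in elements]  # leave the input untouched
    kept = []
    for el in copies:
        is_ref = (
            el.get("Path", "").endswith("/Reference")
            and "Text" in el
            and "Bounds" in el
        )
        parent = find_parent(el, copies) if is_ref else None
        if parent is None:
            kept.append(el)
        else:
            parent["Text"] = insert_word(parent, el)
    return kept


# The three elements from the example, reduced to the fields used here.
sample = [
    {"Path": "//Document/L[39]/LI/LBody/L/LI[2]/Lbl", "Text": "b) ",
     "Bounds": [36.85040283203125, 609.50439453125,
                47.90544128417969, 670.8843994140625]},
    {"Path": "//Document/L[39]/LI/LBody/L/LI[2]/LBody/StyleSpan/Reference",
     "Text": "Table 4",
     "Bounds": [121.38999938964844, 609.50439453125,
                155.33999633789062, 670.8843994140625]},
    {"Path": "//Document/L[39]/LI/LBody/L/LI[2]/LBody", "Text": "derived from . ",
     "Bounds": [56.96940612792969, 609.50439453125,
                160.34469604492188, 670.8843994140625]},
]

merged = merge_references(sample)
for el in merged:
    print(repr(el["Text"]))
# prints 'b) ' and then 'derived from Table 4. '
```

With the real output, the list to pass in would be the array of elements in the extracted JSON file, loaded with `json.load`.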

     
