Extracting data from PDF to build HTML

Report · Aug 31, 2021

I used the PDF Extract API to extract information from a PDF and build an HTML page. It would be helpful if you could answer the following questions about the API:

1. Can you share details on how to identify the location of elements from the JSON output? The "Bounds" key in the JSON seems to hold this, but I am unsure whether it is the right location info.

2. The documentation page here states that 'The output does not include headers or footers'. Is there any way this can be accessed?

3. The documentation page here mentions that 'headings that repeat across pages are reported for the first occurrence only'. Does the structuredData include information about subsequent occurrences of the header? If not, can you share details on how to identify the different locations of the same header?

4. Text overlaid on an image (Test, $10M+ Rev) in this PDF (attached as PHT4) is extracted as part of the image and not as text. The sections highlighted in the image are added as text. Is there any way to extract them as text and not as an image?

5. The below section in the PDF has a gradient background with a surrounding boundary. Can you help in locating this information in the json output?

6. The line of text below has different formatting for different words. I noted somewhere in the documentation that formatting is based only on the first character in a line, but I am unable to find any formatting info.

Here is the relevant section from the json:

  {
      "Bounds": [
        72.02400207519531,
        721.4035034179688,
        271.4056091308594,
        742.0164947509766
      ],
      "ClipBounds": [
        72.02400207519531,
        721.4035034179688,
        271.4056091308594,
        742.0164947509766
      ],
      "Font": {
        "alt_family_name": "Calibri",
        "embedded": true,
        "encoding": "WinAnsiEncoding",
        "family_name": "Calibri",
        "font_type": "TrueType",
        "italic": false,
        "monospaced": false,
        "name": "BCDGEE+Calibri-Light",
        "subset": true,
        "weight": 300
      },
      "HasClip": true,
      "Lang": "en",
      "Page": 0,
      "Path": "//Document/Sect/P",
      "Text": "PDF to HTML Conversion ",
      "TextSize": 20.039993286132812,
      "attributes": {
        "LineHeight": 21.625,
        "SpaceAfter": 11.375
      }
  }

Clarity on the above 6 queries would help in integrating and consuming the API.

Thanks!

Report · Aug 31, 2021

Some of these are issues that I am also facing, would be very curious to understand what can be done here.

Report · Aug 31, 2021

Just a small point on Q.6. The example does not show so many text formats as you suggest. Text formatting is ONLY a font and size, possibly an outline style. Nothing else. As a graphical object it also has a colour. Underlining is lines drawn on the page near text. Highlighting is boxes drawn behind the text, or highlighting annotations. "Background colour" is not a text attribute in PDF.

Detailed study of ISO 32000-1 (or -2) may help understand the specific limits of what is actually represented as an entity in PDF, and what will be simulated using multiple graphical or other entities not associated with each other. Study of this would show you, for example, that there is no such thing as a table.

Re Q:5. A gradient may be

- a single gradient object giving end colours and mapping rules (very complex). Rarely actually found.

- a scaled image

- multiple separate graphical objects

Report · Aug 31, 2021

Thanks for your response @testm.

Q.6 (Regarding text formatting):

So, is there no way to extract the formatting information of the text from the API as of now?

Q.5 I was unable to find the gradient info in any form in the output. Is there any way to access gradient information, as an image or something else?

Would be helpful if you can respond to other queries.

Thanks!

Report · Aug 31, 2021

"So, is there no way to extract the formatting information of the text from the API as of now?" This is not a software limitation, it is a design aim of PDF. So it is working correctly.

I do not have any knowledge of the API in question, I just look for questions relating more generally to PDF internals.

Report · Aug 31, 2021

Ok. Thanks for the clarity.

Are there any other ways to extract underline and background color information from the PDF and apply it to text in the republished content?

Report · Aug 31, 2021

I don't know whether this API supports it, but the general process for PDF extraction would be to identify the bounding box of each text object and each graphical object; identify candidates for background or underline by analysis of the graphic properties; try to match up, allowing for variation in baseline, broken or continuous underline, and anything else that comes up. Be under no illusion: this is a very challenging operation, but not as challenging as your project to convert to HTML. One could say: if this task was easy and practical, then there was no need for PDF in the first place. A common approach is to simply rasterise the PDF using existing libraries, and serve it as graphics; a level 2 approach might add hidden text for searchability.
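The matching step described above can be sketched roughly as follows. This is only an illustration under stated assumptions: boxes are (left, bottom, right, top) tuples in points with a bottom-left origin, the graphic boxes have already been obtained by some other means (nothing in this thread confirms the Extract API reports them), and `find_underlines` with its thresholds is a hypothetical helper, not part of any library.

```python
def find_underlines(text_boxes, graphic_boxes, max_thickness=2.0, max_gap=4.0):
    """Pair each text box with thin, wide graphics lying near its bottom edge.

    Boxes are (left, bottom, right, top) in points, bottom-left origin.
    Returns {text_index: [graphic_index, ...]}. Thresholds are guesses.
    """
    matches = {}
    for ti, (tl, tb, tr, tt) in enumerate(text_boxes):
        for gi, (gl, gb, gr, gt) in enumerate(graphic_boxes):
            # An underline candidate is a very flat rectangle or line.
            thin = (gt - gb) <= max_thickness
            # Horizontal overlap, as a fraction of the graphic's own width.
            overlap = min(tr, gr) - max(tl, gl)
            mostly_under_text = (gr - gl) > 0 and overlap / (gr - gl) >= 0.8
            # The graphic's top edge should sit close to the text's bottom edge.
            near_bottom_edge = -max_gap <= (gt - tb) <= max_gap
            if thin and mostly_under_text and near_bottom_edge:
                matches.setdefault(ti, []).append(gi)
    return matches


texts = [(72, 700, 200, 712), (72, 650, 200, 662)]
graphics = [
    (72, 698.5, 150, 699.2),  # thin rule just under the first line of text
    (60, 640, 300, 670),      # tall box: a background candidate, not an underline
]
underlines = find_underlines(texts, graphics)
print(underlines)
```

Background or highlight boxes would need a separate rule (a graphic that encloses the text box rather than sitting under it), and real documents will need the tolerances tuned.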

Adobe do offer a very developed engine for delivering PDFs live as HTML, based on their 25 years' experience and massive codebase; I obviously don't know why you have rejected this.

Report · Aug 31, 2021

While exploring the other available APIs, I found the Export API. Exporting to .docx generated an output (also attached here) that is nearly identical to the PDF.

Given this, I am guessing we should also be able to get this information from the Extract API, shouldn't we?

Report · Aug 31, 2021

Adobe also have 20+ years of converting PDF to usable DOC files - some people can't use the result, but it's a good job. Was it built on PDF extraction? Certainly. Was it built on THIS PDF extraction? No. Can it be built on this one? Maybe, provided it extracts all the raw info; I don't know how complete or selective it is, nor have I seen any specification of what it does in detail. I may be wrong but I think you're just supposed to pick around in the JSON and see if you get what you need.

What you WON'T get is the results of 20 years of transforming the extracted data for use in DOC extraction. How about using that and converting DOC to PDF - may well be much more practical...

Report · Aug 31, 2021

Thanks for adding the reason and history behind such precision in conversion. I have yet to find ways in which more info can be extracted from the Extract API. I would highly appreciate it if you could share any such info when you come across it.

Can you also forward/point the other questions to someone who would have the additional context to answer them?

Thanks!

Report · Mar 24, 2022

Hi,

Just curious to know if you got an answer to the first question that you posted? I'm trying to interpret the Bounds value to locate the position of the text from the Extract API output. Any example to illustrate would be of great help.

Thanks

Report · May 22, 2023

Hey SivaK, Is this what you are looking for?

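# Assumes Bounds is [left, bottom, right, top] in PDF points (1/72 inch),
# measured from the bottom-left corner of the page.
bounds = [72.024, 721.4035, 271.4056, 742.0165]  # sample values from the JSON above

meter = 0.0254  # meters per inch

page_height = 792  # US Letter height in points; use your page's actual height

# Position measured from the top-left corner of the page, in meters.
x = (bounds[0] / 72) * meter
y = ((page_height - bounds[3]) / 72) * meter

# Size of the element, in meters.
width = ((bounds[2] - bounds[0]) / 72) * meter
height = ((bounds[3] - bounds[1]) / 72) * meter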
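Since the goal in this thread is HTML, the same conversion can be expressed in CSS pixels instead of meters. This is a minimal sketch, assuming Bounds is [left, bottom, right, top] in points with a bottom-left origin (which is what the snippet above relies on) and that the page height is known. `bounds_to_css_box` is a hypothetical helper name, not part of the Extract API.

```python
PX_PER_PT = 96 / 72  # CSS defines 1in = 96px and 1pt = 1/72in


def bounds_to_css_box(bounds, page_height_pt):
    """Convert [left, bottom, right, top] in points (bottom-left origin)
    to a box in CSS pixels measured from the top-left corner of the page."""
    left, bottom, right, top = bounds
    return {
        "left": left * PX_PER_PT,
        "top": (page_height_pt - top) * PX_PER_PT,  # flip the y axis
        "width": (right - left) * PX_PER_PT,
        "height": (top - bottom) * PX_PER_PT,
    }


# Illustrative values, assuming a US Letter page (792 pt tall).
box = bounds_to_css_box([72.0, 720.0, 270.0, 742.5], page_height_pt=792)
style = "position:absolute;" + "".join(f"{k}:{v:.1f}px;" for k, v in box.items())
print(style)
```

The resulting string could be used as the inline style of an absolutely positioned element inside a container sized to the page.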

Report · Sep 03, 2021

Clarity on the above from the Adobe team would greatly help.

Report · Sep 13, 2021

Hi @Nikhil Ranka ,

Clarity on the above from the Adobe team would greatly help.

By @Nikhil Ranka

That is correct. Those are the coordinates of the location of the element on the page. The difference between this and the ClipBounds is that something technically can be clipped (or cropped). In the practical example above, they are the same.

Clarity on the above from the Adobe team would greatly help.

By @Nikhil Ranka

This has been requested and is on our roadmap.

Clarity on the above from the Adobe team would greatly help.

By @Nikhil Ranka

Can you help illustrate the scenario of the headings repeating across pages in your scenario? There are many scenarios where that might occur and I would probably handle it a little differently.

Clarity on the above from the Adobe team would greatly help.

By @Nikhil Ranka

If it is recognizing it as part of the image and not as text, then it probably is not picking it up. You could try extracting the image out, running OCR on the image to get that text out, but not as part of PDF Extract API. If it recognizes it as part of the image there isn't any configuration change you can make. You can only try taking the images and seeing if you can extract the data out of the images separately.
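As a starting point for the "extract the images, then OCR them separately" route, the figure elements can be collected from the JSON first. This is a rough sketch and rests on assumptions not confirmed in this thread: that the output has a top-level `elements` list, that figures have a `Path` containing `/Figure`, and that exported image files are listed under a `filePaths` key. `figure_candidates` is a hypothetical helper, and the OCR step itself is left to whatever external tool you choose.

```python
def figure_candidates(structured_data):
    """Return (page, path, files) for every element whose Path mentions a Figure.

    The key names "elements" and "filePaths" are assumptions about the output.
    """
    found = []
    for element in structured_data.get("elements", []):
        path = element.get("Path", "")
        if "/Figure" in path:
            found.append((element.get("Page"), path, element.get("filePaths", [])))
    return found


# Made-up sample shaped like the JSON fragment quoted earlier in the thread.
sample = {
    "elements": [
        {"Path": "//Document/Sect/P", "Page": 0, "Text": "PDF to HTML Conversion "},
        {"Path": "//Document/Sect/Figure", "Page": 0,
         "filePaths": ["figures/fileoutpart0.png"]},
    ]
}
candidates = figure_candidates(sample)
print(candidates)
```

Each returned file path is then a candidate to pass to an OCR tool outside the PDF Extract API.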

Report · Sep 13, 2021

@Ben Vanderberg

Can you help illustrate the scenario of the headings repeating across pages in your scenario? There are many scenarios where that might occur and I would probably handle it a little differently.

In a PDF report containing measurements/stats about different aspects of a machine, the heading measurements/stats would be repeated. Can you share your insights on how one can detect repetition of a heading?
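One possible heuristic, sketched here without any confirmation from the Adobe team: take the text of every element tagged as a heading, then look for other elements on other pages carrying the same text. It assumes elements expose `Path`, `Page` and `Text` as in the JSON fragment quoted earlier, that heading paths end in `H`, `H1`, `H2` and so on, and that later occurrences show up in the output at all (possibly tagged as something other than a heading). `repeated_heading_pages` is a hypothetical helper.

```python
import re
from collections import defaultdict

# Matches paths ending in a heading tag, e.g. //Document/Sect/H1 or .../H2[3]
HEADING = re.compile(r"/H\d*(\[\d+\])?$")


def repeated_heading_pages(elements):
    """Map each heading text to the sorted pages on which that text appears,
    keeping only texts that appear on more than one page."""
    def norm(text):
        return " ".join(text.split()).casefold()

    headings = {norm(e["Text"]) for e in elements
                if "Text" in e and HEADING.search(e.get("Path", ""))}
    pages = defaultdict(set)
    for e in elements:
        if "Text" in e and norm(e["Text"]) in headings:
            pages[norm(e["Text"])].add(e.get("Page", -1))
    return {text: sorted(p) for text, p in pages.items() if len(p) > 1}


# Made-up sample data for illustration only.
elements = [
    {"Path": "//Document/Sect/H1", "Page": 0, "Text": "Measurements "},
    {"Path": "//Document/Sect/P", "Page": 0, "Text": "Pump pressure: 4 bar"},
    {"Path": "//Document/Sect[2]/P", "Page": 1, "Text": "measurements"},
    {"Path": "//Document/Sect[2]/P[2]", "Page": 1, "Text": "Valve temperature: 60 C"},
]
repeats = repeated_heading_pages(elements)
print(repeats)
```

If the repeated headings are dropped from the output entirely, this approach will find nothing and the positions would have to come from another source.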

You could try extracting the image out, running OCR on the image to get that text out, but not as part of PDF Extract API.

Does the API follow any specific pattern? I.e., if text is overlaid on an image, will it be rasterized? Without an understanding of the rule followed, it becomes cumbersome to republish the PDF into HTML or other formats, because one needs to use OCR to figure out whether an image has text or not. In cases where images are overlaid with placeholders for dynamic replacement, e.g. {firstName}, no replacement would take place.

PS: It appears that in your response the same section of my message is quoted before each answer. It was simple enough for me to follow, but it would be difficult for someone else to tell which answer goes with which question.