Participant
December 27, 2021
Question

Some text unexpectedly turned into images by the PDF Extract API


Hi,

I've been using the PDF Extract API to display the content of a PDF as an HTML file. So far it's working well, but I'm seeing a couple of odd cases that you might be interested in. The following happens in a PDF that has sets of fill-in-the-blank questions.

For the following text from the PDF:

SELECT c.name 'Channel',
COUNT(i.dataid) 'Count of Items'
FROM __________ i, __________ c
WHERE i.parentid = c.___________
AND i.____________ = 208
GROUP BY c.name, i._____________
ORDER BY COUNT(i.___________) DESC

 

The last line is extracted as an image instead of text; see the JSON and screenshot below.


2 replies

Joel Geraci
Community Expert
February 16, 2022

Can you share the actual PDF? I'd like to use it to train our AI.

Participant
February 23, 2022

Sorry, it's a client document that I can't share.

Participant
December 27, 2021

Here is the JSON that gets generated:

    {
      "Bounds": [
        231.8408966064453,
        331.61090087890625,
        435.8748016357422,
        364.3813934326172
      ],
      "Font": {
        "alt_family_name": "Courier New",
        "embedded": true,
        "encoding": "Identity-H",
        "family_name": "Courier New",
        "font_type": "CIDFontType2",
        "italic": false,
        "monospaced": false,
        "name": "EDGRTV+CourierNew,Bold",
        "subset": true,
        "weight": 700
      },
      "HasClip": false,
      "Lang": "en",
      "Page": 24,
      "Path": "//Document/L[17]/LI/LBody/P[3]",
      "Text": "WHERE i.parentid = c.___________ AND i.____________ = 208 ",
      "TextSize": 10.5,
      "attributes": {
        "LineHeight": 12.5
      }
    },
    {
      "Bounds": [
        231.8199005126953,
        319.11590576171875,
        435.860107421875,
        339.3914031982422
      ],
      "Font": {
        "alt_family_name": "Courier New",
        "embedded": true,
        "encoding": "Identity-H",
        "family_name": "Courier New",
        "font_type": "CIDFontType2",
        "italic": false,
        "monospaced": false,
        "name": "EDGRTV+CourierNew,Bold",
        "subset": true,
        "weight": 700
      },
      "HasClip": false,
      "Lang": "en",
      "Page": 24,
      "Path": "//Document/L[17]/LI/LBody/P[4]",
      "Text": "GROUP BY c.name, i._____________ ",
      "TextSize": 10.5,
      "attributes": {
        "LineHeight": 12.625,
        "SpaceAfter": 6.625
      }
    },
    {
      "Bounds": [
        231.8199005126953,
        306.6208953857422,
        448.57875061035156,
        326.8963928222656
      ],
      "Page": 24,
      "Path": "//Document/L[17]/LI/LBody/Figure",
      "attributes": {
        "BBox": [
          231.8309999999983,
          311.41999999999825,
          445.9019999999873,
          322.2189999999973
        ],
        "Placement": "Block"
      },
      "filePaths": [
        "figures/fileoutpart45.png"
      ]
    },
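
For anyone who wants to find these cases automatically, here is a minimal sketch in Python that scans the extracted elements for figures that look like a single rasterized line of text. It is a heuristic, not part of the Extract API: the function names `find_text_like_figures` and `load_elements`, and both thresholds, are hypothetical. It assumes the output file holds the list under a top-level `elements` key and that boxes are `[left, bottom, right, top]` in points, which matches the values in the JSON above.

```python
import json


def load_elements(path):
    """Read the extracted JSON file (assumed to have a top-level "elements" list)."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)["elements"]


def find_text_like_figures(elements, x_tolerance=2.0, height_factor=1.5):
    """Flag figures that sit where the next line of text would be.

    A figure is flagged when it follows a text element on the same page,
    starts at (almost) the same left edge, and is no taller than about
    one line of that text. Both thresholds are guesses to tune per document.
    """
    suspects = []
    last_text = None
    for element in elements:
        if "Text" in element:
            last_text = element
            continue

        # The last path segment looks like "Figure" or "Figure[2]" for images.
        last_segment = element.get("Path", "").rsplit("/", 1)[-1]
        if last_text is None or not last_segment.startswith("Figure"):
            continue

        # Prefer the tighter attributes.BBox; fall back to Bounds.
        box = element.get("attributes", {}).get("BBox") or element.get("Bounds")
        text_box = last_text.get("Bounds")
        text_size = last_text.get("TextSize")
        if box is None or text_box is None or text_size is None:
            continue

        same_page = element.get("Page") == last_text.get("Page")
        left_aligned = abs(box[0] - text_box[0]) <= x_tolerance
        one_line_tall = (box[3] - box[1]) <= height_factor * text_size

        if same_page and left_aligned and one_line_tall:
            suspects.append({
                "path": element.get("Path"),
                "page": element.get("Page"),
                "files": element.get("filePaths", []),
                "after_text": last_text["Text"].strip(),
            })
    return suspects
```

Calling `find_text_like_figures(load_elements("structuredData.json"))` (the file name is an assumption about where the output was saved) returns one entry per suspect figure, including the image file paths, so those images could then be passed to OCR or reviewed by hand.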
Participant
December 27, 2021

And here is a screenshot of what it looks like in the PDF: