
Some text unexpectedly turned into images by the PDF Extract API

New Here, Dec 27, 2021

Hi,

 

I've been using the PDF Extract API to display the content of a PDF as an HTML file. So far it's working well, but I'm seeing a couple of odd cases that you might be interested in. The following happens in a PDF that has sets of fill-in-the-blank questions.

 

For the following text from the PDF:

 

SELECT c.name 'Channel',
COUNT(i.dataid) 'Count of Items'
FROM __________ i, __________ c
WHERE i.parentid = c.___________
AND i.____________ = 208
GROUP BY c.name, i._____________
ORDER BY COUNT(i.___________) DESC

 

The last line is extracted as an image instead of text; see below:

 

Screen Shot 2021-12-27 at 11.17.27 AM.png

TOPICS
PDF Extract API
New Here, Dec 27, 2021

Here is the JSON that gets generated:

 

    {
      "Bounds": [
        231.8408966064453,
        331.61090087890625,
        435.8748016357422,
        364.3813934326172
      ],
      "Font": {
        "alt_family_name": "Courier New",
        "embedded": true,
        "encoding": "Identity-H",
        "family_name": "Courier New",
        "font_type": "CIDFontType2",
        "italic": false,
        "monospaced": false,
        "name": "EDGRTV+CourierNew,Bold",
        "subset": true,
        "weight": 700
      },
      "HasClip": false,
      "Lang": "en",
      "Page": 24,
      "Path": "//Document/L[17]/LI/LBody/P[3]",
      "Text": "WHERE i.parentid = c.___________ AND i.____________ = 208 ",
      "TextSize": 10.5,
      "attributes": {
        "LineHeight": 12.5
      }
    },
    {
      "Bounds": [
        231.8199005126953,
        319.11590576171875,
        435.860107421875,
        339.3914031982422
      ],
      "Font": {
        "alt_family_name": "Courier New",
        "embedded": true,
        "encoding": "Identity-H",
        "family_name": "Courier New",
        "font_type": "CIDFontType2",
        "italic": false,
        "monospaced": false,
        "name": "EDGRTV+CourierNew,Bold",
        "subset": true,
        "weight": 700
      },
      "HasClip": false,
      "Lang": "en",
      "Page": 24,
      "Path": "//Document/L[17]/LI/LBody/P[4]",
      "Text": "GROUP BY c.name, i._____________ ",
      "TextSize": 10.5,
      "attributes": {
        "LineHeight": 12.625,
        "SpaceAfter": 6.625
      }
    },
    {
      "Bounds": [
        231.8199005126953,
        306.6208953857422,
        448.57875061035156,
        326.8963928222656
      ],
      "Page": 24,
      "Path": "//Document/L[17]/LI/LBody/Figure",
      "attributes": {
        "BBox": [
          231.8309999999983,
          311.41999999999825,
          445.9019999999873,
          322.2189999999973
        ],
        "Placement": "Block"
      },
      "filePaths": [
        "figures/fileoutpart45.png"
      ]
    },
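
For anyone who wants to spot these cases in their own output, here is a minimal sketch (not an official Adobe tool). It flags any Figure element that shares a parent node with text paragraphs, which is the pattern in the JSON above. The function name `find_inline_figures` and the trimmed `sample` list are made up for illustration; only the "Path", "Text", "Page" and "filePaths" keys come from the output shown.

```python
def find_inline_figures(elements):
    """Return Figure elements whose parent node also contains text elements."""
    # Parent paths of every element that carries extracted text.
    text_parents = {
        el["Path"].rpartition("/")[0]
        for el in elements
        if "Text" in el and "Path" in el
    }
    flagged = []
    for el in elements:
        parent, _, leaf = el.get("Path", "").rpartition("/")
        # A leaf may carry an index such as "Figure[2]", so strip it first.
        if leaf.split("[")[0] == "Figure" and parent in text_parents:
            flagged.append(el)
    return flagged


# Trimmed copies of the three elements from the JSON above (illustration only).
sample = [
    {"Path": "//Document/L[17]/LI/LBody/P[3]", "Page": 24,
     "Text": "WHERE i.parentid = c.___________ AND i.____________ = 208 "},
    {"Path": "//Document/L[17]/LI/LBody/P[4]", "Page": 24,
     "Text": "GROUP BY c.name, i._____________ "},
    {"Path": "//Document/L[17]/LI/LBody/Figure", "Page": 24,
     "filePaths": ["figures/fileoutpart45.png"]},
]

flagged = find_inline_figures(sample)
for el in flagged:
    print(el["Page"], el["Path"], el.get("filePaths"))
```

On a full extraction result you would pass the whole list of elements instead of `sample`. I'm assuming that list sits under a top-level "elements" key in the generated JSON, so please check that against your own output.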
New Here, Dec 27, 2021

And here is a screenshot of what it looks like in the PDF:

 

Screen Shot 2021-12-27 at 11.30.46 AM.png

Community Expert, Feb 16, 2022

Can you share the actual PDF? I'd like to use it to train our AI.

New Here, Feb 23, 2022

Sorry, it's a client document that I can't share.
