Copy link to clipboard
Copied
Hi,
I've been using the PDF Extarct API to display the content of a PDF as an HTML file, so far it's working well, but I'm seeing a couple of odd cases that you might be interested in. The following happens in a pdf that has sets of fill-in-the-blank questions.
For the folllwing text from the PDF:
SELECT c.name 'Channel',
COUNT(i.dataid) 'Count of Items'
FROM __________ i, __________ c
WHERE i.parentid = c.___________
AND i.____________ = 208
GROUP BY c.name, i._____________ ORDER BY COUNT(i.___________) DESC
The last line is extracted as an image instead of text, see below
Copy link to clipboard
Copied
Here is the json that gets generated:
{
"Bounds": [
231.8408966064453,
331.61090087890625,
435.8748016357422,
364.3813934326172
],
"Font": {
"alt_family_name": "Courier New",
"embedded": true,
"encoding": "Identity-H",
"family_name": "Courier New",
"font_type": "CIDFontType2",
"italic": false,
"monospaced": false,
"name": "EDGRTV+CourierNew,Bold",
"subset": true,
"weight": 700
},
"HasClip": false,
"Lang": "en",
"Page": 24,
"Path": "//Document/L[17]/LI/LBody/P[3]",
"Text": "WHERE i.parentid = c.___________ AND i.____________ = 208 ",
"TextSize": 10.5,
"attributes": {
"LineHeight": 12.5
}
},
{
"Bounds": [
231.8199005126953,
319.11590576171875,
435.860107421875,
339.3914031982422
],
"Font": {
"alt_family_name": "Courier New",
"embedded": true,
"encoding": "Identity-H",
"family_name": "Courier New",
"font_type": "CIDFontType2",
"italic": false,
"monospaced": false,
"name": "EDGRTV+CourierNew,Bold",
"subset": true,
"weight": 700
},
"HasClip": false,
"Lang": "en",
"Page": 24,
"Path": "//Document/L[17]/LI/LBody/P[4]",
"Text": "GROUP BY c.name, i._____________ ",
"TextSize": 10.5,
"attributes": {
"LineHeight": 12.625,
"SpaceAfter": 6.625
}
},
{
"Bounds": [
231.8199005126953,
306.6208953857422,
448.57875061035156,
326.8963928222656
],
"Page": 24,
"Path": "//Document/L[17]/LI/LBody/Figure",
"attributes": {
"BBox": [
231.8309999999983,
311.41999999999825,
445.9019999999873,
322.2189999999973
],
"Placement": "Block"
},
"filePaths": [
"figures/fileoutpart45.png"
]
},
Copy link to clipboard
Copied
And here is a screenshot of what it looks like in the PDF:
Copy link to clipboard
Copied
Can you share the actual PDF. I'd like to use it to train our AI.
Copy link to clipboard
Copied
Sorry, It's a client document that I can't share.
Find more inspiration, events, and resources on the new Adobe Community
Explore Now