Discrepancy between Adobe Extract PDF and the PDF content

Report · Oct 18, 2022

I am new to PDF and would appreciate an explanation of how the Adobe SDK API works, based on the Adobe extractPDF sample and its class ExtractTextInfoFromPDF.java.

I have a source PDF that contains this definition:

7 0 obj
<<
/Type /Font
/Subtype /Type1
/BaseFont /Helvetica
/Encoding /WinAnsiEncoding
>>
endobj

The page content includes the following text sequence:

BT
3 Tr
0.00 Tc
/F3 10.5 Tf
1 0 0 1 302.16 776.64 Tm
<i,/ILLENEUVE > Tj
ET

And when I run the extractPDF sample via the Adobe API to get the TEXT info, I get this:

"Font": {
				"alt_family_name": "* Titlingmes New Roman",
				"embedded": true,
				"encoding": "Identity-H",
				"family_name": "* Titlingmes New Roman",
				"font_type": "CIDFontType0",
				"italic": false,
				"monospaced": false,
				"name": "*Times New Roman-Bold-3921",
				"subset": false,
				"weight": 700
			},
			"HasClip": false,
			"Lang": "fr",
			"Page": 0,
			"Path": "//Document/Sect/P",
			"Text": "VILLENEUVE ",
			"TextSize": 10.0

As you can see, the API has correctly translated "i,/" (3 characters, unless '/' has a special meaning in this sequence) into the "V" character.

The PDF was generated by a CANON scanner with OCR/tagging enabled: the document is searchable, except when searching for "VILLENEUVE".

Note that when the PDF is opened for display, the letter "V" is not clearly rendered ...

Can someone explain the mystery to me (the TEXT is correctly extracted by the Adobe ExtractPDF API)?

Thanks, Eric

Report · Oct 21, 2022

I assume that if the letter "V" is not clearly displayed, it is because the PDF rendering process performs some extra translation/conversion.

Thanks ...

Report · Oct 21, 2022

MOVED TO THE ACROBAT SDK DISCUSSIONS

Report · Oct 21, 2022

Unfortunately, this is still in the wrong forum. The Extract SDK is a web-based service, not connected to the Acrobat SDK. Correct forum: Document Services APIs - Adobe Support Community

Report · Nov 23, 2022

Thanks. Is there a link to the Extract SDK?

Report · Nov 23, 2022

OK, got it ... it is on the PDF Extract API page ... but I am confused by the fact that it is part of the PDF Services API ...

Report · Nov 23, 2022

It's a service, not an app, so this seems the right home for it. Files are sent to Adobe's servers for conversion. The Acrobat SDK does not use the web, but uses a local subscription to Adobe Acrobat. Adobe actually have three different teams working on PDF extraction (or maybe four if you include Adobe PDF Library).

Report · Dec 13, 2022

To add to the confusion, the Document Services API forum, which referred to itself as the Document Cloud SDK, has been renamed to the Acrobat Services API without any explanation. So this is a set of services that are not included with Acrobat and not connected to the Acrobat SDK or the API offered by Acrobat, but are now called the Acrobat API. What could possibly go wrong?

Report · Dec 06, 2022

As you can see, the API has correctly translated "i,/" (3 characters, unless '/' has a special meaning in this sequence) into the "V" character.
The PDF was generated by a CANON scanner with OCR/tagging enabled: the document is searchable, except when searching for "VILLENEUVE".
Note that when the PDF is opened for display, the letter "V" is not clearly rendered ...
Can someone explain the mystery to me (the TEXT is correctly extracted by the Adobe ExtractPDF API)?

Apparently the CANON scanner's OCR engine could not identify the (unclear) letter "V" correctly; instead it recognized an "i", a comma, and a slash, and added these characters as invisible text to the scanned image. The "3 Tr" operator in your content stream is what makes it invisible: it selects text rendering mode 3, which neither fills nor strokes the glyphs. As a consequence, "VILLENEUVE" cannot be found. (But have you tried searching for "i,/ILLENEUVE"? That might well work...)
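
To make this concrete, here is a minimal Python sketch that scans the content stream fragment from the question for the Tr and Tj operators. It is only a line-oriented illustration, not a real PDF parser: the helper name shown_text is made up, it assumes the stream is already uncompressed, and it treats whatever sits between the string delimiters as literal text, exactly as posted.

```python
import re

# Content stream fragment quoted in the question (delimiters kept as posted).
CONTENT = """BT
3 Tr
0.00 Tc
/F3 10.5 Tf
1 0 0 1 302.16 776.64 Tm
<i,/ILLENEUVE > Tj
ET"""


def shown_text(content):
    """Return (render_mode, text) pairs for each Tj in a simple content stream.

    Hypothetical helper: handles one operator per line only.
    """
    mode = 0  # default text rendering mode in PDF: fill
    out = []
    for line in content.splitlines():
        line = line.strip()
        m = re.fullmatch(r"(\d+)\s+Tr", line)
        if m:
            mode = int(m.group(1))  # mode 3 means invisible text
            continue
        m = re.fullmatch(r"[<(](.*)[>)]\s*Tj", line)
        if m:
            out.append((mode, m.group(1)))
    return out


results = shown_text(CONTENT)
for mode, text in results:
    visibility = "invisible" if mode == 3 else "visible"
    print(f"mode {mode} ({visibility}): {text!r}")
    print("matches VILLENEUVE:", "VILLENEUVE" in text)
    print("matches i,/ILLENEUVE:", "i,/ILLENEUVE" in text)
```

On the fragment as posted, the hidden text layer is reported in mode 3, a search for "VILLENEUVE" fails, and a search for "i,/ILLENEUVE" succeeds, which is consistent with the search behaviour you describe.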

The ExtractPDF API you called, on the other hand, appears to have ignored the text data from the CANON OCR engine and instead applied its own OCR routines to the scanned image of the page. Those routines appear to have been more successful and recognized the "V".
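
If you want to check what the Extract output contains without reading the JSON by hand, something along these lines should do. This is a sketch under assumptions: the SAMPLE string is a trimmed stand-in built from the fragment in your post, the top-level "elements" array is my assumption about the surrounding structure (your post only shows one element), and find_text is a made-up helper name.

```python
import json

# Trimmed stand-in for the extraction result, built from the fragment in the
# question. The enclosing "elements" array is an assumption.
SAMPLE = """
{
  "elements": [
    {
      "Font": {"name": "*Times New Roman-Bold-3921", "weight": 700},
      "HasClip": false,
      "Lang": "fr",
      "Page": 0,
      "Path": "//Document/Sect/P",
      "Text": "VILLENEUVE ",
      "TextSize": 10.0
    }
  ]
}
"""


def find_text(structured, needle):
    """Return (page, path, text) for every element whose Text contains needle."""
    hits = []
    for element in structured.get("elements", []):
        text = element.get("Text", "")  # not every element carries text
        if needle in text:
            hits.append((element.get("Page"), element.get("Path"), text))
    return hits


data = json.loads(SAMPLE)
hits = find_text(data, "VILLENEUVE")
for page, path, text in hits:
    print(f"page {page} {path}: {text!r}")
print("scanner OCR string present:", bool(find_text(data, "i,/")))
```

If the scanner's "i,/" string never shows up in the extracted elements while "VILLENEUVE" does, that would support the idea that the service produced its own text rather than reusing the hidden text layer.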

Report · Dec 07, 2022

"Unfortunately, this is now still in the wrong forum. The Extract SDK is a web based service, not connected to the Acrobat SDK. Correct forum: Document Services APIs - Adobe Support Community"

==> Moved to the Document Services APIs - Adobe Support Community discussions
