Extract PDF API returns BAD_PDF_UNSUPPORTED_FONT

Report · Apr 15, 2024

I've created a PDF using Adobe InDesign and was attempting to extract the text from the PDF using the API. However, I'm getting the following error:

Known exception encountered while executing operation ServiceApiError: BAD_PDF - Unable to extract content.: The input file contains font data that is corrupted or not supported

Then further down:

BAD_PDF_UNSUPPORTED_FONT

When I use the Export PDF API to turn the PDF file into an MS Word .docx file, then use Word to print to PDF, and try the Extract PDF API on the modified PDF file, I don't encounter the same problem.
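
In code, that workaround amounts to a "try, then rebuild and retry" pattern. The sketch below is only an illustration of that control flow, not the PDF Services SDK: `ExtractError`, its `error_code` attribute, and the `extract` and `rebuild` callables are hypothetical stand-ins for whatever the real extract call and the Export-to-Word-and-reprint step look like in your setup.

```python
from typing import Callable


class ExtractError(Exception):
    """Hypothetical stand-in for the SDK's service error (not a real SDK class)."""

    def __init__(self, message: str, error_code: str) -> None:
        super().__init__(message)
        self.error_code = error_code


def extract_with_fallback(
    pdf_path: str,
    extract: Callable[[str], str],
    rebuild: Callable[[str], str],
) -> str:
    """Try to extract text; on a font error, rebuild the PDF and retry once.

    `extract` takes a PDF path and returns the extracted text.
    `rebuild` takes a PDF path and returns the path of a regenerated PDF
    (for example, one produced by exporting to .docx and printing back to PDF).
    Both are assumptions supplied by the caller.
    """
    try:
        return extract(pdf_path)
    except ExtractError as err:
        # Only the font failure is worth retrying; re-raise anything else.
        if err.error_code != "BAD_PDF_UNSUPPORTED_FONT":
            raise
    # Second attempt runs on the regenerated file; errors here propagate.
    return extract(rebuild(pdf_path))
```

This keeps the extra hoops in one place, but it does not remove them: the rebuild step still costs an additional conversion for every PDF that fails.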

Does anyone know of a way to make the Extract API more forgiving, i.e. allowing me to get the desired result without jumping through additional hoops? Or why the original PDF was generated in a way that makes the Extract API reject the font metadata?

Report · Apr 15, 2024

Can you share the PDF in question?

Report · Apr 15, 2024

Hi Joel, I've added a copy of the knitting pattern called "flax sweater" as a separate reply to my original post.

Report · Apr 15, 2024

An example PDF is attached.

Report · Apr 17, 2024

There are, in fact, a ton of font errors in this PDF. Exporting to Word uses a different tool, which is why that route can read the file.

Report · Apr 22, 2024

Thanks for taking a look, Joel. There are a couple of things that don't seem right to me about this:

Why would InDesign allow the creation of a PDF file with so many errors in it?
And why does the Extract API care about the errors?
- Its primary purpose is to extract text from the PDF, so refusing to handle a PDF that looks valid to an end user doesn't feel robust enough to be generally useful.

Report · Apr 23, 2024

I wish I had answers for you. Fonts are really complicated.
