Mark ORIMBELLI LLC
Inspiring
February 26, 2022
Question

PDF generated with unreadable characters for Googlebot crawling (and for copy-pasting)

Source: InDesign 2022 documents

Output: PDF (any format) with Embedded Fonts (western alphabets)

Problem: On screen and in print the PDF looks fine, but when Googlebot crawls it the text is a bunch of unreadable "garbage" characters.

Note to readers: Please do not suggest the "Copy-With-Formatting" option as a solution. This post is about Google crawling and search indexing.

Link to PDF as example.

4 replies

Community Expert
February 28, 2022

Hi Mark,

just to make it clear, your solution was to enable the option:

[x] Create Tagged PDF

Thanks,
Uwe Laubender

( ACP )

Mark ORIMBELLI LLC
Inspiring
March 1, 2022

Exactly.

You need to enable the option: [x] Create Tagged PDF

to obtain a PDF that Googlebot & friends can read correctly, and therefore index.

Dave Creamer of IDEAS
Community Expert
March 1, 2022

That feature is enabled by default. The PDF sample you uploaded was tagged. 

David Creamer: Community Expert (ACI and ACE 1995-2023)
Mark ORIMBELLI LLC
Inspiring
February 27, 2022

UPDATE: the issue seems to be solved by ticking "Create Tagged PDF".

Dave Creamer of IDEAS
Community Expert
February 26, 2022

I exported to Word and it came out with a lot of "garbage" characters. I suspect it is the font--as a test, try another font, such as an Adobe font.

David Creamer: Community Expert (ACI and ACE 1995-2023)
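
When repeating the font test above, it can help to put a number on how "garbage" the extracted text is, so two exports can be compared. The following is a minimal sketch, not part of anyone's workflow in this thread: `readable_ratio` is a hypothetical helper, and it assumes you have already pulled the text out of the PDF by some other means (copy-paste, export to Word, a text extractor). It only catches control characters, private-use code points and replacement characters; text that comes out as the wrong letters would still score high.

```python
import unicodedata


def readable_ratio(text: str) -> float:
    """Return the share of characters that look like ordinary text.

    Letters, digits, punctuation, spaces, newlines and tabs count as
    readable. Control characters, private-use code points and symbols
    such as U+FFFD (the replacement character) do not.
    """
    if not text:
        return 0.0
    readable = 0
    for ch in text:
        category = unicodedata.category(ch)
        # L* = letters, N* = numbers, P* = punctuation, Zs = space separator
        if category[0] in "LNP" or category == "Zs" or ch in "\n\t":
            readable += 1
    return readable / len(text)


if __name__ == "__main__":
    print(readable_ratio("Hello, world"))
    print(readable_ratio("\x01\x02\x03\x03\x04"))
```

A ratio close to 1.0 for one font and much lower for another would support the suspicion that the font is involved, though it would not prove it.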
James Gifford—NitroPress
Brainiac
February 27, 2022

Ah, didn't think to try that (but then, I am wary of downloading and messing with files, even here in a fairly safe zone). Still not sure how a font, which AFAIK is only called on at rendering/display time, could mangle the text that is in theory more clear at a bot-search level.

Strange. I've never heard of an unreadable PDF, in English, and that works in every other way.
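
On how a font could affect extracted text at all: a PDF page does not necessarily store its text as Unicode. The page content holds character codes that are interpreted through the font, and an extractor needs a code-to-Unicode mapping (commonly a ToUnicode CMap attached to the font) to turn them back into readable text. The toy model below illustrates that idea only. The glyph codes, the mapping table and the `extract_text` function are all invented for the example, and nothing here shows that this is what happened with the PDF in this thread.

```python
from typing import Dict, Optional

# Toy model only: the codes and table below are made up for illustration.
# A subset font may renumber its glyphs, so the page content can hold
# arbitrary codes rather than Unicode values.
GLYPH_CODES = bytes([0x01, 0x02, 0x03, 0x03, 0x04])  # renders as "Hello"

# Hypothetical code -> Unicode table, standing in for a ToUnicode CMap.
TO_UNICODE: Dict[int, str] = {0x01: "H", 0x02: "e", 0x03: "l", 0x04: "o"}


def extract_text(codes: bytes, to_unicode: Optional[Dict[int, str]]) -> str:
    """Map glyph codes to text the way a very simple extractor might."""
    if to_unicode:
        # Unknown codes become U+FFFD, the Unicode replacement character.
        return "".join(to_unicode.get(code, "\ufffd") for code in codes)
    # No usable mapping: fall back to reading each code as Latin-1.
    return codes.decode("latin-1")


if __name__ == "__main__":
    print(repr(extract_text(GLYPH_CODES, TO_UNICODE)))  # 'Hello'
    print(repr(extract_text(GLYPH_CODES, None)))        # '\x01\x02\x03\x03\x04'
```

In this model the page would still display and print correctly, because the font draws the right glyph for each code; only the extraction step goes wrong when the mapping is missing or unusable.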

James Gifford—NitroPress
Brainiac
February 26, 2022

Quite honestly, if it exports correctly and views correctly in Acrobat, I'd think it's one of two things.

First, the embedded fonts are encrypted in standard Adobe practice, and that's somehow confusing Google, even though it should be reading the raw text and ignoring things like font and layout.

Second... it's Google's problem. 🙂