
Visible text in PDF is fullcaps/uppercase but the extracted text has random letters in lowercase

Community Beginner ,
Sep 19, 2020


Hello. I received a PDF from an external source. The PDF contains some text in uppercase.

When I extract this text by copying and pasting it to Word or Notepad, the words that appeared in uppercase in the PDF contain random lowercase letters in Word/Notepad.

 

For example:

 

Visible in the PDF:

THIS IS A TEST

 

After copying to Word/Notepad:

ThiS iS A TeST

 

Does anyone know why this might be happening? Thanks

Adobe Community Professional
Correct answer by Bevi_Chagnon___PubCom | Adobe Community Professional

It's because some of the text was deliberately typed in all caps in the source document (such as typing with the Shift key pressed down in MS Word), while other parts used the Caps/Lowercase icon in the ribbon bar to make it caps.

 

Capital letters are different characters than lowercase letters, each with its own code point: see this basic Unicode character chart https://www.unicode.org/charts/PDF/U0000.pdf. Capital A = code point 0041, and lowercase a = code point 0061.
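To see the code point difference for yourself, here is a minimal Python sketch. The helper name code_points is only for illustration, and it assumes Python 3.9 or later.

```python
# Uppercase and lowercase letters are distinct characters with distinct code points.
def code_points(text: str) -> list[str]:
    """Return the Unicode code point of each character, formatted as U+XXXX."""
    return [f"U+{ord(ch):04X}" for ch in text]

print(code_points("A"))     # ['U+0041']
print(code_points("a"))     # ['U+0061']

# The pasted word from the question mixes both kinds of characters.
print(code_points("ThiS"))  # ['U+0054', 'U+0068', 'U+0069', 'U+0053']
```

Any code point in the range 0061 to 007A in the pasted text is a lowercase letter that was stored that way, whatever it looked like on the page.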

 

Some word processing programs retain the originally created CAPS / lowercase letters as they were typed, but change their appearance (not the actual character) with the "case toggle" icon.

That icon often changes only the APPEARANCE of the letters, not the actual characters.

 

When the PDF is exported, it retains both the original case of the characters and their rendered appearance.

And when you later extract the content from the PDF, you get the characters in their original case, not the case they appear in on the page.

In your sample, it's clear that the original author was not consistent in how they typed the actual letters, and that was masked by the case toggle icon, so that all the letters appeared as CAPS, but in reality weren't.

 

Bottom line: that's how the content was originally typed, and it was carried through all of the file variations.
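As a rough illustration of the explanation above, here is a toy Python model. It is only a sketch: the TextRun class and its all_caps flag are hypothetical names made up for this example, and it does not reflect how Word or a PDF actually stores text internally.

```python
from dataclasses import dataclass

@dataclass
class TextRun:
    stored: str             # the characters exactly as they were typed
    all_caps: bool = False  # hypothetical display-only formatting flag

    def rendered(self) -> str:
        """What the reader sees on the page."""
        return self.stored.upper() if self.all_caps else self.stored

    def extracted(self) -> str:
        """What copy and paste returns: the stored characters, untouched."""
        return self.stored

run = TextRun(stored="ThiS iS A TeST", all_caps=True)
print(run.rendered())   # THIS IS A TEST
print(run.extracted())  # ThiS iS A TeST
```

The formatting flag changes only what is drawn, so the inconsistent typing stays hidden until the text is extracted.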

 

TOPICS
Edit and convert PDFs


Adobe Community Professional ,
Sep 19, 2020


Hi Thomas,

 

Yeah, the original came from a printed document that was scanned and then OCRed. We do not know what the original document looked like or how good it was, what resolution it was scanned at, or what software generated the PDF or did the OCR.

 

So, the short answer is I don't know.

 

If you are exporting the text, you can easily fix it by using Word's change-case options: all caps, title case, lowercase, whatever you need.
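If the text ends up somewhere other than Word, the same cleanup can be scripted. This is a minimal Python sketch, assuming the pasted text is meant to be entirely uppercase; the function names are made up for this example.

```python
def mixed_case_words(text: str) -> list[str]:
    """List the words that contain both uppercase and lowercase letters."""
    return [
        word for word in text.split()
        if any(ch.islower() for ch in word) and any(ch.isupper() for ch in word)
    ]

def force_upper(text: str) -> str:
    """Convert every character to its uppercase form."""
    return text.upper()

pasted = "ThiS iS A TeST"
print(mixed_case_words(pasted))  # ['ThiS', 'iS', 'TeST']
print(force_upper(pasted))       # THIS IS A TEST
```

Checking for mixed-case words first shows which parts of the extracted text were affected before you overwrite them.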

Adobe Community Professional ,
Sep 19, 2020


Bevi Chagnon | Designer & Technologist for Accessible InDesign + PDFs | Books @ www.PubCom.com/books — NEW! Accessible InDesign + PDF

Community Beginner ,
Sep 21, 2020


Thank you both for really great answers. I thought my question was unanswerable!

One other thing I have noticed. If I edit the text in the PDF and change the font (I don't do anything else apart from changing the font), the problem goes away (i.e. the text is in all caps when I copy into Word).
Does that extra bit of info change anything in your opinion?

Thanks again for all your help.

Adobe Community Professional ,
Sep 21, 2020


Yes, that confirms to me that Bevi is correct!

Community Beginner ,
Sep 21, 2020


Thank you very much.

 

Thomas
