
Why is this PDF not searchable? Not scanned, maybe vector, but some text is included...

Community Beginner, Dec 20, 2024


From time to time, I come across a PDF like this one:

https://www.warco.co.uk/img/cms/WM14%20Operators%20Manual%20&%20Parts%20List_1.pdf

When I try to find, let's say, the word MANUAL, the search does not find it.

 

manual_unsearchable.png

 

When I try to copy the word MANUAL and paste it into the search window, the pasted text is not readable.

 

manual_unsearchable_2.png

 

When I try File > Save as Text, the result is unreadable.

What is wrong with this PDF? Can I repair it somehow?

TOPICS
Edit and convert PDFs , General troubleshooting , Scan documents and OCR


Community Expert, Dec 20, 2024


Bad font encoding. Export all pages to (high-quality) images, such as PNG, then create a new PDF from those images and run Text Recognition on it.
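
For anyone who prefers the command line to Acrobat, the same three steps (render pages to images, OCR each image, combine the results) can be sketched with Poppler and Tesseract. This is only a rough sketch, not what the reply above used: it assumes `pdftoppm`, `tesseract`, and `pdfunite` are installed, and the file names are hypothetical. The Python below only builds and prints the commands; it does not run them.

```python
import shlex
from pathlib import Path
from typing import List


def rasterize_cmd(pdf: str, out_prefix: str, dpi: int = 300) -> List[str]:
    # pdftoppm (Poppler): render every page of the PDF to a PNG at `dpi`.
    return ["pdftoppm", "-r", str(dpi), "-png", pdf, out_prefix]


def ocr_page_cmd(image: str, lang: str = "eng") -> List[str]:
    # tesseract <image> <output base> -l <lang> pdf
    # The trailing "pdf" config makes Tesseract write <output base>.pdf,
    # a page image with an invisible text layer.
    out_base = str(Path(image).with_suffix(""))
    return ["tesseract", image, out_base, "-l", lang, "pdf"]


def merge_cmd(page_pdfs: List[str], out_pdf: str) -> List[str]:
    # pdfunite (Poppler): concatenate the per-page PDFs in the given order.
    return ["pdfunite", *page_pdfs, out_pdf]


# Hypothetical file names; the exact page numbering in the PNG names
# is chosen by pdftoppm, so list the real files before running step 2.
commands = [
    rasterize_cmd("manual.pdf", "page"),
    ocr_page_cmd("page-1.png"),
    ocr_page_cmd("page-2.png"),
    merge_cmd(["page-1.pdf", "page-2.pdf"], "manual_searchable.pdf"),
]
for cmd in commands:
    print(shlex.join(cmd))
```

A higher `dpi` generally helps recognition but makes the resulting file larger.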

Community Expert, Dec 20, 2024


"Can I repair it somehow?"

No. Most of the text in this file has no correct mapping from its glyphs to Unicode characters, so the embedded font subsets can only be used for printing. What makes it worse is that the fonts look like they have been re-encoded and re-subsetted from already subsetted fonts, which takes the text even further from anything recognizable. Any text that is copied and pasted or exported will indeed be gibberish (with the exception of page 3, which has a standard encoding). This cannot be fixed at this point; the damage is done. The file would have to be recreated from its original sources (assuming this is a file you or your company created).

If this is not your PDF and YOU just want to make it searchable for your own purposes, the route @try67 suggested would indeed work.
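
To illustrate the encoding problem described above, here is a toy model in Python. It is not a PDF parser: the glyph codes and both mapping tables are made up for the example. The idea is that the page stores glyph codes, the font draws the right shapes for them, and a separate code-to-Unicode table is what search, copy, and export rely on. If that table is wrong, the page still looks fine but the extracted text is garbage.

```python
def extract_text(glyph_codes, to_unicode):
    # Translate each glyph code through the font's code-to-Unicode table.
    # Codes with no entry come out as U+FFFD (the replacement character).
    return "".join(to_unicode.get(code, "\ufffd") for code in glyph_codes)


# Hypothetical subset font: glyphs numbered in order of first use.
# On the page these six codes are drawn as the word MANUAL.
glyph_codes = [1, 2, 3, 4, 2, 5]

# A correct table maps every code to the character its glyph depicts.
good_map = {1: "M", 2: "A", 3: "N", 4: "U", 5: "L"}

# A damaged table (made up here) maps the same codes to unrelated characters.
broken_map = {code: chr(0x20 + code) for code in good_map}

good_text = extract_text(glyph_codes, good_map)
broken_text = extract_text(glyph_codes, broken_map)

print(good_text)               # the word as drawn on the page
print(broken_text)             # what copy/paste or Save as Text would return
print("MANUAL" in broken_text) # why the search finds nothing
```

Because the shapes on the page are still correct, OCR on a rendered image can recover the text even though the table cannot be repaired.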

Community Beginner, Dec 22, 2024


Thanks. https://tools.pdf24.org/en/ocr-pdf and https://www.ilovepdf.com/ocr-pdf both did it for me, but each in quite a different way.

The first made it searchable, but I still cannot copy readable text from the converted pdf. The second made it searchable and text is without gibberish, but the file size expanded from 1.78 MB to 5.95 MB.

Community Expert, Dec 22, 2024


The I Love PDF one is doing essentially what @try67 suggested. It converts each page to an image, then performs OCR on those images. The OCR text is hidden but accessible for search and copy-and-paste operations, so what you SEE is the rasterized image (in I Love PDF's case, around 150 ppi), while the text comes from the hidden layer. As for the file size: because your converted PDF is now all images, the file size increases accordingly.

I tried to see what the other one does, but it really did not give me a PDF that was any better than your original; i.e., the only text that was searchable was on page 3.
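
The file-size growth follows from simple arithmetic on the rasterized pages. As a rough sketch, assuming an A4 page (595 x 842 points, which may not match this manual) and the roughly 150 ppi mentioned above, the pixel dimensions and the uncompressed RGB size of one page image can be computed like this:

```python
def raster_page_size(width_pt, height_pt, ppi, channels=3):
    # PDF page sizes are given in points; there are 72 points per inch.
    width_px = round(width_pt / 72 * ppi)
    height_px = round(height_pt / 72 * ppi)
    # One byte per channel per pixel, before any image compression.
    raw_bytes = width_px * height_px * channels
    return width_px, height_px, raw_bytes


# Assumed page size: A4 in points. 150 ppi is the figure quoted in the thread.
width_px, height_px, raw_bytes = raster_page_size(595, 842, 150)
print(width_px, height_px, raw_bytes)
```

The images inside the PDF are compressed, so the real per-page cost is far below the raw figure, but every page now carries a full-page image where the original stored only text and vector drawing instructions.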

Community Beginner, Dec 23, 2024


"it really did not give me a PDF that was any better than your original; i.e., the only text that was searchable was on page 3"

 

I had to check "Force OCR" (and select English); only then did it produce a completely searchable PDF.
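
A comparable switch exists in the open-source OCRmyPDF tool, which is a different program from the PDF24 web page (how PDF24 implements its checkbox is not known here). By default OCRmyPDF declines to process pages that already contain text; `--force-ocr` tells it to rasterize every page and OCR it anyway, which is what a file with a broken text layer needs. The sketch below only builds the command line; it assumes `ocrmypdf` is installed, and the file names are hypothetical.

```python
import shlex
from typing import List


def ocrmypdf_cmd(src: str, dst: str, lang: str = "eng", force: bool = True) -> List[str]:
    # -l selects the Tesseract language pack ("eng" for English).
    cmd = ["ocrmypdf", "-l", lang]
    if force:
        # Rasterize and OCR every page, even pages that already contain text.
        cmd.append("--force-ocr")
    cmd += [src, dst]
    return cmd


# Hypothetical input and output names.
forced = ocrmypdf_cmd("manual.pdf", "manual_searchable.pdf")
print(shlex.join(forced))
```

As with the other image-based routes in this thread, the forced output is built from page images, so expect a larger file.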
