Check whether a document is searchable

New Here, Oct 04, 2019

Hello, I've been trying to check whether a PDF is searchable and, if it's not, automatically run an OCR action on the document. So far I've found two ways of doing it, neither of them efficient. My question is about the function search.available. I read in the Acrobat JavaScript Scripting Guide that it can determine whether searching is possible. But when I used it on a PDF that is not searchable, the result was confusing: it returned "true" even though the document clearly cannot be searched. I'm very new to JavaScript; can you show me how to use the function correctly? Or do you have any idea how to check whether a document is searchable, other than activating the Read Out Loud function or searching for specific word(s)?

Thank you very much! 🙂

TOPICS
How to , Scan documents and OCR , Standards and accessibility
LEGEND, Oct 04, 2019

If you check out the JavaScript API Reference you'll see that search.available doesn't do that at all.

"Returns true if the Search plug-in is loaded and query capabilities are possible. A script author should check this Boolean before performing a query or other search object manipulation."

So it's checking if a plug-in is loaded, not looking at the document. Search, indeed, is different from Find; it's about using indexes that might exist for fast searching. This is far from obvious in the current version.

 

"Searchable" is a word often used, but it has no particular meaning for PDFs. Certainly, PDFs don't have any information in them saying they are "searchable". Here are some possible meanings:

* A file might be called "searchable" if it contains text that extracts to words in your own language.

* A file might be called "searchable" if it contains any text at all, even if it is garbage.

* A file might be called "searchable" if it contains images of text, on which OCR would succeed (since Acrobat Pro may do that if searching).

 

What you can readily do is use doc.getPageNumWords against each page to see if there is any text (the second case); the others are not something you could do in JavaScript.
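
As a rough illustration of the per-page check described above, the sketch below walks every page and records how many words Acrobat can extract. It assumes a Doc object that exposes numPages and getPageNumWords(nPage); the function name summarizeExtractableText and the shape of the returned object are made up for this example. It only covers the "contains any text at all" case.

```javascript
// Sketch: count extractable words on each page of an Acrobat Doc object.
// Assumes `doc` exposes `numPages` and `getPageNumWords(nPage)`.
// The function name and the returned object are illustrative only.
function summarizeExtractableText(doc) {
  var wordsPerPage = [];
  var pagesWithoutText = [];
  for (var p = 0; p < doc.numPages; p++) { // page numbers are 0-based
    var n = doc.getPageNumWords(p);
    wordsPerPage.push(n);
    if (n === 0) {
      pagesWithoutText.push(p); // no extractable words: candidate for OCR
    }
  }
  return {
    wordsPerPage: wordsPerPage,
    pagesWithoutText: pagesWithoutText,
    // true when at least one page yielded at least one word
    hasAnyText: pagesWithoutText.length < wordsPerPage.length
  };
}
```

In the Acrobat JavaScript console this could be called as summarizeExtractableText(this). A non-empty pagesWithoutText list marks pages with no extractable words at all; it says nothing about whether the words that were found are meaningful.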

 

 

LEGEND, Oct 05, 2019

There is no guarantee that all the words on a page can be searched, even if the page contains some searchable words. A PDF can consist of both text and images, and the images can contain words, but those words remain part of the image.

New Here, Apr 10, 2025

Hello.

I am facing the same conundrum now. I have a few thousand personal medical documents, but only about a year ago did I discover a feature of my Brother scanner that lets me include an OCR step for every document I scan.

This means some documents have extra text information on top of the picture and some don't.

As you can imagine, I would like to know which documents contain only a picture and which can be searched for OCR-ized content. As a side note, I have another app that lets me re-process any .pdf and perform this OCR step individually, and I would like to use it, but first I must find out which files need it. Opening one document at a time to see whether I can search for any of the words inside is really not reasonable.

 

I'm looking forward to getting ideas or methods to achieve this goal, thank you in advance.

 

I'm not a programmer so unless I'm provided with a finished and tested program I can't create one or use any of the functions discussed above.

New Here, Apr 11, 2025

Anybody, anything?

New Here, Apr 22, 2025

This forum is dormant...

Community Expert, Apr 23, 2025

Hi @aka_1894 ,

 

Thank you for updating this dormant thread.

In your particular case, I'm afraid that is not possible. Since you're using a third-party tool that won't interface with Acrobat in the context you've described, it may be beneficial to consult Brother customer support for a bulk processing solution.

New Here, May 03, 2025

Hello.

Thank you for your reply, but I have the feeling that you misunderstood what I wrote above, or maybe I just can't imagine how Brother support could help me in this situation, considering that I can't rescan the documents.

What I am left with is a few thousand .pdfs, some of which have the result of the OCR step embedded in the .pdf document and some of which contain only the picture.

That is why I need a tool that can analyze all the .pdf files and indicate which contain embedded text resulting from OCR and which contain only the picture.

Thank you.

Community Expert, May 04, 2025

Hello, I appreciate your inquiry.

 

While I may not have all the details, I noted your mention of a feature in your Brother scanner that enables an OCR step for each document scanned.

 

I suggest reaching out to the manufacturer directly to explore how to access the scripting commands that could facilitate this process in bulk. Could you please clarify which specific OCR step and Brother scanning device you are referring to?

 

It would be helpful to have more precise information.

 

From my understanding, Adobe Acrobat Pro alone does not support the functionality you are seeking. Unfortunately, it lacks the capability to create a batch sequence for analyzing thousands of PDFs simultaneously, and you may encounter limitations based on your operating system as well.

 

However, you can use Acrobat's Print Production tool: go to Preflight, then Options > "Browse the internal structure of all document fonts".

Or you may tailor the Action Wizard to create an automated action: open the Action Wizard tool => select New Action => add the Preflight tool => choose the folder where you placed the PDFs to be analyzed => click Select Folder => Save => add an action name (e.g. "Analyze Font Structure on PDFs") => click Save.

This Preflight feature will help you determine whether a scanned document contains embedded fonts, which is common when OCR has been applied; if OCR wasn't applied, it will show nothing.

 

Please note that this process must be done individually in Acrobat, as bulk processing is not an option. That said, if you search online, you might find third-party tools and Python scripts that can perform the bulk operations you need.

But what you are asking for may call for something simple rather than overcomplicating things; for instance, I have a straightforward batch script that can be used on Windows machines, and I was able to test it with 600 PDF documents, processing them in under 30 seconds.

 

The script carries out several functions.

 

It uses a source folder located on your Desktop, assuming for this example that you have already created a folder named 'SCANNED_PDFS' there and manually moved into it the PDFs that you would like to process. When run, the script creates a subfolder named 'OCRed_PDFs'.

It then looks for the 'FontDescriptor' text string within the PDF structure of each document; if a document was processed by an OCR tool, the script will find the 'FontDescriptor' string in it and move that PDF to the 'OCRed_PDFs' subfolder.

If a document is merely a scanned PDF it will not be moved, since it lacks the FontDescriptor object. This lets you keep the PDFs that were OCR'ed in a separate folder, while the files that are just scans remain intact in their source folder for further OCR processing (which you can do with Acrobat using the Scan & OCR tool or with a third-party command-line batch script).

 

NOTE:

I am not very savvy with advanced batch scripting, so this script will only move PDF files from the parent folder to the subfolder as long as the PDF file names don't include spaces. If a file name includes a space, the script will ignore that file and not process it.

 

For further clarification on what this script does, please refer to the comparison in my slide below, where two PDFs are examined in a text editor, focusing solely on text strings. (Also note that I am not evaluating accessibility, tag structure, or any kind of XMP metadata, much less following any PDF specification according to ISO standards.)

 

[Image: OCR.png]

 

Here is a copy of the script:

 

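@ECHO OFF

REM Folder holding the PDFs to check; adjust UserName to your account.
set "source=C:\Users\UserName\Desktop\SCANNED_PDFS"
cd /d "%source%"

REM Create the destination subfolder unless it already exists.
if not exist "%source%\OCRed_PDFs" mkdir "%source%\OCRed_PDFs"

REM findstr /M prints only the names of files that contain the string.
REM FOR /F splits each name at the first space by default, so names with
REM spaces are not moved; adding "delims=" after /f would keep them whole.
for /f %%A in ('findstr /M "FontDescriptor" *.pdf') DO MOVE "%%A" "%source%\OCRed_PDFs"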
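
For anyone who would rather not depend on batch parsing rules, here is a minimal Node.js sketch of the same heuristic. It only reports which PDFs contain the marker string and moves nothing. The function name findPdfsWithMarker and its return shape are my own inventions for illustration. The same caveats apply as for the batch script: a PDF that stores its objects in compressed object streams may hide the string from a raw byte search, and any PDF with fonts, not only OCR output, will match.

```javascript
// Sketch: sort the PDFs in one folder by whether their raw bytes contain
// a marker string (default "FontDescriptor"). Report only; nothing is moved.
// The function name and return shape are hypothetical, not from any tool.
const fs = require("fs");
const path = require("path");

function findPdfsWithMarker(folder, marker = "FontDescriptor") {
  const withMarker = [];
  const withoutMarker = [];
  for (const name of fs.readdirSync(folder)) {
    if (path.extname(name).toLowerCase() !== ".pdf") continue; // PDFs only
    const full = path.join(folder, name);
    if (!fs.statSync(full).isFile()) continue; // skip folders named *.pdf
    const bytes = fs.readFileSync(full);
    // Buffer.includes is a plain byte search, so binary content is fine
    // and spaces in file names need no special handling.
    if (bytes.includes(marker)) {
      withMarker.push(name);
    } else {
      withoutMarker.push(name);
    }
  }
  return { withMarker, withoutMarker };
}
```

Calling findPdfsWithMarker("C:\\Users\\UserName\\Desktop\\SCANNED_PDFS") would return two lists of file names; the withoutMarker list is the set of files that probably still need OCR. Each file is read fully into memory, which is fine for typical scans but worth keeping in mind for very large files.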

 

New Here, May 04, 2025

Hello.

Thank you for your swift reply and also for the script; it is what I was looking for.

The device is a Brother DCP-T520W, and once switched on, the OCR step stays enabled for all subsequently scanned documents.

I shall give your script a try. Thank you very much for posting it here; I am careful to avoid spaces in file names.

Best regards.

Community Expert, May 05, 2025

You're welcome.
