Scripted OCR doesn't let me script finding text, manual OCR does
When I script the OCR of an image-only PDF, Acrobat creates bounding boxes, and FindText can't find text unless the cursor is inside the particular box that contains it.
However, if I OCR the file manually (Enhance Scans > Recognize Text > In this file > Settings > Output = Editable Text and Images, OK), the FindText command works.
Document is already open when I run this VBA script:
Set aApp = CreateObject("AcroExch.App")
Set aAVDoc = aApp.GetActiveDoc()
Set aPageView = aAVDoc.GetAVPageView()
Set aPdDoc = aAVDoc.GetPDDoc()
pageCount = aPdDoc.GetNumPages
' Get PDF OCR'd
For curPage = 0 To pageCount - 1
    aPageView.GoTo curPage
    aApp.MenuItemExecute ("TouchUp:EditDocument")
Next curPage
rtgFound = aAVDoc.FindText("accordingly", 0, 0, 1)
rtgFound is False. If I manually OCR the document and run this code:
Set aApp = CreateObject("AcroExch.App")
Set aAVDoc = aApp.GetActiveDoc()
Set aPageView = aAVDoc.GetAVPageView()
Set aPdDoc = aAVDoc.GetPDDoc()
pageCount = aPdDoc.GetNumPages
rtgFound = aAVDoc.FindText("accordingly", 0, 0, 1)
rtgFound is True. Is it possible to automate Acrobat to OCR into "Editable Text and Images"? That is currently the default UI setting, but it doesn't seem to make a difference.
If I have to search every one of the hundreds of little boxes, what would I have to loop through? Are there other options?
Many thanks!
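On the question of what to loop through: one option that avoids FindText entirely is to read the page text word by word through the Acrobat JavaScript bridge. The following is only a minimal sketch, not a confirmed fix. It assumes the standard IAC call PDDoc.GetJSObject and the Acrobat JavaScript methods getPageNumWords and getPageNthWord are available in your Acrobat version, and it is an open assumption that the text created by the scripted TouchUp:EditDocument step is visible to them. The Sub name FindWordViaJSObject is made up for illustration.

```vba
' Sketch only: assumes Acrobat is running with the target PDF as the active document.
' FindWordViaJSObject is a hypothetical name, not part of any Adobe API.
Sub FindWordViaJSObject()
    Dim aApp As Object, aAVDoc As Object, aPdDoc As Object, jso As Object
    Dim pageNum As Long, wordNum As Long, wordCount As Long
    Dim target As String, curWord As String

    target = "accordingly"

    Set aApp = CreateObject("AcroExch.App")
    Set aAVDoc = aApp.GetActiveDoc()
    Set aPdDoc = aAVDoc.GetPDDoc()

    ' Bridge to the Acrobat JavaScript Doc object for this PDDoc.
    Set jso = aPdDoc.GetJSObject()

    For pageNum = 0 To aPdDoc.GetNumPages() - 1
        wordCount = jso.getPageNumWords(pageNum)
        For wordNum = 0 To wordCount - 1
            ' Third argument True strips punctuation and whitespace from the word.
            curWord = jso.getPageNthWord(pageNum, wordNum, True)
            If StrComp(curWord, target, vbTextCompare) = 0 Then
                Debug.Print "Found on page " & (pageNum + 1) & ", word index " & wordNum
                Exit Sub
            End If
        Next wordNum
    Next pageNum

    Debug.Print "Not found"
End Sub
```

If this reports "Not found" on a file where a manual OCR succeeds, that would suggest the scripted step is not producing a searchable text layer at all, rather than FindText failing to see it.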
As far as I know, there is no documented (and therefore supported) method to run OCR via the IAC interface. What you are trying to do relies on a side effect of the menu item you are executing. Chances are that it was never designed to work the way you are hoping it would.
There should not be any difference between running OCR manually and triggering it by trying to edit text on a page, at least as long as you are not trying to automate this last step. What is probably happening is that Acrobat has some information cached in the AVDoc that does not get updated when you trigger OCR via the menu item. What I would try is to save the document, open it again, and then see if the FindText function works.
Unfortunately, saving and re-opening did not do the trick. I inserted this section before the FindText line:
curDocName = aPdDoc.GetFileName
aPdDoc.Save PDSaveFull, FilePath & curDocName
aAVDoc.Close True
aAVDoc.Open FilePath & curDocName, ""
Set aAVDoc = aApp.GetActiveDoc()
A manual save and re-open did not work either.
It would be really nice to have a supported method to automate OCR.
Hi Karl, I'm new to the support community so I hope I'm using the appropriate route to ask this related question:
Is there any way to have Acrobat automatically run OCR before saving the PDF? Is there a setting in the main program, or is there any method available using the SDK via VBA or Python? It seems odd that the 'Sentinel' software package for managing text files would not have a means of automating OCR.
Scott
There does not seem to be any programming interface to OCR in Acrobat. I think this is specifically to stop attempts to use it for the sort of volume work it would be very bad at.