How to batch-convert PDF files to HTML using AppleScript (or any other means) on Mac OS

Report · Nov 09, 2019

I'm looking for ways to batch-convert PDFs to HTML using Acrobat Pro DC.
Basically I'm looking to automate the following: Acrobat > Menu: Export To > HTML Web Page > {Settings: Single HTML Page, Include Images, Recognize text if needed, Set Language}

This forums page from 2017 shows a promising AppleScript approach, but so far only the JPG exports are working.

I've been chasing down AppleScript (osascript), JXA (JavaScript for Automation, the JS counterpart to AppleScript), Acrobat JS, and the command line, but haven't cracked it yet.

Report · Nov 09, 2019

None of those are necessary. You can simply use the Action Wizard that's built into Acrobat Pro.

Create a new Action and add to it the Save command (from the Save & Export sub-panel) and then click on Specify Settings underneath it and select the following options:

Then save your Action and run it on your PDF files to convert them to HTML files. All done!

Report · Nov 09, 2019

Thanks try67

If necessary I'll go that route. The reason I was hoping to work out an externally-coded approach: we've got thousands of files, scattered across multiple volumes. I'd like the external code to read a file list and in turn generate the HTML files right next to the originals, ideally without ever having to open Acrobat directly.
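
For illustration, here is a rough sketch of just the bookkeeping half of that idea: reading a list of PDF paths and working out where each HTML file would go. It does not touch Acrobat at all; the conversion step itself is left out. The function names `plan_conversions` and `pending` are made up for this example, and the one-path-per-line list format (with `#` comments) is an assumption.

```python
from pathlib import Path
from typing import Iterable, Iterator, List, Tuple


def plan_conversions(pdf_list: Iterable[str]) -> Iterator[Tuple[Path, Path]]:
    """Yield (source PDF, target HTML) pairs, placing each HTML next to its PDF.

    Assumes one path per line; blank lines and lines starting with '#'
    are skipped, as are entries that do not end in .pdf (any case).
    """
    for line in pdf_list:
        entry = line.strip()
        if not entry or entry.startswith("#"):
            continue
        pdf = Path(entry)
        if pdf.suffix.lower() != ".pdf":
            continue
        # Same folder, same base name, .html extension.
        yield pdf, pdf.with_suffix(".html")


def pending(pairs: Iterable[Tuple[Path, Path]]) -> List[Tuple[Path, Path]]:
    """Keep only pairs whose PDF exists and whose HTML has not been written yet.

    This lets a long batch be stopped and resumed without redoing files.
    """
    return [(pdf, html) for pdf, html in pairs
            if pdf.is_file() and not html.exists()]
```

A possible way to use it: open a text file of paths (say, a hypothetical `pdf_list.txt`), pass the file object to `plan_conversions`, filter the result through `pending`, and hand each remaining pair to whatever does the actual export. Skipping pairs whose HTML already exists is a design choice that matters with thousands of files spread across several volumes, since a run can be interrupted and restarted.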

Report · Nov 09, 2019

If you're looking for a solution that works independently of Acrobat then a forum about Acrobat is not really the right place for your question...

Report · Nov 09, 2019

I may have stated it incorrectly -- Acrobat would be open, and the code interacts with it, but it runs externally, without having to manually tend to Acrobat.

Check out the forums page from 2017. It's a clean clear solution that would work perfectly, except for one minor issue: it fails in 2019.

I wonder if -- given the additional specs for HTML conversion (e.g. "recognize text if needed") -- osascript needs more specs in order to successfully process.

Report · Nov 09, 2019

Acrobat is not built (nor licensed) for that kind of operation, I'm afraid.

Report · Nov 09, 2019

Hmm -- works fine for JPG automation, albeit only for JPG automation, using precisely that osascript approach. That suggests it's built for it -- or at one point was built for it.

Report · Nov 09, 2019

Actually, I may have jumped the gun on all of this. I had test-run a bunch of documents using the manual PDF-to-HTML function, and got nearly perfect results every time: great OCR, excellent fidelity to the original scans, wrapped in HTML lending itself to automation. It was a lucky bunch of documents.

Using your Action approach to test another thousand documents, the results were mixed -- some good, some OK, many terrible to useless. Which is to say, OCR here in 2019 -- in Acrobat, Tesseract, ABBYY, Neat, and maybe OCR in general -- is still, at best, haltingly reliable.

Thanks try67 for the Actions hint. Will be useful in many other ways, but alas not for this particular effort.