
How to batch-convert PDF files to HTML using AppleScript (or any other means) on Mac OS

Enthusiast ,
Nov 09, 2019

Looking for ways to batch-convert PDFs to HTML using Acrobat Pro DC.
Basically, I'm looking to automate the following: Acrobat > Menu: Export To > HTML Web Page > {Settings: Single HTML Page, Include Images, Recognize text if needed, Set Language}

 

This forums page from 2017 shows a promising AppleScript approach, but so far only the JPG exports are working.

 

Been chasing down AppleScript (osascript), JXA (AppleScript's flavor of JS), Acrobat JS, and Command Line, but haven't cracked it yet.

TOPICS
Acrobat SDK and JavaScript
5.8K
Translate
Report
Community guidelines
Be kind and respectful, give credit to the original source of content, and search for duplicates before posting. Learn more
community guidelines
Community Expert ,
Nov 09, 2019

None of those are necessary. You can simply use the Action Wizard that's built into Acrobat Pro.

Create a new Action and add to it the Save command (from the Save & Export sub-panel) and then click on Specify Settings underneath it and select the following options:

 

[Image: Snap3.png]

 

Then save your Action and run it on your PDF files to convert them to HTML files. All done!

 

Enthusiast ,
Nov 09, 2019

Thanks try67

 

If necessary I'll go that route. The reason I was hoping to work out an externally coded approach: we've got thousands of files, scattered across multiple volumes. I'd like the external code to read a file list and in turn generate the HTML files right next to the originals, ideally without ever having to open Acrobat directly.
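To make the "read a file list, put each HTML right next to its original" part concrete, here is a minimal Python sketch. It only plans the work: it walks the given folders, finds PDFs, and works out where each HTML file would go, skipping PDFs that already have one. It does not call Acrobat at all; the function name plan_conversions and the choice of a ".html" suffix beside the PDF are my own assumptions, not anything Acrobat prescribes.

```python
from pathlib import Path
from typing import Iterable, Iterator, Tuple


def plan_conversions(roots: Iterable[Path]) -> Iterator[Tuple[Path, Path]]:
    """Yield (pdf, html) pairs for PDFs that have no HTML sibling yet.

    Hypothetical helper: it only computes paths; the actual conversion
    step (Acrobat or anything else) is left to the caller.
    """
    for root in roots:
        if not root.is_dir():
            # Skip volumes that are not mounted or paths that do not exist.
            continue
        for pdf in sorted(root.rglob("*")):
            # Match the extension case-insensitively (.pdf, .PDF, ...).
            if pdf.is_file() and pdf.suffix.lower() == ".pdf":
                html = pdf.with_suffix(".html")  # same folder, same base name
                if not html.exists():
                    yield pdf, html
```

Calling plan_conversions on a list of mounted volume paths gives back the pairs still to be converted, so an interrupted run can be restarted without redoing finished files.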

Community Expert ,
Nov 09, 2019

If you're looking for a solution that works independently of Acrobat then a forum about Acrobat is not really the right place for your question...

Enthusiast ,
Nov 09, 2019

I may have stated it incorrectly -- Acrobat would be open, and the code interacts with it, but it runs externally, without having to manually tend to Acrobat.

 

Check out the forums page from 2017. It's a clean, clear solution that would work perfectly, except for one minor issue: it fails in 2019.

 

I wonder if -- given the additional specs for HTML conversion (e.g. "recognize text if needed") -- osascript needs more specs in order to successfully process.

Community Expert ,
Nov 09, 2019

Acrobat is not built (nor licensed) for that kind of operation, I'm afraid.

Enthusiast ,
Nov 09, 2019

Hmm -- works fine for JPG automation, albeit only for JPG automation, using precisely that osascript approach.  That suggests it's built for it -- or at one point was built for it.

Enthusiast ,
Nov 09, 2019

Actually, I may have jumped the gun on all of this. I had test-run a bunch of documents using the manual PDF-to-HTML function, and got nearly perfect results every time: great OCR, excellent fidelity to the original scans, wrapped in HTML lending itself to automation. It was a lucky bunch of documents.

 

Using your action approach to test another thousand documents, the results were mixed -- some good, some ok, many terrible to useless. Which is to say, OCR here in 2019, in Acrobat, Tesseract, ABBYY, Neat -- and maybe OCR in general -- is still, at best, haltingly reliable.
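For anyone sorting a mixed batch like this, one rough way to separate "good" from "useless" output without opening every file is to measure how much of the extracted text looks like real words. The sketch below is only a crude heuristic under my own assumptions: the name wordlike_ratio is made up, "word-like" just means a token that is purely alphabetic once surrounding punctuation is stripped, and any cutoff you pick would need tuning against your own documents.

```python
import string
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text content of an HTML document, ignoring the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Note: this also receives script/style content, if any is present.
        self.chunks.append(data)


def wordlike_ratio(html: str) -> float:
    """Return the fraction of tokens that are purely alphabetic.

    Hypothetical quality score: values near 1.0 suggest readable text,
    low values suggest OCR noise. Returns 0.0 when there is no text.
    """
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    tokens = " ".join(parser.chunks).split()
    if not tokens:
        return 0.0
    wordlike = sum(1 for t in tokens if t.strip(string.punctuation).isalpha())
    return wordlike / len(tokens)
```

Scanned pages full of numbers or tables will score low even when the OCR is fine, so treat the score as a way to prioritize manual review rather than a verdict.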

 

Thanks try67 for the Actions hint.  Will be useful in many other ways, but alas not for this particular effort.

 
