kierang28457521
Participant
June 29, 2018
Question

Extracting text from PDF

Hello,

Is there a way of getting a full dump of a given PDF's text using some sort of API? Is this API available in Python? My goal is to iterate through a directory of PDF files and extract the text from every single file.

This topic has been closed for replies.

5 replies

Participant
August 29, 2018

It is. I did exactly this using VBScript; you have to use the JavaScript object returned by acroPDocObj.GetJSObject.

Dim acroAppObj : Set acroAppObj = CreateObject("AcroExch.App") ' As Acrobat.AcroApp
Dim acroADocObj : Set acroADocObj = Nothing ' As Acrobat.AcroAVDoc
Dim acroPDocObj ' As Acrobat.AcroPDDoc
Dim jsObj ' As Object
Dim fsObj : Set fsObj = CreateObject("Scripting.FileSystemObject")

Set acroADocObj = acroAppObj.GetActiveDoc   ' or open pdf
Set acroPDocObj = acroADocObj.GetPDDoc
Set jsObj = acroPDocObj.GetJSObject

' lFilePrefix and lAcctNumber are not defined in this snippet; set them before this point
lFileName = lFilePrefix & lAcctNumber & ".txt"
If fsObj.FileExists(lFileName) Then fsObj.DeleteFile lFileName

jsObj.SaveAs lFileName, "com.adobe.acrobat.accesstext"    ' converts pdf to text

acroADocObj.Close False

' release the COM references
Set acroADocObj = Nothing
Set acroPDocObj = Nothing
Set jsObj = Nothing
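
For the Python side of the original question, the same calls can in principle be driven through COM. What follows is only a rough sketch under several assumptions that the thread does not confirm: Windows, a full Acrobat installation, and the third-party pywin32 package. The function names `acrobat_to_text` and `extract_directory` are made up for illustration. The directory loop takes the converter as a parameter, so it can be tried out without Acrobat.

```python
import os
from pathlib import Path
from typing import Callable, List


def acrobat_to_text(pdf_path: Path, txt_path: Path) -> None:
    """Convert one PDF to text through Acrobat's COM interface.

    Assumption: Windows, Acrobat and pywin32 are available. Mirrors the
    VBScript above (PDDoc -> GetJSObject -> SaveAs).
    """
    import win32com.client  # imported lazily: only exists on Windows

    pd_doc = win32com.client.Dispatch("AcroExch.PDDoc")
    if not pd_doc.Open(os.path.abspath(pdf_path)):
        raise RuntimeError(f"Acrobat could not open {pdf_path}")
    try:
        js_obj = pd_doc.GetJSObject()
        # Same conversion ID as in the VBScript reply.
        js_obj.SaveAs(os.path.abspath(txt_path), "com.adobe.acrobat.accesstext")
    finally:
        pd_doc.Close()


def extract_directory(
    src_dir: Path,
    out_dir: Path,
    convert: Callable[[Path, Path], None],
) -> List[Path]:
    """Run `convert` on every *.pdf in src_dir; return the .txt paths used."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for pdf_path in sorted(src_dir.glob("*.pdf")):
        txt_path = out_dir / (pdf_path.stem + ".txt")
        if txt_path.exists():
            txt_path.unlink()  # same idea as the DeleteFile step above
        convert(pdf_path, txt_path)
        written.append(txt_path)
    return written
```

With Acrobat available, the call would look like `extract_directory(Path("pdfs"), Path("text"), acrobat_to_text)`, where both directory names are placeholders.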

Legend
August 14, 2018

OK, so there is no useful external JavaScript interface on the Mac, but that's pretty much irrelevant, as Acrobat is not for server use: it is neither technically suited to it nor permitted by the EULA.

What are you trying to match in text extraction? That is to say, which Acrobat text-extraction function are you comparing your libraries' output against? And what differences do you see?

Legend
August 14, 2018

So, do you have a subscription to Acrobat Pro, and do you have Visual Basic? Also, is this for server use?

Participant
August 14, 2018

Given that Acrobat doesn't run on Linux, this would be on a Mac. So, no, no VB. Yes, eventually, it would be server-based.

Participant
August 14, 2018

I also have this need, though ideally in Ruby. Even just a command-line option would work well for our needs. We need to be able to extract all text from a given set of PDF files into text files for processing into a database.

try67
Community Expert
August 14, 2018

I don't know about Ruby, but this is a basic command that any decent PDF library probably has. I've developed standalone Java tools that can do it, for example.
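
As one illustration of such a basic command, Poppler's `pdftotext` command-line tool writes a PDF's text to a file. The Python sketch below assumes `pdftotext` is installed and on the PATH, which nobody in this thread has stated. The helper names are hypothetical, and the output will not necessarily match what Acrobat produces.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List


def pdftotext_command(pdf_path: Path, txt_path: Path, layout: bool = True) -> List[str]:
    """Build the pdftotext argument list (assumes Poppler's pdftotext)."""
    cmd = ["pdftotext", "-enc", "UTF-8"]
    if layout:
        cmd.append("-layout")  # try to keep the physical page layout
    cmd += [str(pdf_path), str(txt_path)]
    return cmd


def pdf_to_text(pdf_path: Path, txt_path: Path, layout: bool = True) -> None:
    """Run pdftotext on one file; raises if the tool is missing or fails."""
    if shutil.which("pdftotext") is None:
        raise FileNotFoundError("pdftotext was not found on PATH")
    subprocess.run(pdftotext_command(pdf_path, txt_path, layout), check=True)
```

Because the command is built as a plain list, the same arguments could be passed from Ruby or a shell script just as easily.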

Participant
August 14, 2018

I've tried various open source solutions, and none of them give me anything close to the output from Acrobat.  Thus my question here.

Legend
June 30, 2018

Do you have a subscription to Acrobat Pro and Visual Basic?