Extracting text from PDF

June 29, 2018

Hello,

Is there a way of getting a full dump of a given PDF's text using some sort of API? Is this API available in Python? My goal is to iterate through a directory of PDF files and extract the text from every single file.

Acrobat SDK and JavaScript


arranp65466026

Participant

It is. I did exactly this using VBScript; you have to use the JavaScript object returned by acroPDocObj.GetJSObject.

kierang28457521 wrote
Hello,
Is there a way of getting a full dump of a given PDF's text using some sort of API? Is this API available in Python? My goal is to iterate through a directory of PDF files and extract the text from every single file.

Dim acroAppObj : Set acroAppObj = CreateObject("AcroExch.App") ' As Acrobat.AcroApp
Dim acroADocObj : Set acroADocObj = Nothing ' As Acrobat.AcroAVDoc
Dim acroPDocObj ' As Acrobat.AcroPDDoc
Dim jsObj ' As Object
Dim fsObj : Set fsObj = CreateObject("Scripting.FileSystemObject")

' Get the document that is currently active in Acrobat
Set acroADocObj = acroAppObj.GetActiveDoc ' or open pdf
Set acroPDocObj = acroADocObj.GetPDDoc
Set jsObj = acroPDocObj.GetJSObject

' lFilePrefix and lAcctNumber are not defined in this snippet; they are assumed to be set elsewhere
lFileName = lFilePrefix & lAcctNumber & ".txt"
If fsObj.FileExists(lFileName) Then fsObj.DeleteFile lFileName

jsObj.SaveAs lFileName, "com.adobe.acrobat.accesstext" ' converts pdf to text

' Close the document and release the COM references
acroADocObj.Close False
Set acroADocObj = Nothing
Set acroPDocObj = Nothing
Set jsObj = Nothing
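
Since the question asked about Python, here is a rough sketch of the same idea in Python. It assumes Windows, an installed Acrobat (not the free Reader), and the third-party pywin32 package. The function name save_pdf_as_text and the dispatch parameter are made up for illustration. It opens a file with AcroExch.AVDoc rather than using the active document, and I have not verified it against every Acrobat version.

```python
def save_pdf_as_text(pdf_path, txt_path, dispatch=None):
    """Open pdf_path in Acrobat through COM and save it as accessible text."""
    if dispatch is None:
        # pywin32 is third-party and Windows-only, so import it lazily.
        from win32com.client import Dispatch as dispatch

    av_doc = dispatch("AcroExch.AVDoc")
    # The second argument to Open is a temporary window title; empty is fine.
    if not av_doc.Open(pdf_path, ""):
        raise RuntimeError(f"Acrobat could not open {pdf_path}")
    try:
        # Same chain as the VBScript: AVDoc -> PDDoc -> JavaScript object.
        js_obj = av_doc.GetPDDoc().GetJSObject()
        js_obj.SaveAs(txt_path, "com.adobe.acrobat.accesstext")
    finally:
        av_doc.Close(True)  # True: close without saving changes
```

Calling save_pdf_as_text(r"C:\in\a.pdf", r"C:\out\a.txt") would then write one text file; wrapping that in a loop over a folder gives the batch behaviour asked for.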


Test Screen Name

Legend

Ok, no useful external JavaScript interface on Mac, but that's pretty much irrelevant as Acrobat is not for server use (neither technically nor permitted by the EULA).

What are you trying to match in text extraction - that is to say, which Acrobat function to get text are you comparing with your libraries? And what differences do you see?


Test Screen Name

Legend

So, do you have a subscription to Acrobat Pro, and do you have Visual Basic? Also, is this for server use?


markm74932579

Participant

Given that Acrobat doesn't run on Linux, this would be on a Mac. So, no, no VB. Yes, eventually, it would be server based.


markm74932579

Participant

I also have this need, though ideally in Ruby. Even just a command-line option would work well for our needs. We need to be able to extract all text from a given set of PDF files into text files for processing into a database.
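
For the command-line route, one option outside Acrobat is the pdftotext tool that ships with Poppler and Xpdf. This is an assumption about what is installed, not something Acrobat provides. The sketch below is in Python, but any language that can start a process (Ruby included) could run the same command. The helper names pdftotext_command and convert_all are made up for illustration.

```python
import subprocess
from pathlib import Path


def pdftotext_command(pdf_path, layout=False):
    """Build the pdftotext argument list that writes <name>.txt next to the PDF."""
    pdf_path = Path(pdf_path)
    cmd = ["pdftotext"]
    if layout:
        # -layout asks pdftotext to keep the original physical layout.
        cmd.append("-layout")
    cmd += [str(pdf_path), str(pdf_path.with_suffix(".txt"))]
    return cmd


def convert_all(pdf_paths):
    """Run pdftotext once per file; raises CalledProcessError on failure."""
    for pdf in pdf_paths:
        subprocess.run(pdftotext_command(pdf), check=True)
```

The resulting .txt files can then be loaded into the database by whatever import step you already have.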

try67

Community Expert

I don't know about Ruby, but this is a basic command that any decent PDF library probably has. I've developed standalone Java tools that can do it, for example.
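
As an illustration of the library route in Python (which the original question asked about), here is a minimal sketch that walks a directory and writes one text file per PDF. It assumes the third-party pypdf package as the default extractor. The function names extract_with_pypdf and dump_directory are made up for this example, and any other extractor can be passed in instead.

```python
from pathlib import Path


def extract_with_pypdf(pdf_path):
    """Return the text of every page, joined by newlines (assumes pypdf is installed)."""
    from pypdf import PdfReader  # third-party; imported lazily

    reader = PdfReader(pdf_path)
    # extract_text() can return an empty result for image-only pages.
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def dump_directory(directory, extract=extract_with_pypdf):
    """Write <name>.txt next to each *.pdf in directory; return the paths written."""
    written = []
    for pdf in sorted(Path(directory).glob("*.pdf")):
        txt = pdf.with_suffix(".txt")
        txt.write_text(extract(pdf), encoding="utf-8")
        written.append(txt)
    return written
```

Scanned PDFs without a text layer will produce empty output with this approach, and the results will not necessarily match what Acrobat's own export gives.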