Extracting text from PDF

Report · Jun 29, 2018

Hello,

Is there a way of getting a full dump of a given PDF's text using some sort of API? Is this API available in Python? My goal is to iterate through a directory of PDF files and extract the text from every single file.

Report · Jun 30, 2018

Do you have a subscription to Acrobat Pro and Visual Basic?

Report · Aug 14, 2018

I also have this need, though ideally in Ruby. Even just a command-line option would work well for us. We need to extract all the text from a given set of PDF files into text files, which we then process into a database.

Report · Aug 14, 2018

I don't know about Ruby, but this is a basic command that any decent PDF library probably has. I've developed standalone Java tools that can do it, for example.
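To illustrate the kind of library call meant here, below is a minimal Python sketch of the "iterate through a directory" goal from the original question. It assumes the third-party pypdf package (PdfReader and page.extract_text() are its calls); the function names extract_with_pypdf and dump_directory are made up for this example. It is only a sketch, and the output of an open source extractor will not necessarily match what Acrobat produces.

```python
from pathlib import Path
from typing import Callable, List


def extract_with_pypdf(pdf_path: Path) -> str:
    """Return the text of every page, using the third-party pypdf package."""
    # Assumption: pypdf is installed (pip install pypdf).
    from pypdf import PdfReader

    reader = PdfReader(str(pdf_path))
    # extract_text() may yield nothing for image-only pages, hence the "or".
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def dump_directory(
    src_dir,
    out_dir,
    extract: Callable[[Path], str] = extract_with_pypdf,
) -> List[Path]:
    """Write one .txt file per *.pdf found directly inside src_dir."""
    src, out = Path(src_dir), Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    # Note: the pattern is case-sensitive on case-sensitive file systems.
    for pdf_path in sorted(src.glob("*.pdf")):
        target = out / (pdf_path.stem + ".txt")
        target.write_text(extract(pdf_path), encoding="utf-8")
        written.append(target)
    return written
```

Calling dump_directory("pdfs", "text") would write pdfs/a.pdf to text/a.txt, and so on. The extractor is passed in as a parameter so that a different library can be swapped in without touching the loop.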

Report · Aug 14, 2018

I've tried various open source solutions, and none of them give me anything close to the output from Acrobat. Thus my question here.

Report · Aug 14, 2018

So, do you have a subscription to Acrobat Pro, and do you have Visual Basic? Also, is this for server use?

Report · Aug 14, 2018

Given that Acrobat doesn't run on Linux, this would be on a Mac. So, no, no VB. Yes, eventually, it would be server based.

Report · Aug 14, 2018

OK, there is no useful external JavaScript interface on Mac, but that's pretty much irrelevant, as Acrobat is not for server use: it is not technically suited to it, and the EULA does not permit it.

What are you trying to match in text extraction? That is to say, which Acrobat text-export function are you comparing your libraries against, and what differences do you see?
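One way to make "what differences do you see" concrete is to diff Acrobat's exported text against a library's output for the same PDF. The sketch below uses only Python's standard difflib module; the function name compare_extractions and the file labels acrobat.txt and library.txt are placeholders for this example, not anything from Acrobat.

```python
import difflib


def compare_extractions(reference_text: str, candidate_text: str, context: int = 1):
    """Return (line-level similarity ratio in [0, 1], unified diff lines)."""
    ref_lines = reference_text.splitlines()
    cand_lines = candidate_text.splitlines()
    # ratio() is 1.0 when both line sequences are identical.
    ratio = difflib.SequenceMatcher(None, ref_lines, cand_lines).ratio()
    diff = list(
        difflib.unified_diff(
            ref_lines,
            cand_lines,
            fromfile="acrobat.txt",   # placeholder label
            tofile="library.txt",     # placeholder label
            n=context,
            lineterm="",
        )
    )
    return ratio, diff
```

A low ratio with many small diff hunks tends to point at spacing or line-break differences, while whole missing blocks suggest the library is skipping content, which is useful detail to post when asking for help.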

Report · Aug 28, 2018

It is possible. I did exactly this using VBScript; you have to use the JavaScript object returned by acroPDocObj.GetJSObject.

kierang28457521 wrote
Hello,
Is there a way of getting a full dump of a given PDF's text using some sort of API? Is this API available in Python? My goal is to iterate through a directory of PDF files and extract the text from every single file.

' Requires Windows with Acrobat installed (COM automation).
Dim acroAppObj : Set acroAppObj = CreateObject("AcroExch.App") ' As Acrobat.AcroApp
Dim acroADocObj : Set acroADocObj = Nothing ' As Acrobat.AcroAVDoc
Dim acroPDocObj ' As Acrobat.AcroPDDoc
Dim jsObj ' As Object
Dim fsObj : Set fsObj = CreateObject("Scripting.FileSystemObject")
Dim lFileName

Set acroADocObj = acroAppObj.GetActiveDoc ' or open a PDF first
Set acroPDocObj = acroADocObj.GetPDDoc
Set jsObj = acroPDocObj.GetJSObject

' lFilePrefix and lAcctNumber are not defined in this snippet; set them before this line.
lFileName = lFilePrefix & lAcctNumber & ".txt"

' Remove any earlier output so the save does not collide with an existing file.
If fsObj.FileExists(lFileName) Then fsObj.DeleteFile lFileName

jsObj.SaveAs lFileName, "com.adobe.acrobat.accesstext" ' converts the PDF to text

' Close without saving changes, then release the COM references.
acroADocObj.Close False
Set acroADocObj = Nothing
Set acroPDocObj = Nothing
Set jsObj = Nothing
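Since the original question asked about Python, the same COM objects can in principle be driven from Python on Windows. The following is a rough sketch under several assumptions: the pywin32 package is installed, Acrobat is installed and registered for COM, and the document is opened directly with AcroExch.PDDoc and its Open method rather than through the active document as in the VBScript above. The function name acrobat_save_as_text and the dispatch parameter are invented for this example. The method is spelled saveAs here, following its Acrobat JavaScript name.

```python
def acrobat_save_as_text(pdf_path, txt_path, dispatch=None):
    """Ask Acrobat, via COM, to save one PDF as accessible text.

    Pass absolute paths. `dispatch` exists only so the COM factory can be
    replaced (for example in tests); by default pywin32's Dispatch is used.
    """
    if dispatch is None:
        # Assumption: pywin32 is installed and this runs on Windows.
        import win32com.client
        dispatch = win32com.client.Dispatch

    pd_doc = dispatch("AcroExch.PDDoc")
    if not pd_doc.Open(str(pdf_path)):
        raise OSError(f"Acrobat could not open {pdf_path}")
    try:
        js_obj = pd_doc.GetJSObject()
        # Same conversion ID as in the VBScript above.
        js_obj.saveAs(str(txt_path), "com.adobe.acrobat.accesstext")
    finally:
        # Always release the document, even if the conversion fails.
        pd_doc.Close()
```

Looping over a directory is then a matter of calling this function once per file. As noted earlier in the thread, this is a desktop approach only: it does not apply on Mac or Linux, and Acrobat is not meant for server use.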

Adobe Community
