Hello,
Is there a way of getting a full dump of a given PDF's text using some sort of API? Is this API available in Python? My goal is to iterate through a directory of PDF files and extract the text from every single file.
Do you have a subscription to Acrobat Pro and Visual Basic?
I also have this need, though ideally in Ruby. Even just a command-line option would work well for us. We need to be able to extract all text from a given set of PDF files into text files for processing into a database.
I don't know about Ruby, but this is a basic command that any decent PDF library probably has. I've developed standalone Java tools that can do it, for example.
I've tried various open source solutions, and none of them give me anything close to the output from Acrobat. Thus my question here.
So, do you have a subscription to Acrobat Pro, and do you have Visual Basic? Also, is this for server use?
Given that Acrobat doesn't run on Linux, this would be on a Mac. So, no, no VB. Yes, eventually, it would be server based.
OK, there is no useful external JavaScript interface on Mac, but that's pretty much irrelevant, as Acrobat is not for server use (it is neither technically suited to it nor permitted by the EULA).
What are you trying to match in text extraction? That is to say, which Acrobat text-extraction function are you comparing your libraries against, and what differences do you see?
It is possible; I did just this using VBScript. You have to use the JavaScript object returned by acroPDocObj.GetJSObject.
kierang28457521 wrote
Hello,
Is there a way of getting a full dump of a given PDF's text using some sort of API? Is this API available in Python? My goal is to iterate through a directory of PDF files and extract the text from every single file.
Dim acroAppObj : Set acroAppObj = CreateObject("AcroExch.App") ' As Acrobat.AcroApp
Dim acroADocObj : Set acroADocObj = Nothing ' As Acrobat.AcroAVDoc
Dim acroPDocObj ' As Acrobat.AcroPDDoc
Dim jsObj ' As Object
Dim fsObj : Set fsObj = CreateObject("Scripting.FileSystemObject")
Set acroADocObj = acroAppObj.GetActiveDoc ' or open pdf
Set acroPDocObj = acroADocObj.GetPDDoc
Set jsObj = acroPDocObj.GetJSObject
lFileName = lFilePrefix & lAcctNumber & ".txt" ' lFilePrefix and lAcctNumber are not defined in this excerpt
If fsObj.FileExists(lFileName) Then fsObj.DeleteFile lFileName
jsObj.SaveAs lFileName, "com.adobe.acrobat.accesstext" ' converts pdf to text
acroADocObj.Close False
Set acroADocObj = Nothing
Set acroPDocObj = Nothing
Set jsObj = Nothing
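For the "iterate through a directory" part of the original question, the loop itself can be kept separate from whatever does the extraction. Below is a minimal Python sketch of that idea, not an Acrobat API: `dump_pdf_text` and its `extract_text` callback are hypothetical names chosen for illustration. The callback stands in for whichever extractor you end up trusting (a PDF library, or a wrapper around the SaveAs call shown above), and is assumed to take a path and return the text as a string.

```python
from pathlib import Path
from typing import Callable, List


def dump_pdf_text(pdf_dir: Path, out_dir: Path,
                  extract_text: Callable[[Path], str]) -> List[Path]:
    """Write one .txt file per PDF found directly inside pdf_dir.

    extract_text is a hypothetical callback supplied by the caller;
    this function only handles the directory walk and the file output.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    # Sort for a stable processing order; match ".pdf" case-insensitively.
    for pdf_path in sorted(pdf_dir.iterdir()):
        if not pdf_path.is_file() or pdf_path.suffix.lower() != ".pdf":
            continue
        txt_path = out_dir / (pdf_path.stem + ".txt")
        txt_path.write_text(extract_text(pdf_path), encoding="utf-8")
        written.append(txt_path)
    return written
```

Keeping the extractor as a parameter means the same loop can be reused while you compare the output of different extraction methods against Acrobat's.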