Extracting text from PDF

New Here, Jun 29, 2018

Hello,

Is there a way of getting a full dump of a given PDF's text using some sort of API? Is this API available in Python? My goal is to iterate through a directory of PDF files and extract the text from every file.

TOPICS
Acrobat SDK and JavaScript

Views

1.8K

Community Guidelines
Be kind and respectful, give credit to the original source of content, and search for duplicates before posting. Learn more
community guidelines
Most Valuable Participant, Jun 30, 2018

Do you have a subscription to Acrobat Pro and Visual Basic?

New Here, Aug 14, 2018

I also have this need, though ideally in Ruby. Even just a command-line option would work well for us. We need to extract all text from a given set of PDF files into text files for processing into a database.

Most Valuable Participant, Aug 14, 2018

I don't know about Ruby, but this is a basic command that any decent PDF library probably has. I've developed standalone Java tools that can do it, for example.
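
For the command-line route, here is a minimal Python sketch of the batch job described above. It assumes Poppler's pdftotext tool is installed and on the PATH; that tool is not named in this thread, and being open source its output may well differ from Acrobat's. The function names pdftotext_extract and extract_directory are illustrative only. The extractor is passed in as a parameter so a different library can be swapped in without touching the directory loop.

```python
import subprocess
from pathlib import Path


def pdftotext_extract(pdf_path):
    """Return the text of one PDF via Poppler's pdftotext CLI (assumed installed)."""
    result = subprocess.run(
        # "-" as the output file sends the extracted text to stdout.
        ["pdftotext", "-layout", "-enc", "UTF-8", str(pdf_path), "-"],
        check=True,
        capture_output=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def extract_directory(src_dir, out_dir, extract=pdftotext_extract):
    """Write one .txt file per *.pdf in src_dir and return the paths written."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    # Note: the "*.pdf" pattern is case-sensitive on most non-Windows systems.
    for pdf_path in sorted(Path(src_dir).glob("*.pdf")):
        txt_path = out_dir / (pdf_path.stem + ".txt")
        txt_path.write_text(extract(pdf_path), encoding="utf-8")
        written.append(txt_path)
    return written
```

Calling extract_directory("pdfs", "texts") would then leave one text file per PDF in the (hypothetical) texts folder, ready for loading into a database.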

New Here, Aug 14, 2018

I've tried various open source solutions, and none of them gives me anything close to the output from Acrobat. Thus my question here.

Most Valuable Participant, Aug 14, 2018

So, do you have a subscription to Acrobat Pro, and do you have Visual Basic? Also, is this for server use?

New Here, Aug 14, 2018

Given that Acrobat doesn't run on Linux, this would be on a Mac.  So, no, no VB.  Yes, eventually, it would be server based.

Most Valuable Participant, Aug 14, 2018

OK, there is no useful external JavaScript interface on Mac, but that's largely irrelevant, as Acrobat is not for server use (it is neither technically suited to it nor permitted by the EULA).

What are you trying to match in text extraction? That is to say, which Acrobat text-extraction function are you comparing your libraries against, and what differences do you see?

New Here, Aug 28, 2018

It is possible. I did just this using VBScript; you have to use the JavaScript object returned by acroPDocObj.GetJSObject.

kierang28457521  wrote

Hello,

Is there a way of getting a full dump of a given PDF's text using some sort of API? Is this API available in Python? My goal is to iterate through a directory of PDF files and extract the text from every file.

' Uses Acrobat's COM (OLE automation) interface, so this runs on Windows only.
Dim acroAppObj : Set acroAppObj = CreateObject("AcroExch.App") ' As Acrobat.AcroApp
Dim acroADocObj : Set acroADocObj = Nothing ' As Acrobat.AcroAVDoc
Dim acroPDocObj ' As Acrobat.AcroPDDoc
Dim jsObj ' As Object
Dim fsObj : Set fsObj = CreateObject("Scripting.FileSystemObject")

' Get the document currently open in Acrobat (or open a PDF here instead)
Set acroADocObj = acroAppObj.GetActiveDoc
Set acroPDocObj = acroADocObj.GetPDDoc
Set jsObj = acroPDocObj.GetJSObject

' lFilePrefix and lAcctNumber are assumed to be set elsewhere in the script
lFileName = lFilePrefix & lAcctNumber & ".txt"
If fsObj.FileExists(lFileName) Then fsObj.DeleteFile lFileName
jsObj.SaveAs lFileName, "com.adobe.acrobat.accesstext" ' converts PDF to text

' Close without saving changes, then release the COM references
acroADocObj.Close False
Set acroADocObj = Nothing
Set acroPDocObj = Nothing
Set jsObj = Nothing
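
Since the original question asked about Python and a whole directory, here is a rough sketch of the same GetJSObject/SaveAs approach driven from Python. It is an untested assumption that the pywin32 package (win32com.client) is the bridge used; pywin32 is not mentioned in this thread. Like the VBScript, it needs Windows with Acrobat Pro installed, and the point made earlier about server use and the EULA still applies. The name convert_directory and the dispatch parameter are illustrative; dispatch exists only so the loop can be exercised without Acrobat.

```python
from pathlib import Path


def convert_directory(src_dir, out_dir, dispatch=None,
                      conv_id="com.adobe.acrobat.accesstext"):
    """Save every *.pdf in src_dir as text through Acrobat's COM interface."""
    if dispatch is None:
        # Assumption: pywin32 is installed; imported lazily so the module
        # still loads on machines without it.
        from win32com.client import Dispatch as dispatch
    out_dir = Path(out_dir).resolve()
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    for pdf_path in sorted(Path(src_dir).resolve().glob("*.pdf")):
        pd_doc = dispatch("AcroExch.PDDoc")
        if not pd_doc.Open(str(pdf_path)):
            continue  # Acrobat could not open this file; skip it
        try:
            txt_path = out_dir / (pdf_path.stem + ".txt")
            if txt_path.exists():
                txt_path.unlink()  # mirrors the FileExists/DeleteFile step
            js_obj = pd_doc.GetJSObject()
            js_obj.SaveAs(str(txt_path), conv_id)  # same call as the VBScript
            saved.append(txt_path)
        finally:
            pd_doc.Close()
    return saved
```

Opening each file as a PDDoc, rather than taking the active document, avoids needing a visible Acrobat window for every file.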
