OCR Data extraction

New Here ,
Jul 04, 2017

Hello,

I have many PDFs that I open in Acrobat DC, and I always need the top table on the second page of each PDF. I store these values in Excel and later load them into a database.

Is there a way to automate this with JavaScript?

Ideally I just need a way to quickly get data from the top table on the second page.

Thanks

Correct answer by Dave Merchant | Most Valuable Participant

A reliable solution is very unlikely. In the API, the only way you can 'read' a word from a page is the doc.getPageNthWord() method, which only gets one word, based on the page's internal content ordering.

In theory if every table was identical you could read each word separately and rebuild them into a string, but when you OCR a document the concept of word order, and word breaks, is variable to say the least. If your word counts are different in each document you'd have no idea how many to read, and there's no way in JavaScript to work out what is and is not a table cell; you're not dealing with HTML.
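For illustration, here is a minimal sketch of the word-by-word approach described above. It assumes the page already has a recognized text layer and uses `doc.getPageNumWords()` together with `doc.getPageNthWord()`; the helper name `readPageWords` is hypothetical. The words come back in the page's internal content order, which is exactly the limitation mentioned above, so the result may not match the visual layout of the table.

```javascript
// Collect every word Acrobat reports for one page, in internal content order.
// `doc` is an Acrobat Doc object (for example `this` in the JavaScript console).
function readPageWords(doc, pageIndex) {
    var words = [];
    var count = doc.getPageNumWords(pageIndex);
    for (var i = 0; i < count; i++) {
        // Third argument true asks Acrobat to strip punctuation and white space.
        words.push(doc.getPageNthWord(pageIndex, i, true));
    }
    return words;
}
```

Page numbers are zero-based, so the second page would be read with something like `console.println(readPageWords(this, 1).join(" "));` in the Acrobat JavaScript console.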

TOPICS
Acrobat SDK and JavaScript

New Here ,
Jul 04, 2017

So I'm guessing it's not possible to do zone OCR with Acrobat DC? I've seen other software that lets you select specific zones in a PDF and create a form from them, which is then used as a template to extract the data in those zones from other forms.

Most Valuable Participant ,
Jul 04, 2017

No, it's not.

try67
Most Valuable Participant ,
Jul 04, 2017

You can't limit the OCR process to just a part of the page, but you can extract just the text from a pre-defined area.

That requires quite a complex script, though.
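As a rough sketch of what extracting text from a pre-defined area could look like (not the script referred to above), the idea is to ask Acrobat for each word's bounding quad with `doc.getPageNthWordQuads()` and keep only the words whose center falls inside a rectangle. The helper names `wordsInRect` and `groupIntoRows`, the shape of the `rect` object, and the row tolerance are all assumptions made for illustration. It also assumes each quad is an array of eight numbers (four x, y corner points) and that the page has already been through text recognition; rotated or cropped pages may need extra coordinate handling.

```javascript
// Return the words on a page whose centers fall inside `rect`.
// rect = { left, bottom, right, top } in the same units as the word quads
// (assumed here to be points with the origin at the bottom-left of the page).
function wordsInRect(doc, pageIndex, rect) {
    var hits = [];
    var count = doc.getPageNumWords(pageIndex);
    for (var i = 0; i < count; i++) {
        var quads = doc.getPageNthWordQuads(pageIndex, i);
        if (!quads || quads.length === 0) continue;
        var q = quads[0]; // assumed: eight numbers, four (x, y) corners
        var xs = [q[0], q[2], q[4], q[6]];
        var ys = [q[1], q[3], q[5], q[7]];
        // Use min/max so the result does not depend on corner ordering.
        var cx = (Math.min.apply(null, xs) + Math.max.apply(null, xs)) / 2;
        var cy = (Math.min.apply(null, ys) + Math.max.apply(null, ys)) / 2;
        if (cx >= rect.left && cx <= rect.right &&
            cy >= rect.bottom && cy <= rect.top) {
            hits.push({ text: doc.getPageNthWord(pageIndex, i, true), x: cx, y: cy });
        }
    }
    return hits;
}

// Group words into text rows: a word joins the current row when its center
// is within `tolerance` units vertically of that row's first word.
function groupIntoRows(hits, tolerance) {
    var sorted = hits.slice().sort(function (a, b) { return b.y - a.y; });
    var rows = [];
    for (var i = 0; i < sorted.length; i++) {
        var last = rows[rows.length - 1];
        if (last && Math.abs(last.y - sorted[i].y) <= tolerance) {
            last.words.push(sorted[i]);
        } else {
            rows.push({ y: sorted[i].y, words: [sorted[i]] });
        }
    }
    var lines = [];
    for (var r = 0; r < rows.length; r++) {
        // Order each row left to right before joining it into one string.
        rows[r].words.sort(function (a, b) { return a.x - b.x; });
        var texts = [];
        for (var w = 0; w < rows[r].words.length; w++) {
            texts.push(rows[r].words[w].text);
        }
        lines.push(texts.join(" "));
    }
    return lines;
}
```

In Acrobat this might be called as `groupIntoRows(wordsInRect(this, 1, { left: 36, bottom: 500, right: 576, top: 756 }), 3)`, where the rectangle values are placeholders to be measured on a sample page. This only recovers rows of text, not table cells; splitting a row into columns would still depend on knowing the column positions for that particular layout.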
