Participating Frequently
January 11, 2018
Question

Is there a way to search, highlight, extract, but instead of extracting the entire page, I only want to highlight/extract the sentence or paragraph the term is contained in?

I ultimately want to be able to export a "summary" into excel of a predefined set of search terms and pull the surrounding sentence or paragraph so that this exported list contains the context of how the term is being used in the document.

Is this possible??

Thanks,

7 replies

Participating Frequently
January 12, 2018

10-4 I’ll be in touch

Participating Frequently
January 12, 2018

Thank you all for the input

Participating Frequently
January 12, 2018

OK,

What about this idea:

Search, highlight, reference the markup location in the doc, and create an image at that location that's page width by x pixels tall.

These images would then be compiled into a new doc, which could then be OCR'd and exported to Excel or used in a PDF?

Easier, feasible?

Joel Geraci
Community Expert
January 12, 2018

There are a bunch of problems with the OCR approach which I won't go into.

If your ultimate goal is automation, you don't want to use Acrobat anyway. There are developer toolkits out there that will do page decomposition and convert the PDF drawing instructions into 99.9% correct reading order and/or can read the structure tags to get it 100% correct. They are expensive but if you are doing this manually now, it'll pay for itself in no time.

The Datalogics PDF Java Toolkit is one such library; it's technology from Adobe, marketed by Datalogics.

Participating Frequently
January 12, 2018

Thanks for all this info and feedback.

Would anything in the above change if the documents were in a somewhat structured format, i.e. technical specifications, which are typically generated and organized by spec-writing software to begin with?

thanks

Joel Geraci
Community Expert
January 12, 2018

Yes. It would help... there'd be fewer heuristics for the page decomposition, but even so it's still a ton of work.

Thom Parker
Community Expert
January 11, 2018

You'll find some free actions here that perform tasks similar to what you are asking about:

https://acrobatusers.com/actions-exchange

Or if you need a custom tool, it's what we do. Send me a message.

Thom Parker - Software Developer at PDFScripting
Use the Acrobat JavaScript Reference early and often
Thom Parker
Community Expert
January 11, 2018

There are a couple of things you can do along these lines. The first is to set up Acrobat/Reader so that the highlighted text is copied into the comment for the highlight annotation. You'll need to set the Acrobat Commenting preference that copies the selected text into the highlight comment. The Preferences dialog is accessed from the Edit menu.

After this you can get all the text by creating a comment summary, or by writing a script to collect all the text into a CSV file that can then be opened in Excel. Moving this data into a CSV file or some other storage location requires Acrobat Pro. The easiest solution is to create the CSV as a file attachment. You can find more info here: https://www.pdfscripting.com/public/ExcelAndAcrobat.cfm
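As a rough illustration of the script route, here is a minimal sketch, assuming the highlight comments already contain the copied text. The function names and the attachment name "highlights.csv" are made up for this example; `getAnnots` and `createDataObject` are Acrobat JavaScript Doc methods, so the last function only runs inside Acrobat.

```javascript
// Quote one CSV field: wrap it in double quotes and double any embedded quotes.
function csvField(value) {
  return '"' + String(value).replace(/"/g, '""') + '"';
}

// Build CSV text from highlight annotations. `annots` is whatever
// doc.getAnnots() returned: an array, or null when the file has no comments.
function highlightsToCsv(annots) {
  var rows = ["Page,Text"];
  for (var i = 0; annots && i < annots.length; i++) {
    var a = annots[i];
    if (a.type !== "Highlight") continue; // skip sticky notes, stamps, etc.
    // annot.page is zero-based, so add 1 for a human-readable page number.
    rows.push((a.page + 1) + "," + csvField(a.contents || ""));
  }
  return rows.join("\r\n");
}

// Acrobat-only wrapper (hypothetical helper): attach the CSV to the PDF.
function attachHighlightCsv(doc) {
  doc.createDataObject("highlights.csv", highlightsToCsv(doc.getAnnots()), "text/csv");
}
```

From the Acrobat JavaScript console this would be called as `attachHighlightCsv(this)`, after which the attachment can be saved out and opened in Excel.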

Thom Parker - Software Developer at PDFScripting
Use the Acrobat JavaScript Reference early and often
Participating Frequently
January 11, 2018

Thanks, I understand how to do this process manually. What I'm trying to understand is whether there is a way to automate searching for my list of terms and highlighting the surrounding context of each term, instead of doing it by hand, so I can create an export that shows the context in which each term was used.

I have Acrobat XI Pro.

I'm thinking there would be a way to script it somehow using redaction while referencing the tags, or objects, of the document, but my JavaScript skills are minimal at best.
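For what it's worth, the search-and-highlight half of this can be sketched in Acrobat JavaScript without redaction, using the Doc word-access methods (`getPageNumWords`, `getPageNthWord`, `getPageNthWordQuads`, `addAnnot`). This is only a sketch under assumptions: the helper names are invented, it handles single-word terms only, and it highlights just the matching word, not the surrounding sentence or paragraph.

```javascript
// Find every occurrence of each single-word term. `doc` is an Acrobat Doc
// object (or anything exposing numPages, getPageNumWords, getPageNthWord).
function findTermHits(doc, terms) {
  var wanted = {};
  for (var t = 0; t < terms.length; t++) wanted[terms[t].toLowerCase()] = true;
  var hits = [];
  for (var p = 0; p < doc.numPages; p++) {
    var n = doc.getPageNumWords(p);
    for (var w = 0; w < n; w++) {
      // Third argument true asks Acrobat to strip punctuation and whitespace.
      var word = doc.getPageNthWord(p, w, true);
      if (wanted[word.toLowerCase()] === true) {
        hits.push({ page: p, index: w, term: word });
      }
    }
  }
  return hits;
}

// Acrobat-only (hypothetical helper): add a highlight annotation over each hit.
function highlightHits(doc, hits) {
  for (var i = 0; i < hits.length; i++) {
    doc.addAnnot({
      page: hits[i].page,
      type: "Highlight",
      quads: doc.getPageNthWordQuads(hits[i].page, hits[i].index),
      contents: hits[i].term
    });
  }
}
```

In the console, `highlightHits(this, findTermHits(this, ["valve", "gasket"]))` would mark each match (the two terms are placeholders). Working out where the enclosing sentence or paragraph starts and ends is the part this does not attempt.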

Please let me know if anyone else has any ideas

Joel Geraci
Community Expert
January 11, 2018

You're just not going to be able to do this in a way that will work consistently across PDF files using JavaScript alone.

While it is possible to infer paragraph or sentence breaks on a page by page basis, consider the situation where a paragraph or sentence crosses a page boundary... or column break. It's not impossible but the page decomposition code is going to be overwhelming especially if you are new to Acrobat JavaScript. Acrobat JavaScript just doesn't have access to the structure information that would be helpful here.
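To make the page-boundary problem concrete, here is a naive sketch in plain JavaScript. It assumes the text of each page has already been extracted as a string in correct reading order, which is exactly the part that is hard in practice; the function name and the sentence-boundary rule are illustrative only.

```javascript
// Return the sentence containing the first occurrence of `term`, or null.
// `pageTexts` holds one string per page, assumed to be in reading order.
function sentenceAround(pageTexts, term) {
  // Join the pages first so a sentence split by a page break is seen whole.
  var text = pageTexts.join(" ").replace(/\s+/g, " ");
  var at = text.toLowerCase().indexOf(term.toLowerCase());
  if (at === -1) return null;
  // Naive boundary: '.', '!' or '?' followed by a space. Abbreviations such
  // as "e.g. " will fool this rule.
  var boundary = /[.!?] /g;
  var start = 0;
  var m;
  while ((m = boundary.exec(text)) !== null && m.index < at) {
    start = m.index + 2; // first character after the punctuation and space
  }
  // Look for the first boundary at or after the match to find the end.
  var end = text.length;
  boundary.lastIndex = at;
  m = boundary.exec(text);
  if (m !== null) end = m.index + 1; // keep the closing punctuation
  return text.slice(start, end).replace(/^\s+|\s+$/g, "");
}
```

Given the pages `"Intro text. The pressure relief"` and `"valve shall be tested annually. Next section."`, searching both pages together for "valve" returns the whole sentence, while searching the second page on its own returns only the fragment that starts at "valve". Column breaks, headers, footers and tables break the reading-order assumption in the same way.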

I'd suggest you pay someone to do this for you but that's what I do for a living and I wouldn't accept the job because, unless you have a very, very, very, limited set of PDF files to search, I'd never be able to create an acceptable deliverable... and I've been at this for 20 years.

That said, a C-based plugin does have access to the structure information and could be used to develop this solution but you'd need to hire someone with those skills. I'd recommend Thom Parker who is on this thread.

try67
Community Expert
January 11, 2018

In theory, yes, but identifying a paragraph in a PDF file is extremely tricky (and sometimes impossible).

I've developed for my customers tools that can do similar things (extract a certain range of words around a matching term, or even a whole sentence), so if you're interested in such a tool feel free to contact me privately (try6767 at gmail.com) so we could discuss it further.
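For readers wondering what "a certain range of words around a matching term" might look like, here is a generic sketch in plain JavaScript. It is not the tool mentioned above; the function name is invented, it works on already-extracted text, and it matches single-word terms written in ASCII letters and digits only.

```javascript
// Return up to `radius` words on each side of every match of `term`.
function wordsAround(text, term, radius) {
  var words = text.replace(/^\s+|\s+$/g, "").split(/\s+/);
  var needle = term.toLowerCase();
  var out = [];
  for (var i = 0; i < words.length; i++) {
    // Compare with leading/trailing punctuation removed, so "valve," matches.
    var bare = words[i].toLowerCase().replace(/^[^a-z0-9]+|[^a-z0-9]+$/g, "");
    if (bare !== needle) continue;
    var from = Math.max(0, i - radius);
    var to = Math.min(words.length, i + radius + 1);
    out.push(words.slice(from, to).join(" "));
  }
  return out;
}
```

Each returned string could become one row of the exported summary. A fixed word window sidesteps the sentence and paragraph detection problem, at the cost of sometimes cutting the context mid-thought.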