Participating Frequently
January 11, 2018
Question

Is there a way to search, highlight, extract, but instead of extracting the entire page, I only want to highlight/extract the sentence or paragraph the term is contained in?

I ultimately want to be able to export a "summary" into Excel for a predefined set of search terms, pulling the surrounding sentence or paragraph so that the exported list shows the context in which each term is used in the document.

Is this possible??

Thanks,

7 replies

Participating Frequently
January 12, 2018

10-4 I’ll be in touch

Participating Frequently
January 12, 2018

Thank you all for the input

Participating Frequently
January 12, 2018

OK, what about this idea:

Search, highlight, reference the markup location in the doc, and create an image at that location that's page width by X number of pixels tall.

These would then be compiled into a new doc, which could then be OCR'd and exported to Excel or used as a PDF?

Easier? Feasible?

Joel Geraci
Community Expert
January 12, 2018

There are a bunch of problems with the OCR approach which I won't go into.

If your ultimate goal is automation, you don't want to use Acrobat anyway. There are developer toolkits out there that will do page decomposition and convert the PDF drawing instructions into 99.9% correct reading order and/or can read the structure tags to get it 100% correct. They are expensive but if you are doing this manually now, it'll pay for itself in no time.

The Datalogics PDF Java Toolkit is one such library; it's technology from Adobe, marketed by Datalogics.

Participating Frequently
January 12, 2018

Thanks for all this info and feedback.

Would anything in the above change if the documents were in a somewhat structured format, i.e. the documents would be technical specifications, which are typically generated and organized by spec-writing software to begin with?

thanks

Joel Geraci
Community Expert
January 12, 2018

Yes. It would help... there'd be fewer heuristics for the page decomposition, but even so it's still a ton of work.

Thom Parker
Community Expert
January 11, 2018

You'll find some free actions here that perform tasks similar to what you are asking about:

https://acrobatusers.com/actions-exchange

Or if you need a custom tool, it's what we do. Send me a message.

Thom Parker - Software Developer at PDFScripting
Use the Acrobat JavaScript Reference early and often
Thom Parker
Community Expert
January 11, 2018

There are a couple of things you can do along these lines. The first is to set up Acrobat/Reader so that the highlighted text is copied into the comment for the highlight annotation. You'll need to set the Acrobat Commenting preference for this; the Preferences dialog is accessed from the Edit menu.

After this you can get all the text by creating a comment summary, or by writing a script to collect all the text into a CSV file that could then be opened in Excel. Moving this data into a CSV file or some other storage location requires Acrobat Pro. The easiest solution is to create the CSV as a file attachment. You can find more info here: https://www.pdfscripting.com/public/ExcelAndAcrobat.cfm
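
To illustrate just the CSV step, here is a minimal sketch in Python rather than Acrobat JavaScript. It assumes the highlighted text has already been collected as (page, term, snippet) tuples; the function name `snippets_to_csv` and the column names are hypothetical, and this is not the script from the link above.

```python
import csv
import io


def snippets_to_csv(rows):
    """Serialize (page, term, snippet) tuples to CSV text with a header row."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer)
    writer.writerow(["page", "term", "snippet"])
    for page, term, snippet in rows:
        # Collapse line breaks so each snippet stays on one spreadsheet row.
        writer.writerow([page, term, " ".join(snippet.split())])
    return buffer.getvalue()


if __name__ == "__main__":
    # Hypothetical sample data, standing in for collected highlight text.
    sample = [(12, "valve", "Each valve, once installed,\nshall be tagged.")]
    print(snippets_to_csv(sample))
```

Letting the `csv` module handle quoting matters here, because snippets pulled from a spec will usually contain commas and quotation marks.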

Participating Frequently
January 12, 2018

Thom,

I am using a version of the search, highlight, action that you put together. (thanks for this by the way)

My question is: would it be possible to easily modify this script to replace the redact annotation with a shape annotation, such as a rectangle, and set its size properties? This would be done in lieu of highlighting, obviously, but my thought is that this box could be used to capture the underlying text much like a highlight would, while also capturing the additional context I'm after.

If the box could be located over the searched term and given dimensions of full page width by X height, it would do exactly what I'm looking for.

Let me know if anyone thinks this would be a viable option.

Thanks,

try67
Community Expert
January 12, 2018

Thanks, is this a terribly difficult thing to do?

I really don't care much about full sentences or paragraphs but rather I just want to be able to "summarize" a 1,000 page pdf into 20-30 pages of "snippets" that reflect where and how my list of search terms are being used in the document.  Being that the document will be a technical spec, the context of the use of the term will be easily identified in a very small amount of adjacent text.

It is also easy enough to dig deeper into the anomalies that are very clear in this type of summarized data.

I think the extraction function of the above method is already built into Acrobat's functionality, because it will grab the text within the "box" the same way as highlighting the text does. Once it's annotated this way, there are lots of easy ways to get these annotations into other usable formats such as Excel.

I just have no idea how to replace a highlight annotation with a specific sized box.

Thanks,


You're making a lot of assumptions there, and they're not very well founded.

Yes, it's fairly easy to take the coordinates of a word and enlarge them to a larger "box", but extracting the text within that box is not that simple at all. There's no relation (or very little relation) between what Acrobat can do on its own and what can be done using a script. That's not to say it's impossible, but it's not a trivial task, and there might be a lot of issues if you go beyond the line where the match was found.
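
A rough sketch of both halves, in Python rather than Acrobat JavaScript, to show where the difficulty sits. It assumes rectangles are (left, bottom, right, top) tuples in PDF user space with the origin at the bottom-left, and that a list of (word, rectangle) pairs is already available from some extraction step; `full_width_band` and `words_in_band` are hypothetical helper names.

```python
def full_width_band(word_rect, page_rect, band_height):
    """Grow a word's rectangle to a full-page-width band centered on the word."""
    _, bottom, _, top = word_rect
    page_left, page_bottom, page_right, page_top = page_rect
    center_y = (bottom + top) / 2
    # Clamp the band so it never extends past the page edges.
    band_bottom = max(page_bottom, center_y - band_height / 2)
    band_top = min(page_top, center_y + band_height / 2)
    return (page_left, band_bottom, page_right, band_top)


def words_in_band(words, band):
    """Naively collect words whose vertical center lies inside the band.

    Sorting top-to-bottom, then left-to-right, only approximates reading
    order; columns, tables and uneven baselines will break it.
    """
    _, band_bottom, _, band_top = band
    hits = [
        (text, rect)
        for text, rect in words
        if band_bottom <= (rect[1] + rect[3]) / 2 <= band_top
    ]
    hits.sort(key=lambda item: (-item[1][1], item[1][0]))
    return " ".join(text for text, _ in hits)


if __name__ == "__main__":
    # Hypothetical Letter-size page (612 x 792 points) and a matched word.
    band = full_width_band((100, 700, 140, 712), (0, 0, 612, 792), 60)
    print(band)
```

The first function is the easy part. The second only works for single-column text with clean baselines, which is why going beyond the line where the match was found gets messy.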

try67
Community Expert
January 11, 2018

In theory, yes, but identifying a paragraph in a PDF file is extremely tricky (and sometimes impossible).

I've developed tools for my customers that can do similar things (extract a certain range of words around a matching term, or even a whole sentence), so if you're interested in such a tool feel free to contact me privately (try6767 at gmail.com) so we can discuss it further.
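
For a sense of what "a range of words around a matching term" looks like once the text is out of the PDF, here is a minimal Python sketch. It assumes the page text has already been extracted to a plain string in reading order, which is the hard part; the function names are hypothetical and this is not the tool mentioned above.

```python
import re


def context_snippets(text, term, words_each_side=5):
    """Return up to `words_each_side` words on each side of every match.

    Handles single-word terms only; matching is case-insensitive.
    """
    words = text.split()
    target = term.lower()
    snippets = []
    for i, word in enumerate(words):
        # Strip surrounding punctuation so "valve," still matches "valve".
        if word.strip(".,;:!?()[]\"'").lower() == target:
            start = max(0, i - words_each_side)
            end = min(len(words), i + words_each_side + 1)
            snippets.append(" ".join(words[start:end]))
    return snippets


def sentences_containing(text, term):
    """Return sentences containing `term`, using a naive split on . ! ?"""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    return [s for s in sentences if pattern.search(s)]


if __name__ == "__main__":
    sample = (
        "Install the valve per section 3. The pump shall be tested. "
        "Each valve, once installed, shall be tagged."
    )
    print(context_snippets(sample, "valve", 2))
    print(sentences_containing(sample, "valve"))
```

The sentence splitter is deliberately naive: abbreviations and numbered clauses such as "Sec. 3.2" will split in the wrong place, so the fixed word window is usually the more predictable of the two for spec documents.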