Participating Frequently
January 11, 2018
Question

Is there a way to search, highlight, extract, but instead of extracting the entire page, I only want to highlight/extract the sentence or paragraph the term is contained in?

  • January 11, 2018
  • 7 replies
  • 4103 views

I ultimately want to export a "summary" to Excel for a predefined set of search terms, pulling in the surrounding sentence or paragraph so that the exported list shows the context in which each term is used in the document.

Is this possible?

Thanks,

This topic has been closed for replies.

7 replies

Participating Frequently
January 12, 2018

10-4 I’ll be in touch

Participating Frequently
January 12, 2018

Thank you all for the input

Participating Frequently
January 12, 2018

OK,

What about this idea:

Search, highlight, reference the markup location in the document, and create an image at that location that is page width by X pixels tall.

These images would then be compiled into a new document, which could then be OCR'd and exported to Excel or used as a PDF.

Is that easier? Feasible?

Joel Geraci
Community Expert
January 12, 2018

There are a bunch of problems with the OCR approach which I won't go into.

If your ultimate goal is automation, you don't want to use Acrobat anyway. There are developer toolkits out there that will do page decomposition and convert the PDF drawing instructions into 99.9% correct reading order and/or can read the structure tags to get it 100% correct. They are expensive but if you are doing this manually now, it'll pay for itself in no time.

The Datalogics PDF Java Toolkit is one such library; it's technology from Adobe, marketed by Datalogics.
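
For a feel of what "page decomposition" involves, here is a deliberately naive Python sketch. It is not how the Datalogics toolkit or any commercial library works. It assumes each word is already available with a left-edge x and a baseline y in PDF coordinates (origin at bottom-left); the `PositionedWord` shape, the function name, and the `line_tolerance` default are all made up for illustration.

```python
from typing import List, Tuple

# (text, x, y): x is the left edge, y is the baseline, origin at bottom-left.
# This tuple layout is an assumption for the example.
PositionedWord = Tuple[str, float, float]


def naive_reading_order(words: List[PositionedWord],
                        line_tolerance: float = 3.0) -> List[str]:
    """Group words into lines by baseline, then read lines top to bottom
    and words left to right. Only suitable for single-column pages."""
    lines: List[List[PositionedWord]] = []
    # Visit words from the top of the page down (larger y first).
    for word in sorted(words, key=lambda w: -w[2]):
        # Compare against the first word of the current line.
        if lines and abs(lines[-1][0][2] - word[2]) <= line_tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    # Within each line, order words by their left edge.
    return [" ".join(w[0] for w in sorted(line, key=lambda w: w[1]))
            for line in lines]
```

Multi-column layouts, tables, headers, and rotated text all break this simple rule, which is where the heuristics (or structure tags) come in.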

Participating Frequently
January 12, 2018

Thanks for all this info and feedback.

Would anything above change if the documents were in a somewhat structured format, i.e., technical specifications that are typically generated and organized by spec-writing software to begin with?

Thanks

Joel Geraci
Community Expert
January 12, 2018

Yes, it would help. There would be fewer heuristics needed for the page decomposition, but even so it's still a ton of work.

Thom Parker
Community Expert
January 11, 2018

You'll find some free actions here that perform tasks similar to what you are asking about:

https://acrobatusers.com/actions-exchange

Or if you need a custom tool, it's what we do. Send me a message.

Thom Parker - Software Developer at PDFScripting
Use the Acrobat JavaScript Reference early and often
Thom Parker
Community Expert
January 11, 2018

There are a couple of things you can do along these lines. The first is to set up Acrobat/Reader so that the highlighted text is copied into the comment for the highlight annotation. To do this, set the Acrobat Commenting preference that copies the selected text into the highlight comment. The Preferences dialog is accessed from the Edit menu.

After this, you can get all the text by creating a comment summary, or by writing a script to collect all the text into a CSV file that can then be opened in Excel. Moving this data into a CSV file or some other storage location requires Acrobat Pro. The easiest solution is to create the CSV as a file attachment. You can find more info here: https://www.pdfscripting.com/public/ExcelAndAcrobat.cfm
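
As a rough illustration of the CSV step (in Python rather than Acrobat JavaScript, so it can be tried outside Acrobat), the sketch below turns collected highlight text into CSV that Excel can open. The column names and the `comments_to_csv` function are assumptions for this example, not part of any Acrobat API.

```python
import csv
import io
from typing import Iterable, Tuple


def comments_to_csv(rows: Iterable[Tuple[int, str, str]]) -> str:
    """Serialise (page, term, context) rows as CSV text with a header row."""
    buffer = io.StringIO()
    # The csv module quotes fields containing commas or quotes, which is
    # what keeps a sentence with commas in a single Excel cell.
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["page", "term", "context"])
    for page, term, context in rows:
        writer.writerow([page, term, context])
    return buffer.getvalue()
```

Letting a CSV writer handle the quoting matters here, because the extracted context will almost always contain commas.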

Thom Parker - Software Developer at PDFScripting
Use the Acrobat JavaScript Reference early and often
Participating Frequently
January 12, 2018

Thom,

I am using a version of the search-and-highlight action that you put together (thanks for this, by the way).

My question is: would it be possible to easily modify this script to replace the redact annotation with a shape annotation, such as a rectangle, and set its size properties? This would be done in lieu of highlighting, obviously, but my thought is that this box could be used to capture the underlying text much like a highlight would, while also capturing the additional context I'm after.

If the box could be positioned over the searched term and given dimensions of full page width by X height, it would do exactly what I'm looking for.

Let me know if anyone thinks this would be a viable option.

Thanks,

Joel Geraci
Community Expert
January 12, 2018

That might actually be a viable solution. You could use the annotation as a sort of seed value to define a rectangle that is the width of the page and then a certain y value above and below the annotation, then extract all of the words on the page and detect which ones are within the larger rectangle. You won't necessarily get complete sentences and paragraphs... but you would get the context.
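
The geometry test described above can be sketched as follows. This is plain Python, not Acrobat JavaScript; it assumes each word has already been extracted with its bounding box in PDF points (in Acrobat JavaScript, the documented `getPageNthWord` and `getPageNthWordQuads` methods would be one way to obtain them). The `Word` tuple, `words_in_band`, and the `margin` parameter are hypothetical names.

```python
from typing import List, NamedTuple, Tuple


class Word(NamedTuple):
    text: str
    x0: float  # left
    y0: float  # bottom
    x1: float  # right
    y1: float  # top


def words_in_band(words: List[Word],
                  seed: Tuple[float, float, float, float],
                  margin: float) -> List[str]:
    """Return the text of words whose vertical centre lies within `margin`
    points above or below the seed rectangle (x0, y0, x1, y1).

    The band spans the full page width, so x coordinates are ignored.
    Input order is preserved."""
    _, seed_bottom, _, seed_top = seed
    band_bottom = seed_bottom - margin
    band_top = seed_top + margin
    selected = []
    for w in words:
        centre_y = (w.y0 + w.y1) / 2
        if band_bottom <= centre_y <= band_top:
            selected.append(w.text)
    return selected
```

Testing the vertical centre, rather than requiring the whole word box to sit inside the band, avoids dropping a line just because the band edge clips it slightly.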

try67
Community Expert
January 11, 2018

In theory, yes, but identifying a paragraph in a PDF file is extremely tricky (and sometimes impossible).

I've developed for my customers tools that can do similar things (extract a certain range of words around a matching term, or even a whole sentence), so if you're interested in such a tool feel free to contact me privately (try6767 at gmail.com) so we could discuss it further.
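
A minimal sketch of both ideas (a fixed range of words around a match, and the whole sentence) in Python, assuming the page text has already been extracted as a plain string. This is not try67's tool; the function names and the naive sentence splitter are assumptions, and the splitter will trip over abbreviations such as "e.g.".

```python
import re
from typing import List


def context_windows(text: str, term: str, radius: int) -> List[str]:
    """For each case-insensitive match of the single word `term`, return
    the match plus up to `radius` words on either side."""
    words = text.split()
    target = term.lower()
    windows = []
    for i, word in enumerate(words):
        # Strip surrounding punctuation so "torque," still matches "torque".
        if word.strip(".,;:!?()\"'").lower() == target:
            start = max(0, i - radius)
            windows.append(" ".join(words[start:i + radius + 1]))
    return windows


def sentences_containing(text: str, term: str) -> List[str]:
    """Return sentences containing `term` as a whole word (case-insensitive).

    Sentences are split naively after '.', '!' or '?' followed by
    whitespace, so abbreviations can cause wrong splits."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    return [s for s in sentences if pattern.search(s)]
```

The word-window version is more robust on PDFs where sentence boundaries are unreliable; the sentence version gives cleaner output when the extracted text is well formed.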