Is there a way to search, highlight, extract, but instead of extracting the entire page, I only want to highlight/extract the sentence or paragraph the term is contained in?

Report · Jan 11, 2018

I ultimately want to be able to export a "summary" into Excel for a predefined set of search terms, pulling the surrounding sentence or paragraph so that the exported list shows the context in which each term is used in the document.

Is this possible??

Thanks,

Report · Jan 11, 2018

In theory, yes, but identifying a paragraph in a PDF file is extremely tricky (and sometimes impossible).

I've developed for my customers tools that can do similar things (extract a certain range of words around a matching term, or even a whole sentence), so if you're interested in such a tool feel free to contact me privately (try6767 at gmail.com) so we could discuss it further.

Report · Jan 11, 2018

There are a couple of things you can do along these lines. The first thing is to set up Acrobat/Reader so that the highlighted text is copied into the comment for the highlight annotation. You'll need to enable the Acrobat Commenting preference that copies selected text into highlight comment pop-ups. The Preferences dialog is accessed from the Edit menu.

After this you can get all the text by creating a comment summary, or by writing a script to collect all the text into a CSV file that can then be opened in Excel. Moving this data into a CSV file or some other storage location requires Acrobat Pro. The easiest solution is to create the CSV as a file attachment. You can find more info here: https://www.pdfscripting.com/public/ExcelAndAcrobat.cfm
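As a rough sketch of the scripted route, assuming the comments already hold the copied text: the helper names (`csvField`, `annotsToCsv`, `attachCommentCsv`) and the attachment name `CommentSummary.csv` are made up for illustration, while `getAnnots` and `createDataObject` are the Acrobat JavaScript document methods.

```javascript
// Quote one CSV field: wrap it in double quotes and double any embedded quotes.
function csvField(value) {
    var s = String(value == null ? "" : value);
    return '"' + s.replace(/"/g, '""') + '"';
}

// Build CSV text from annotation-like objects that have page, type and contents.
function annotsToCsv(annots) {
    var rows = [["Page", "Type", "Text"].join(",")];
    for (var i = 0; i < annots.length; i++) {
        var a = annots[i];
        // Acrobat page numbers are 0-based; add 1 for the spreadsheet.
        rows.push([a.page + 1, csvField(a.type), csvField(a.contents)].join(","));
    }
    return rows.join("\r\n");
}

// Acrobat only: store the CSV as a file attachment on the document.
// "CommentSummary.csv" is a hypothetical attachment name.
function attachCommentCsv(doc) {
    var annots = doc.getAnnots() || []; // getAnnots() returns null when there are no comments
    doc.createDataObject("CommentSummary.csv", annotsToCsv(annots));
}
```

From the JavaScript console this would be run as `attachCommentCsv(this);`, after which the attachment can be saved out and opened in Excel.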

Thom Parker - Software Developer at PDFScripting
Use the Acrobat JavaScript Reference early and often

Report · Jan 11, 2018

Thanks, I understand how to do this process manually. What I'm trying to find out is whether there is a way to automate searching for my list of terms and highlighting the surrounding context of each term, instead of doing it by hand, so I can create an export that shows the context in which each term was used.

I have Acrobat XI Pro.

I'm thinking there would be a way to script it somehow using the redact while referencing the tags, or objects, of the document somehow but my java skills are minimal at best.

Please let me know if anyone else has any ideas

Report · Jan 11, 2018

You're just not going to be able to do this in a way that will work consistently across PDF files using JavaScript alone.

While it is possible to infer paragraph or sentence breaks on a page by page basis, consider the situation where a paragraph or sentence crosses a page boundary... or column break. It's not impossible but the page decomposition code is going to be overwhelming especially if you are new to Acrobat JavaScript. Acrobat JavaScript just doesn't have access to the structure information that would be helpful here.

I'd suggest you pay someone to do this for you but that's what I do for a living and I wouldn't accept the job because, unless you have a very, very, very limited set of PDF files to search, I'd never be able to create an acceptable deliverable... and I've been at this for 20 years.

That said, a C-based plugin does have access to the structure information and could be used to develop this solution but you'd need to hire someone with those skills. I'd recommend Thom Parker who is on this thread.

Report · Jan 11, 2018

Thank you for the info and perspective on what I'm looking to do, it's much appreciated.

Before I head down the bigger programming solution rabbit hole, I want to make sure I explore the capabilities of java fully.

Do you believe it would be possible to use the redact with a partial word and then set the character count to a couple hundred characters? I think this would accomplish something similar.

The saved search function also has some capabilities that could be useful, is there a way to manipulate that output?

Thanks again, this is super helpful

Report · Jan 11, 2018

You started in the rabbit hole. When it comes to JavaScript text extraction and inferring semantic word order, the rabbit hole is deep and the rabbit hole is wide.

If you worked under the assumption that the text in the PDF was completely linear... which you shouldn't... you would be able to find the word object that is within the bounding box of the redaction annotation. Then you could get, for example, the 25 words before and after it. That would at least give you some context for the redaction but, as I mentioned earlier, the semantic reading order of the words and the word order returned by JavaScript (not Java) are generally not the same.
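To make that concrete, here is a minimal sketch under exactly that linear-text assumption. The function names are hypothetical; `getPageNumWords`, `getPageNthWord` and `getPageNthWordQuads` are the Acrobat JavaScript document methods. It also assumes the annotation rectangle and the word quads are in the same coordinate space, which should hold for an unrotated page.

```javascript
// Centre point [x, y] of a word, computed from the quads Acrobat returns for it.
function wordCentre(quads) {
    var xMin = Infinity, xMax = -Infinity, yMin = Infinity, yMax = -Infinity;
    for (var q = 0; q < quads.length; q++) {
        for (var k = 0; k < quads[q].length; k += 2) {
            xMin = Math.min(xMin, quads[q][k]);
            xMax = Math.max(xMax, quads[q][k]);
            yMin = Math.min(yMin, quads[q][k + 1]);
            yMax = Math.max(yMax, quads[q][k + 1]);
        }
    }
    return [(xMin + xMax) / 2, (yMin + yMax) / 2];
}

// Index of the first word on the page whose centre lies inside rect, or -1.
function findWordInRect(doc, page, rect) {
    var x0 = Math.min(rect[0], rect[2]), x1 = Math.max(rect[0], rect[2]);
    var y0 = Math.min(rect[1], rect[3]), y1 = Math.max(rect[1], rect[3]);
    var n = doc.getPageNumWords(page);
    for (var i = 0; i < n; i++) {
        var c = wordCentre(doc.getPageNthWordQuads(page, i));
        if (c[0] >= x0 && c[0] <= x1 && c[1] >= y0 && c[1] <= y1) return i;
    }
    return -1;
}

// Words from (index - radius) to (index + radius), in the order Acrobat returns them.
// That order is not guaranteed to be the reading order.
function contextAround(doc, page, index, radius) {
    var n = doc.getPageNumWords(page);
    var start = Math.max(0, index - radius);
    var end = Math.min(n - 1, index + radius);
    var out = [];
    for (var i = start; i <= end; i++) {
        out.push(doc.getPageNthWord(page, i, true)); // true strips punctuation and white space
    }
    return out.join(" ");
}
```

In Acrobat, `contextAround(this, annot.page, findWordInRect(this, annot.page, annot.rect), 25)` would give the 25 words on either side, clipped at the page boundaries.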

If you're taking in PDF from random sources, you're going to get random results.

Report · Jan 11, 2018

Joel hasn't mentioned the next biggest issue with this, which is performance. You could conceivably write a script that acquires all the words and their positions, and does the necessary analysis to determine correct order and document structure. But besides the fact that this is a horrendously difficult bit of programming, it would take a very long time to run.

I know because I've actually written this type of program before as a C++ plug-in, and it was slow. JS is 100 times slower, literally.

Thom Parker - Software Developer at PDFScripting
Use the Acrobat JavaScript Reference early and often

Report · Jan 12, 2018

Thom,

I am using a version of the search, highlight, action that you put together. (thanks for this by the way)

My question is: would it be possible to easily modify this script to replace the redact annotation with a shape annotation such as a rectangle, and to set its size properties? This would be done in lieu of highlighting, obviously, but my thought is that this box could be used to capture the underlying text much like a highlight would, while also capturing the additional context I'm after.

If the box could be located over the searched term and given dimensions of full page width by X height, it would do exactly what I'm looking for.

Let me know if anyone thinks this would be a viable option.

Thanks,

Report · Jan 12, 2018

That might actually be a viable solution. You could use the annotation as a sort of seed value to define a rectangle that is the width of the page and then a certain y value above and below the annotation, then extract all of the words on the page and detect which ones are within the larger rectangle. You won't necessarily get complete sentences and paragraphs... but you would get the context.
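A minimal sketch of that idea follows. The function names and the `pad` parameter are hypothetical; the `doc` calls are the Acrobat JavaScript word methods. It assumes an unrotated page where the page box, the annotation rectangle and the word quads all share one coordinate space, which will not hold for every PDF.

```javascript
// Centre point [x, y] of a word, computed from the quads Acrobat returns for it.
function wordCentre(quads) {
    var xMin = Infinity, xMax = -Infinity, yMin = Infinity, yMax = -Infinity;
    for (var q = 0; q < quads.length; q++) {
        for (var k = 0; k < quads[q].length; k += 2) {
            xMin = Math.min(xMin, quads[q][k]);
            xMax = Math.max(xMax, quads[q][k]);
            yMin = Math.min(yMin, quads[q][k + 1]);
            yMax = Math.max(yMax, quads[q][k + 1]);
        }
    }
    return [(xMin + xMax) / 2, (yMin + yMax) / 2];
}

// Grow the seed rectangle to the full page width, plus pad points above and below.
// Returns [xMin, yMin, xMax, yMax].
function contextBand(seedRect, pageBox, pad) {
    var left = Math.min(pageBox[0], pageBox[2]);
    var right = Math.max(pageBox[0], pageBox[2]);
    var yLow = Math.min(seedRect[1], seedRect[3]) - pad;
    var yHigh = Math.max(seedRect[1], seedRect[3]) + pad;
    return [left, yLow, right, yHigh];
}

// Collect every word on the page whose centre falls inside the band,
// in the order Acrobat returns the words.
function wordsInBand(doc, page, band) {
    var out = [];
    var n = doc.getPageNumWords(page);
    for (var i = 0; i < n; i++) {
        var c = wordCentre(doc.getPageNthWordQuads(page, i));
        if (c[0] >= band[0] && c[0] <= band[2] && c[1] >= band[1] && c[1] <= band[3]) {
            out.push(doc.getPageNthWord(page, i, true)); // true strips punctuation and white space
        }
    }
    return out.join(" ");
}
```

In Acrobat this might be called as `wordsInBand(this, annot.page, contextBand(annot.rect, this.getPageBox("Crop", annot.page), 36))`, where 36 points is half an inch of context on each side. Every word on the page is inspected for each annotation, so expect it to be slow on long documents.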

Report · Jan 12, 2018

Thanks, is this a terribly difficult thing to do?

I really don't care much about full sentences or paragraphs; I just want to be able to "summarize" a 1,000-page PDF into 20-30 pages of "snippets" that reflect where and how my list of search terms is being used in the document. Since the document will be a technical spec, the context in which a term is used will be easy to identify from a very small amount of adjacent text.

It is also easy enough to dig deeper into the anomalies that are very clear in this type of summarized data.

I think the extraction function of the above method is already built into Acrobat's functionality because it will grab the text within the "box" the same way as highlighting the text does. Once it's annotated this way, there are lots of easy ways to get these annotations into other usable formats such as Excel.

I just have no idea how to replace a highlight annotation with a specific sized box.

Thanks,

Report · Jan 12, 2018

You're making a lot of assumptions there, which are not very well based.

Yes, it's fairly easy to take the coordinates of a word and enlarge them to a larger "box", but extracting the text within that box is not that simple at all. There's no relation (or very little relation) between what Acrobat can do on its own and what can be done using a script. That's not to say it's impossible, but it's not a trivial task, and there might be a lot of issues if you go beyond the line where the match was found.

Report · Jan 12, 2018

I guess what I'm saying is that the script doesn't have to do any extraction.

For example, if I just draw a rectangle drawing markup on the PDF, the text within it gets populated into the comments area of the markup annotation itself (as long as the preference is set correctly). All of that text is then visible and part of the markup. From that point on I can just export the markups/annotations to reassemble the summary and create my deliverable as needed.

Am I missing something on that?

Report · Jan 12, 2018

Yes. When you create an annotation using a script you can't use that function that automatically copies the selected text into it.

That only works when you create a comment manually.

That is why I've developed a separate script that allows you to do it retroactively:

Custom-made Adobe Scripts: Acrobat -- Retroactively Copy Highlighted Text into Comments
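The linked tool is not reproduced here. Purely as an illustration of the general idea, and assuming highlight annotations whose `quads` share a coordinate space with the page's word quads, a retroactive copy could be sketched like this (the function names are hypothetical):

```javascript
// Bounding box [xMin, yMin, xMax, yMax] of one quad (a flat list of x, y pairs).
function quadBounds(quad) {
    var b = [Infinity, Infinity, -Infinity, -Infinity];
    for (var k = 0; k < quad.length; k += 2) {
        b[0] = Math.min(b[0], quad[k]);
        b[1] = Math.min(b[1], quad[k + 1]);
        b[2] = Math.max(b[2], quad[k]);
        b[3] = Math.max(b[3], quad[k + 1]);
    }
    return b;
}

// For every Highlight comment with empty contents, copy in the words whose
// centres lie under the highlight. Returns the number of comments filled.
function fillHighlightContents(doc) {
    var annots = doc.getAnnots() || []; // getAnnots() returns null when there are no comments
    var filled = 0;
    for (var a = 0; a < annots.length; a++) {
        var annot = annots[a];
        if (annot.type !== "Highlight" || annot.contents) continue;
        var words = [];
        var n = doc.getPageNumWords(annot.page);
        for (var i = 0; i < n; i++) {
            // Only the first quad of each word is used to locate it.
            var wb = quadBounds(doc.getPageNthWordQuads(annot.page, i)[0]);
            var cx = (wb[0] + wb[2]) / 2, cy = (wb[1] + wb[3]) / 2;
            for (var q = 0; q < annot.quads.length; q++) {
                var hb = quadBounds(annot.quads[q]);
                if (cx >= hb[0] && cx <= hb[2] && cy >= hb[1] && cy <= hb[3]) {
                    words.push(doc.getPageNthWord(annot.page, i, true));
                    break;
                }
            }
        }
        if (words.length > 0) {
            annot.contents = words.join(" ");
            filled++;
        }
    }
    return filled;
}
```

Run from the console as `fillHighlightContents(this);`. Words come back in Acrobat's word order with punctuation stripped, so the result is context rather than an exact copy of the highlighted text.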

Report · Jan 11, 2018

You'll find some free actions here that perform tasks similar to what you are asking about:

https://acrobatusers.com/actions-exchange

Or if you need a custom tool, it's what we do. Send me a message.

Thom Parker - Software Developer at PDFScripting
Use the Acrobat JavaScript Reference early and often

Report · Jan 11, 2018

Thanks for all this info and feedback.

Would anything change in the above if the documents were in a somewhat structured format, i.e. technical specifications, which are typically generated and organized by spec-writing software to begin with?

thanks

Report · Jan 11, 2018

Yes. It would help... there'd be fewer heuristics for the page decomposition but even so it's still a ton of work.

Report · Jan 11, 2018

OK,

what about this idea:

search, highlight, reference the markup location in the doc, and create an image at that location that’s page width by x number of pixels tall.

These images would then be compiled into a new doc, which could then be OCR'd and exported to Excel or used as a PDF?

Easier? Feasible?

Report · Jan 12, 2018

There are a bunch of problems with the OCR approach which I won't go into.

If your ultimate goal is automation, you don't want to use Acrobat anyway. There are developer toolkits out there that will do page decomposition and convert the PDF drawing instructions into 99.9% correct reading order and/or can read the structure tags to get it 100% correct. They are expensive but if you are doing this manually now, it'll pay for itself in no time.

The Datalogics PDF Java Toolkit is one such library, it's technology from Adobe and marketed by Datalogics.

Report · Jan 11, 2018

Thank you all for the input

Report · Jan 12, 2018

10-4 I’ll be in touch