Find Specific Text String and Automatically Create Bookmarks Based on That String

Report · Apr 03, 2020

Is it possible in Acrobat to automatically create/insert bookmarks when a particular string (e.g., "Order #") is encountered? I am trying to create individual work order files from one large PDF file (using Split Document into multiple files using bookmarks), but I need to create a bookmark each time the string "Order #" is encountered. Because the pages vary based on the work order specs (drawings, material needed, instructions, etc.), this text string is not located in a predictable spot on each page. Once "Order #" is found, I need to insert a bookmark that includes "Order #" and the next 9 characters that come after it. I know how to do it manually, but there could be 100 or more orders in one file. Any help is greatly appreciated...thanks!

Report · Apr 03, 2020

Yes, if the text can be identified based on a specific pattern then it should be possible, but will require a custom-made script.

By the way, a script can just split the file directly. There's no need to create bookmarks and then use the Split Document command based on that...

I've developed many similar tools for my clients and would be happy to create one for you as well (for a fee, of course).

You can contact me privately via [try6767 at gmail.com] to discuss it further.

Report · Apr 03, 2020

See the search and highlight Action here. If it can be used to find your words, then it's a short trip to splitting the PDF.

https://www.acrobatusers.com/actions-exchange/

Thom Parker - Software Developer at PDFScripting
Use the Acrobat JavaScript Reference early and often

Report · Apr 13, 2020

Thank you for getting back to me so quickly. 🙂

I have been able to use the Action you referred me to, but instead of having it search for just "Order #", I need it to also highlight the 9 characters afterwards (1 space and 8 digits that identify each work order) so that when I split the file into multiple PDFs, each one has "Order #" plus the 8-digit work order number. How do I do this?

Thanks again!

Report · Apr 13, 2020

That's more complicated. To do that you need to either write a custom JavaScript search, or specify a custom redaction search pattern.

The custom redaction search pattern is easier:

https://blogs.adobe.com/acrolaw/2011/05/creating_and_using_custom_redact/

Thom Parker - Software Developer at PDFScripting

Report · Apr 21, 2020

I have tried for over a week to figure out how to either write a JavaScript script (which I am completely new to) or create a custom redaction pattern, but I just end up getting more confused. I need to do one of the following:

1) automatically insert a bookmark at each text string ("Order # "), which doesn't seem like it would be a difficult task, or

2) use the find, highlight and extract JavaScript, which I can get to highlight "Order # " plus the 8-digit number that follows, but which will not do the extract portion to individual files, or lastly

3) create a custom redaction pattern. I have located the XML file and added a new Entry 6, but can't figure out how to make it search for the text string "Order # 12345678" and either insert a bookmark or extract the pages from this point to the next occurrence of "Order #".

Please believe that it is not for lack of trying, but I really need to get this figured out and need to know which method is the easiest to pursue, and the finishing step to achieve it. I guess I really need to get a "Javascript for Dummies" book, since it isn't as easy as VBA to pick up on. Thank you again for any input. 🙂

Report · Apr 22, 2020

If you are getting the find and highlight to work then you are very close. Did you look in the console to see if there were any errors? Did you try the other Action that extracts commented pages?

If you need help with this (and have a budget) then contact me through www.windjack.com. I can get the Action customized for exactly what you want.

Thom Parker - Software Developer at PDFScripting

Report · May 04, 2020

So I've had to change my approach somewhat because the only (mostly) consistent location for the work order number is the last word on each page. I understand that I need to use getPageNumWords(p) to get the number of words on page p and then use getPageNthWord(p, n), where n = getPageNumWords(p) - 1. There will, however, be some pages that do not have the WO number on them, so I would like those to default to the WO number on the page before. Using the Extract example I came up with the following (please bear in mind that this is my first attempt at JavaScript code):

WoNo = 0
var NumWrds = 0
var finalpage = 0
var count = 0
cPath: this.path

//For each page in document, check whether specific words meet criteria
for (var p = 0; p < this.numPages; p++) {
    NumWrds = getPageNumWords(p)
    WoNo = getPageNthWord(p, NumWrds - 1)
    if (this.getPageNthWord(p, NumWrds - 1) == WoNo) {
        count++;
        finalpage = p;
    }
    else {
        WoNo = getPageNthWord(p-1, NumWrds - 1);
        finalpage = p;
    }

    //Find page position of next break point
    for (var p2 = p + 1; p2 < this.numPages; p2++) {
        if (this.getPageNthWord(p2, NumWrds) == WoNo) {
            this.extractPages({
                nStart: p,
                nEnd: p2-1,
                cPath: WoNo + " " + ".pdf"
            });
            console.println("Extracted " + WoNo + " " + " pp " + p + " to " + p2)
            break
        }

//Save final section after last time run through
this.extractPages({
    nStart: finalpage,
    nEnd: this.numPages - 1,
    cPath: count + " " + WoNo + " " + ".pdf"
});
console.println("Extracted " + WoNo + " " + " pp " + finalpage + " to " + (this.numPages - 1))

The results are going to the first "null" page that doesn't have a WO # on it and then extracts all the WO pages after it (which do have WO #'s on them) for 34 additional pages. The file is 116 pages. What am I doing wrong?! Please help...

Report · May 04, 2020

The following condition is always true:

if (this.getPageNthWord(p, NumWrds - 1) == WoNo) {

What do you want to compare here?

Report · May 05, 2020

For your first attempt at JS, you've written a lot of advanced code. I'd suggest backing off a bit and doing some testing in the Console window to get a handle on your process.

So as Bernd says, the comparison is meaningless because it compares a value to itself.

What you need to do is test the result for a valid order number format. Use a regular expression:

https://www.pdfscripting.com/public/Pattern-Matching-with-Regular-Expressions.cfm

You also need to verify that the order number is really the last word, and the format it is in when acquired, i.e. punctuation, white space, etc. Use the Console.

Thom Parker - Software Developer at PDFScripting

Report · Apr 04, 2020

If I understand correctly, there is no point in creating bookmarks since what you want to do is extract pages based on their content.
In this case you are lucky because it is precisely the subject of this thread which provides several versions of scripts to do this.

Google translate is your friend: https://abracadabrapdf.net/forum/index.php/topic,3410.0.html

Acrobate du PDF, InDesigner et Photoshopographe

Report · May 05, 2020

Is it not possible just to do a nested calculation with getPageNumWords and getPageNthWord to get the last word on the page, for example, getPageNthWord(p, (getPageNumWords(p) - 1))? Then, if the result does not resemble 2#######, the value from the previous page is used?

Report · May 05, 2020

You can do what you want to do with a script. No Problem.

But a calculation script is not the correct location for this type of code. This needs to be either a batch or folder level script.

Like I said earlier, you need to do a bit of code testing in the console window.

Run this code in the console while a page with the order number is displayed.

this.getPageNthWord(this.pageNum, (this.getPageNumWords(this.pageNum) - 1))

What is the exact text that is returned?

When you can verify this, you can then create a regular expression to identify the order number.

And then we can help you to design a complete script to perform this task.

Thom Parker - Software Developer at PDFScripting
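For readers following along, the console test above can be wrapped in a small helper. This is only a sketch: `lastWordOnPage` is a hypothetical name, and the zero-word guard is an added assumption to cover pages that contain no extractable text (scanned drawings, for example), where the index would otherwise be -1.

```javascript
// Hypothetical helper (not part of the Acrobat API).
// `doc` is any object that exposes getPageNumWords and getPageNthWord,
// such as `this` in the Acrobat JavaScript console.
function lastWordOnPage(doc, pageNum) {
  var wordCount = doc.getPageNumWords(pageNum);
  if (wordCount === 0) {
    return null; // no text on this page, so there is no last word
  }
  // Word indexes are zero-based, so the last word is at wordCount - 1
  return doc.getPageNthWord(pageNum, wordCount - 1);
}

// In the Acrobat console, for the page currently displayed:
//   lastWordOnPage(this, this.pageNum)
```

Returning null for an empty page makes it easy to treat "no text" and "last word is not an order number" the same way later on.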

Report · May 07, 2020

If I use this.getPageNthWord(this.pageNum, (this.getPageNumWords(this.pageNum) - 1)) I get exactly what I need, the work order number, EXCEPT on a few pages that do not have the WO number anywhere on the page (drawings, maps, etc.). In the case where the page does not have a WO number, I would like to use the WO number from the previous page, since these unmarked pages come after the main WO pages. I'm thinking an IF...ELSE statement could handle this, but I'm not sure what the exact code needs to be. Thank you for taking the time to help me.

Report · May 07, 2020

So, the idea is to acquire the last word on the page and then test it with a regular expression to determine whether or not it is a WO number.

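var rgWONum = /...../; // placeholder: fill in the work order pattern
var cWONum = null, cLastWord;

for (var pg = 0; pg < this.numPages; pg++)
{
    // Last word on the page (word indexes are zero-based)
    cLastWord = this.getPageNthWord(pg, this.getPageNumWords(pg) - 1);
    // Keep the previous WO number unless this page supplies a new one
    if (rgWONum.test(cLastWord))
        cWONum = cLastWord;
}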

There's the basic script, you'll need to fill out the regular expression. Here's an article on the topic:

https://www.pdfscripting.com/public/Pattern-Matching-with-Regular-Expressions.cfm

Thom Parker - Software Developer at PDFScripting
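The carry-forward idea in the basic script can be tried outside Acrobat by working on a plain array of last words. The sketch below is an illustration, not the poster's code: `assignOrderNumbers` is a hypothetical name, and the sample pattern assumes the order number is eight digits starting with 2, as described in the thread.

```javascript
// Hypothetical helper: given the last word of every page, return the work
// order number that applies to each page. A page whose last word does not
// match the pattern inherits the number from the previous page.
// The pattern should not use the "g" flag, because test() would then
// remember its position between calls.
function assignOrderNumbers(lastWords, pattern) {
  var current = null; // stays null until the first match is seen
  var result = [];
  for (var i = 0; i < lastWords.length; i++) {
    var word = lastWords[i];
    if (word !== null && pattern.test(word)) {
      current = word;
    }
    result.push(current);
  }
  return result;
}

// Made-up sample: pages 1 and 2 (zero-based) have no order number.
var perPage = assignOrderNumbers(
  ["20001234", "Drawing", null, "20005678"],
  /^2\d{7}$/ // assumed format: eight digits starting with 2
);
// perPage is ["20001234", "20001234", "20001234", "20005678"]
```

Pages that come before the first recognised order number stay null, so a caller can decide whether to skip them or attach them to the first order.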

Report · May 07, 2020

So if the WO is always 8 digits that start with a 2, it would be /2\ddddddd/?

Report · May 07, 2020

No. It would be:

/^2\d{7}$/
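To see why the two patterns differ, it can help to run both against a few sample strings. The sample values below are made up for illustration. In the first pattern only the first `\d` is a digit class, and the remaining six `d` characters are literal letters.

```javascript
var loose = /2\ddddddd/;  // "2", one digit, then six literal "d" characters
var strict = /^2\d{7}$/;  // "2" followed by exactly seven digits and nothing else

// Made-up sample strings
var samples = ["21234567", "2123456", "212345678", "25dddddd"];

var results = [];
for (var i = 0; i < samples.length; i++) {
  results.push({
    text: samples[i],
    loose: loose.test(samples[i]),
    strict: strict.test(samples[i])
  });
}
// strict matches only "21234567" (too short and too long both fail);
// loose matches only "25dddddd".
```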

Report · May 08, 2020

The work order in the last word is always preceded by the date (mm/dd/yyyy) the report was created, so would I still need the ^ before the 2, since it's not the beginning of the line? Wow, I was way off on my first guess...thanks for helping me!
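Whether the anchors are needed depends on the string being tested, not on the line in the PDF. `^` and `$` refer to the start and end of whatever string is passed to `test()`. If the last word returned by getPageNthWord is just the number (as the console test earlier in the thread suggests), the anchored pattern works as written. The sketch below also shows an assumed alternative for longer strings that contain the date. `findOrderNumber` is a hypothetical helper and the sample strings are made up.

```javascript
var anchored = /^2\d{7}$/;   // the whole string must be the order number
var embedded = /\b2\d{7}\b/; // the order number may sit inside a longer string

// Hypothetical helper: pull the order number out of a longer string,
// or return null when none is present.
function findOrderNumber(text) {
  var match = text.match(embedded);
  return match ? match[0] : null;
}

var wordOnly = anchored.test("21234567");            // true
var withDate = anchored.test("05/08/2020 21234567"); // false: the date comes first
var found = findOrderNumber("05/08/2020 21234567");  // "21234567"
var missing = findOrderNumber("Site map, sheet 2");  // null
```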

Report · Aug 05, 2020

I have finally created a custom redaction pattern that I can use to find and mark for redaction the work order numbers that I need to be used as the new file names. My question now is how do I extract each group of pages with the same work order into multiple files and name the new files after each work order. The Find, Highlight and Extract action only puts them into one file, and I need them to create a new file per work order number. Thanks for any help you can give me.

Report · Aug 05, 2020

The next step after running the redaction search is to loop through all the redact annots. Use the annot rectangle to find the text at that location, i.e. the order number. Then find all the pages associated with this number and extract them to a separate file. Repeat until you've run through the annots.

Thom Parker - Software Developer at PDFScripting
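The final "find the pages for each number and extract them" step might look like the sketch below. It swaps in a simpler starting point than the one described above: instead of reading the redact annotations, it assumes an array holding one order number per page has already been built (reading the annotations is not shown). `buildRanges` and `extractRanges` are hypothetical names, and the file naming is an assumption. `extractPages` with `nStart`, `nEnd` and `cPath` is the same call used in the script posted earlier in the thread.

```javascript
// Collapse a per-page list of order numbers into contiguous page ranges.
function buildRanges(orderByPage) {
  var ranges = [];
  for (var p = 0; p < orderByPage.length; p++) {
    var last = ranges[ranges.length - 1];
    if (last && last.order === orderByPage[p]) {
      last.end = p; // same order as the previous page: extend the range
    } else {
      ranges.push({ order: orderByPage[p], start: p, end: p });
    }
  }
  return ranges;
}

// Write each range to its own file named after the order number.
// Ranges with no order number (null) are skipped.
function extractRanges(doc, ranges) {
  var written = [];
  for (var i = 0; i < ranges.length; i++) {
    var r = ranges[i];
    if (r.order === null) {
      continue;
    }
    var fileName = r.order + ".pdf"; // assumed naming scheme
    doc.extractPages({ nStart: r.start, nEnd: r.end, cPath: fileName });
    written.push(fileName);
  }
  return written;
}

// In Acrobat, from a batch or folder-level script as advised in the thread:
//   extractRanges(this, buildRanges(orderByPage));
```

If the same order number shows up in two separate runs of pages, this sketch would write the same file name twice, so a real script would need to merge those runs or add a suffix.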