Split large pdf on repeated text pattern, and save new pdf with custom filename

Report · Sep 15, 2017

I have Acrobat Pro DC

I have a problem in my current organisation, which uses a very old-fashioned HR system for recruitment. Our HR system compiles one massive report of all the job applications for a recent post: the pdf is 1700+ pages long, containing distinct sections (of variable length) for over 200 applicants.

I want to split this into one pdf per applicant, with the filename of each document being the applicant's name.

For each new application, a consistently formatted divider page exists as follows:

Applicant : Smith, John

Vacancy ID : 15535

The text 'Vacancy ID' only exists on these divider pages, so it can be used to identify where to split the document.

The applicant's name, which occurs on a previous line, starts at character 10 and is variable length. In fact it can be acquired with getPageNthWord(page,3) and getPageNthWord(page,4)
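
To check which word index holds which part of the name on a divider page, a small helper along these lines could be run from the JavaScript console. This is only a sketch: `listPageWords` is a made-up name, it assumes the standard `getPageNumWords` and `getPageNthWord` document methods, and the indexes it prints depend on the page layout and on how Acrobat splits the text into words.

```javascript
// Hypothetical helper (not part of the Acrobat API): list the first few words
// on a page together with their indexes, so the right arguments for
// getPageNthWord can be read off the output.
function listPageWords(doc, pageNum, maxWords) {
  var total = doc.getPageNumWords(pageNum);   // number of words on the page
  var limit = Math.min(total, maxWords);
  var lines = [];
  for (var i = 0; i < limit; i++) {
    lines.push(i + ": " + doc.getPageNthWord(pageNum, i));
  }
  return lines;
}

// In the Acrobat console, with the report open (page numbers are 0-based):
//   console.println(listPageWords(this, 0, 12).join("\n"));
```

Passing the document in as `doc` rather than relying on `this` keeps the helper usable from the console, where `this` is the active document.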

How easy would it be to create some javascript to run in an action which would do the following:

Identify text "Vacancy ID"
Split document at that point, saving the pages from current page (typically 5, though not always) up to page before next instance of "Vacancy ID"
Extract applicant name from previous line
Save individual pdf for each applicant, using applicant name

Can this be done, or has it been done already? Thanks

Report · Sep 15, 2017

If the pages are consistent and the text readable (i.e., not part of a scanned image), then yes, it can most likely be done.

I've developed many similar tools for my clients in the past, so if you wish to send me some sample pages (to try6767 at gmail.com) I'll be happy to let you know if I think it's doable or not, and if so, for how much.

Report · Sep 16, 2017

Thanks. Unfortunately I don't have a budget for this work so I figured it out myself. Here is the solution in case anyone else needs to do something similar. Obviously you will need to tweak the code for your scenario. I ran this in the JavaScript debugger console, following the instructions on this site (e.g. select the code and press Ctrl+Enter): https://acrobatusers.com/tutorials/javascript_console

In short, this script does the following:

For each page in document, look for the word "Vacancy" at word number 8
If that exists, check that the next word (9) is "ID". This means we've found the text "Vacancy ID"
Extract first name and surname from fixed positions on the same page
Now continue through the document until we find the next instance of "Vacancy ID"
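Make a note of its page number (p2). The section to extract ends on the page before it, which gives the nEnd argument for the extractPages() function
Finally, extract the last applicant's section, which runs from the last divider page to the end of the document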

I'm sure there are lots of better ways of doing it, but this works for me, it took about an hour, and I didn't have to pay anyone (sorry try67). Also, someone else might be able to use this for free in future. Let me know if you have any problems and I'll try to help. I've never used JavaScript before but it doesn't seem to be too hard. Debugging in Acrobat however is AWFUL! Good luck.

var firstName = ""
var surName = ""
var finalpage = 0
var count = 0
//For each page in document, check whether specific words meet criteria
for (var p = 0; p < this.numPages; p++) {
  if (this.getPageNthWord(p, 8) == "Vacancy") {
    if (this.getPageNthWord(p, 9) == "ID") {
      count++;
      firstName = this.getPageNthWord(p, 3);
      surName = this.getPageNthWord(p, 2);
      finalpage = p;
      //Find page position of next break point
      for (var p2 = p + 1; p2 < this.numPages; p2++) {
        if (this.getPageNthWord(p2, 8) == "Vacancy") {
          if (this.getPageNthWord(p2, 9) == "ID") {
            this.extractPages({
              nStart: p,
              nEnd: p2-1,
              cPath: count + " " + firstName + " " + surName + ".pdf"
            });
            //The section ends on the page before the next divider
            console.println("Extracted " + firstName + " " + surName + " pp " + p + " to " + (p2 - 1));
            break;
          }
        }
      }
    }
  }
}
//Save final section after the last divider page (skipped if no divider page was found)
if (count > 0) {
  this.extractPages({
    nStart: finalpage,
    nEnd: this.numPages - 1,
    cPath: count + " " + firstName + " " + surName + ".pdf"
  });
  console.println("Extracted " + firstName + " " + surName + " pp " + finalpage + " to " + (this.numPages - 1));
}
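
The same split can also be done in a single pass, by first collecting the divider pages and then deriving each applicant's page range from them. What follows is a sketch of that idea, not a tested replacement for the script above: `findDividerPages`, `toRanges`, `safeName` and `splitByDivider` are made-up names, the word positions 2 and 3 for the name are taken from the script above, and searching the whole page for "Vacancy" followed by "ID" (instead of relying on positions 8 and 9) is an assumption about what is more robust. Pages before the first divider are ignored, as in the original.

```javascript
// Return the 0-based numbers of pages that contain "Vacancy" followed by "ID".
function findDividerPages(doc) {
  var pages = [];
  for (var p = 0; p < doc.numPages; p++) {
    var n = doc.getPageNumWords(p);
    for (var w = 0; w + 1 < n; w++) {
      if (doc.getPageNthWord(p, w) == "Vacancy" &&
          doc.getPageNthWord(p, w + 1) == "ID") {
        pages.push(p);
        break; // one match per page is enough
      }
    }
  }
  return pages;
}

// Turn divider page numbers into inclusive [start, end] ranges. Each section
// runs from its divider to the page before the next one (or to the last page).
function toRanges(dividers, numPages) {
  var ranges = [];
  for (var i = 0; i < dividers.length; i++) {
    var end = (i + 1 < dividers.length) ? dividers[i + 1] - 1 : numPages - 1;
    ranges.push([dividers[i], end]);
  }
  return ranges;
}

// Replace characters that Windows does not allow in file names (assumption:
// names in the report could contain them).
function safeName(s) {
  return String(s).replace(/[\\\/:*?"<>|]/g, "_");
}

// Extract one file per applicant; returns the number of files written.
function splitByDivider(doc) {
  var ranges = toRanges(findDividerPages(doc), doc.numPages);
  for (var i = 0; i < ranges.length; i++) {
    var start = ranges[i][0];
    var firstName = doc.getPageNthWord(start, 3); // positions as in the script above
    var surName = doc.getPageNthWord(start, 2);
    doc.extractPages({
      nStart: start,
      nEnd: ranges[i][1],
      cPath: safeName((i + 1) + " " + firstName + " " + surName) + ".pdf"
    });
  }
  return ranges.length;
}

// In the Acrobat console, with the report open: splitByDivider(this);
```

Keeping the range calculation in `toRanges` separate from the `extractPages` calls makes it possible to print and check the ranges before any files are written.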

Report · Nov 18, 2019

Perhaps you can help me. I have a problem similar to the one you had.
I have a file that has several pages. At random intervals there are pages that contain the word "PageBreak".
I'm looking for a script that will:
1. For each page in document, look for the word "PageBreak"
2. Extract all pages before and including the page with the first instance of "PageBreak" into a new document.
3. Continue through the document until we find the next instance of "PageBreak" and repeat step 2
I have attached a file. In that file, pages 1 and 2 would be extracted to a new file,
pages 3-5 would be extracted to another new file,
pages 6-7 would be a new file, and page 8 would be left over and should go in a new file by itself.
I understand you are not an expert, and neither am I, so I could really use some guidance on getting this accomplished.
Any help would be greatly appreciated.
Thank you in advance.
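
The three steps above could be sketched roughly as follows. This is an untested outline under several assumptions: `pageHasWord`, `rangesEndingAtMarker` and `splitAtMarker` are made-up names, "PageBreak" is assumed to come back from `getPageNthWord` as a single word, the text must be real text and not a scanned image, and the `part1.pdf`, `part2.pdf` naming is only a placeholder.

```javascript
// True if any word on the page equals the marker text.
function pageHasWord(doc, pageNum, marker) {
  var n = doc.getPageNumWords(pageNum);
  for (var w = 0; w < n; w++) {
    if (doc.getPageNthWord(pageNum, w) == marker) {
      return true;
    }
  }
  return false;
}

// Build inclusive [start, end] ranges (0-based) that each end on a marker
// page. Any pages after the last marker form a final range of their own.
function rangesEndingAtMarker(doc, marker) {
  var ranges = [];
  var start = 0;
  for (var p = 0; p < doc.numPages; p++) {
    if (pageHasWord(doc, p, marker)) {
      ranges.push([start, p]);
      start = p + 1;
    }
  }
  if (start < doc.numPages) {
    ranges.push([start, doc.numPages - 1]);
  }
  return ranges;
}

// Extract each range to its own file and return the ranges that were used.
function splitAtMarker(doc, marker) {
  var ranges = rangesEndingAtMarker(doc, marker);
  for (var i = 0; i < ranges.length; i++) {
    doc.extractPages({
      nStart: ranges[i][0],
      nEnd: ranges[i][1],
      cPath: "part" + (i + 1) + ".pdf" // placeholder naming scheme
    });
  }
  return ranges;
}

// In the Acrobat console, with the file open: splitAtMarker(this, "PageBreak");
```

The difference from the applicant script earlier in the thread is that each range here ends on the marker page instead of starting on it, so the range start is simply the page after the previous marker.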

Report · Nov 18, 2019

This is possible, but not a simple project if you don't have any experience at all in writing Acrobat JavaScript code. The offer I made above still stands... I'm happy to develop this tool for you, for a small fee. My contact details are the same.

Adobe Community
