Split large PDF on repeated text pattern, and save each new PDF with a custom filename

Explorer, Sep 15, 2017


I have Acrobat Pro DC

I have a problem in my current organisation, which uses a very old-fashioned HR system for recruitment. The system compiles one massive report of all the job applications for a recent post: the PDF is 1700+ pages long and contains a distinct section (of variable length) for each of over 200 applicants.

I want to split this into one PDF per applicant, with the filename of each document being the applicant's name.

For each new application, a consistently formatted divider page exists as follows:

Applicant : Smith, John

Vacancy ID : 15535

The text 'Vacancy ID' only exists on these divider pages, so it can be used to identify where to split the document.

The applicant's name, which occurs on the previous line, starts at character 10 and is of variable length. In fact, it can be acquired with getPageNthWord(page,3) and getPageNthWord(page,4).
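If the word positions turn out not to be perfectly fixed, one option is to scan the page's words for the marker rather than rely on a hard-coded index. The following is only a minimal sketch: `getPageNumWords` and `getPageNthWord` are Acrobat JavaScript Doc methods, while `getPageWords` and `findWordPair` are hypothetical helper names introduced here for illustration.

```javascript
// Collect every word on a page into an array.
// `doc` is an Acrobat Doc object (for example `this` in the console).
function getPageWords(doc, pageNum) {
  var words = [];
  var n = doc.getPageNumWords(pageNum);
  for (var i = 0; i < n; i++) {
    words.push(doc.getPageNthWord(pageNum, i));
  }
  return words;
}

// Return the index of the first word of a two-word phrase, or -1 if absent.
function findWordPair(words, first, second) {
  for (var i = 0; i + 1 < words.length; i++) {
    if (words[i] === first && words[i + 1] === second) {
      return i;
    }
  }
  return -1;
}
```

In the console this might be called as `findWordPair(getPageWords(this, 0), "Vacancy", "ID")`. A result of -1 means the page is not a divider page, and any other result gives a position from which the name words could be located by offset, assuming the layout of the divider page is consistent.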

How easy would it be to create some JavaScript to run in an action which would do the following:

  1. Identify the text "Vacancy ID"
  2. Split the document at that point, saving the pages from the current page up to the page before the next instance of "Vacancy ID" (typically 5 pages, though not always)
  3. Extract the applicant name from the previous line
  4. Save an individual PDF for each applicant, using the applicant name as the filename

Can this be done, or has it been done already? Thanks

TOPICS
Acrobat SDK and JavaScript, Windows

Community Guidelines
Be kind and respectful, give credit to the original source of content, and search for duplicates before posting. Learn more
community guidelines

1 Correct Answer: the reply from Explorer dated Sep 16, 2017, given in full further down this thread.

Most Valuable Participant, Sep 15, 2017


If the pages are consistent and the text is readable (i.e., not part of a scanned image), then yes, it can most likely be done.
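A quick way to get a feel for whether the text is readable is to count the words Acrobat can find on each page. This is only a rough sketch: `getPageNumWords` and `numPages` belong to the Acrobat Doc object, while `pagesWithoutText` is a hypothetical helper name. A page with zero words may be a scanned image, but it may also simply be blank.

```javascript
// Return the 0-based numbers of pages on which Acrobat finds no words.
// `doc` is an Acrobat Doc object (for example `this` in the console).
function pagesWithoutText(doc) {
  var empty = [];
  for (var p = 0; p < doc.numPages; p++) {
    if (doc.getPageNumWords(p) === 0) {
      empty.push(p);
    }
  }
  return empty;
}
```

Running `pagesWithoutText(this)` in the console and getting back an empty list suggests every page has extractable text.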

I've developed many similar tools for my clients in the past, so if you wish to send me some sample pages (to try6767 at gmail.com) I'll be happy to let you know if I think it's doable or not, and if so, for how much.

Explorer, Sep 16, 2017


Thanks. Unfortunately I don't have a budget for this work, so I figured it out myself. Here is the solution in case anyone else needs to do something similar. Obviously you will need to tweak the code for your scenario. I ran this in the JavaScript console (debugger), following the instructions on this site (e.g. select the code and press Ctrl+Enter): https://acrobatusers.com/tutorials/javascript_console

In short, this script does the following:

  1. For each page in the document, look for the word "Vacancy" at word number 8
  2. If that exists, check that the next word (9) is "ID". This means we've found the text "Vacancy ID"
  3. Extract the first name and surname from fixed positions on the same page
  4. Now continue through the document until we find the next instance of "Vacancy ID"
  5. Make a note of its page number (p2). This defines the page range to pass to the extractPages() function
  6. Finally, extract the last item, which has no following "Vacancy ID" page and so runs to the end of the document

I'm sure there are lots of better ways of doing it, but this works for me, it took about an hour, and I didn't have to pay anyone (sorry try67). Also, someone else might be able to use this for free in future. Let me know if you have any problems and I'll try to help. I've never used JavaScript before, but it doesn't seem to be too hard. Debugging in Acrobat, however, is AWFUL! Good luck.

var firstName = "";
var surName = "";
var finalpage = 0;
var count = 0;

// For each page in the document, check whether words 8 and 9 are "Vacancy ID"
for (var p = 0; p < this.numPages; p++) {
  if (this.getPageNthWord(p, 8) == "Vacancy" && this.getPageNthWord(p, 9) == "ID") {
    count++;
    // The name sits at fixed word positions on the divider page (surname first)
    firstName = this.getPageNthWord(p, 3);
    surName = this.getPageNthWord(p, 2);
    finalpage = p;
    // Find the page position of the next break point
    for (var p2 = p + 1; p2 < this.numPages; p2++) {
      if (this.getPageNthWord(p2, 8) == "Vacancy" && this.getPageNthWord(p2, 9) == "ID") {
        // Save this applicant's pages, stopping one page before the next divider
        this.extractPages({
          nStart: p,
          nEnd: p2 - 1,
          cPath: count + " " + firstName + " " + surName + ".pdf"
        });
        console.println("Extracted " + firstName + " " + surName + " pp " + p + " to " + (p2 - 1));
        break;
      }
    }
  }
}

// Save the final section, which has no following divider page.
// Skipped if no divider page was found at all.
if (count > 0) {
  this.extractPages({
    nStart: finalpage,
    nEnd: this.numPages - 1,
    cPath: count + " " + firstName + " " + surName + ".pdf"
  });
  console.println("Extracted " + firstName + " " + surName + " pp " + finalpage + " to " + (this.numPages - 1));
}
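For anyone adapting the script above, the range arithmetic can also be separated from the Acrobat calls, which makes it easier to check outside Acrobat. The following is only a sketch of that idea, not a replacement for the script: `computeRanges` is a hypothetical helper that takes the 0-based numbers of the divider pages, in ascending order, together with the page count.

```javascript
// Turn a sorted list of divider page numbers (0-based) into [start, end]
// ranges, each running from a divider page to the page before the next one.
// The last range runs to the end of the document.
function computeRanges(dividerPages, numPages) {
  var ranges = [];
  for (var i = 0; i < dividerPages.length; i++) {
    var start = dividerPages[i];
    var end = (i + 1 < dividerPages.length) ? dividerPages[i + 1] - 1 : numPages - 1;
    ranges.push([start, end]);
  }
  return ranges;
}
```

Each resulting pair could then be passed to extractPages() as nStart and nEnd. As in the script above, any pages before the first divider page are not extracted.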

New Here, Nov 18, 2019


Perhaps you can help me. I have a problem similar to the one you had.
I have a file that has several pages. At random intervals there are pages that contain the word "PageBreak".
I am looking for a script that will:
1. For each page in the document, look for the word "PageBreak"
2. Extract all pages before and including the page with the first instance of "PageBreak" into a new document.
3. Continue through the document until it finds the next instance of "PageBreak" and repeat step 2
I have attached a file. In that file, pages 1 and 2 would be extracted to a new file,
pages 3-5 would be extracted to another new file,
pages 6-7 would be a new file, and page 8 would be left over and should go in a new file by itself.
I understand you are not an expert; neither am I, and I could really use some guidance on getting this accomplished.
Any help would be greatly appreciated.
Thank you in advance.
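A rough sketch of how this variant might look is given here. It differs from the earlier script in that each new file ends on the marker page rather than starting on it. The Doc methods used (`getPageNumWords`, `getPageNthWord`, `extractPages`) are Acrobat JavaScript calls. The helper names and the "Part N.pdf" naming scheme are assumptions made for illustration, and the sketch also assumes Acrobat reports "PageBreak" as a single word.

```javascript
// Given the 0-based numbers of the marker pages (ascending), return
// [start, end] ranges that each END on a marker page.
function rangesEndingAtMarkers(markerPages, numPages) {
  var ranges = [];
  var start = 0;
  for (var i = 0; i < markerPages.length; i++) {
    ranges.push([start, markerPages[i]]);
    start = markerPages[i] + 1;
  }
  // Pages after the last marker form one final range.
  if (start < numPages) {
    ranges.push([start, numPages - 1]);
  }
  return ranges;
}

// True if any word on the page equals `word`.
function pageHasWord(doc, pageNum, word) {
  var n = doc.getPageNumWords(pageNum);
  for (var i = 0; i < n; i++) {
    if (doc.getPageNthWord(pageNum, i) === word) {
      return true;
    }
  }
  return false;
}

// Hypothetical driver: find the marker pages, then save one file per range.
function splitOnMarker(doc, word) {
  var markers = [];
  for (var p = 0; p < doc.numPages; p++) {
    if (pageHasWord(doc, p, word)) {
      markers.push(p);
    }
  }
  var ranges = rangesEndingAtMarkers(markers, doc.numPages);
  for (var r = 0; r < ranges.length; r++) {
    doc.extractPages({
      nStart: ranges[r][0],
      nEnd: ranges[r][1],
      cPath: "Part " + (r + 1) + ".pdf" // assumed naming scheme
    });
  }
  return ranges;
}
```

In the JavaScript console this would be run as `splitOnMarker(this, "PageBreak")`. For an 8-page file with the marker on pages 2, 5 and 7 (page numbers 1, 4 and 6 when counting from zero), the ranges work out as pages 1-2, 3-5, 6-7 and 8, which matches the split described in the post.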

Most Valuable Participant, Nov 18, 2019


This is possible, but not a simple project if you don't have any experience at all in writing Acrobat JavaScript code. The offer I made above still stands... I'm happy to develop this tool for you, for a small fee. My contact details are the same.
