Extract PDF Pages Based on Content multiple times

Jul 05, 2018

Hello. I'm a beginner in JavaScript, and I have Adobe Acrobat X Pro.

I want to search for a specific string within the PDF and use the sequence of numbers that comes after that string as my file name. Then I want to check whether the following pages have that exact same sequence, and if they do, extract all the pages with that sequence into one PDF. After extracting the pages for the first sequence, I want to keep looking for new sequences.

For example,

page 1 NO: 0158K

page 2 NO: 0158K

page 3 NO: 0158K

page 4 NO: 9090V

page 5 NO: 223M

page 6 NO: 223M

Using this example, pages 1, 2, and 3 would be extracted into one pdf together. Page 4 would be extracted by itself, and pages 5 and 6 would be extracted into one pdf.

I kind of have an idea of how to do this, but I'm not quite sure how to implement it, or combine some of the code that I found.

So far I think I have to put all the pages with the same number sequence into an array, and once all of the pages with that number sequence are located, extract them. I was thinking of using something similar to the code from this forum thread https://forums.adobe.com/message/7931552#7931552 with a few modifications, such as an if statement in the nested for loop to look for the number sequence.

So far this is what I have...

var pageArray=[];
for (var p = 0; p < this.numPages; p++) {
    for (var n = 0; n < this.getPageNumWords(p); n++) {
        if (this.getPageNthWord(p, n) == "PPNO") {
            dataCode = this.getPageNthWord(p, n+1)
            pageArray.push(p);
            break;
        }
    }
    for (var p2 = p+1; p2 < this.numPages; p2++) {
        for (var n2 = 0; n2 < this.getPageNumWords(p2); n2++) {
            if (this.getPageNthWord(p2, n2) == "PPNO") {
                if (this.getPageNthWord(p2, n2+1) == dataCode) {
                    repeat++;
                    break;
                }
                else {
                    if (pageArray.length > 0) {
                        var d = app.newDoc();
                        for (var x = 0; x < pageArray.length; x++) {
                            d.insertPages( {
                                nPage: d.numPages-1,
                                cPath: dataCode + ".pdf",
                                nStart: pageArray[x],
                                nEnd: pageArray[x],
                            } );
                        }
                        d.deletePages(0);
                    }
                    break;
                }
            }
        }
    }
}

but after I ran this code, all I got was a new pdf with a blank page.

Jul 05, 2018

Hi,

I haven't had a chance to test your code properly, but looking at it there are a couple of things that don't look right. I will list them so you can see if you agree, make the changes, and then we can see where we stand.

1. In the document you have the text "NO", but in the code you compare against "PPNO". I'm guessing that is just a typo from when you wrote the forum post, but I thought I should mention it.

2. When you go to add the pages you use the following code

for (var x=0; x<pageArray.length; x++){
    d.insertPages(  {
    nPage:d.numPages-1,
    cPath: dataCode + ".pdf",
    nStart: pageArray[x],
    nEnd:pageArray[x],
    });
}

There are a couple of issues. cPath is set to dataCode + ".pdf", but cPath should be the device-independent path of the file you want to get the pages from, not the file you are placing the pages into, so it should be the full path to the original file.

Also, you are passing the same page as both nStart and nEnd. This is not necessary: if you want just one page, pass only nStart and that will be the only page included.
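Putting those two points together, the argument object for insertPages could be built as in the sketch below. buildInsertArgs is a hypothetical helper (not part of the Acrobat API), the path is only an example, and fakeNewDoc is a stand-in so the logic can be tried outside Acrobat. In Acrobat the result would be passed on as d.insertPages(buildInsertArgs(d, this.path, pageArray[x])).

```javascript
// Hypothetical helper: build the argument object for Doc.insertPages.
// targetDoc  - the document receiving the pages (only numPages is read)
// sourcePath - device-independent path of the file the pages come FROM
// firstPage, lastPage - 0-based page range in the source file
function buildInsertArgs(targetDoc, sourcePath, firstPage, lastPage) {
    var args = {
        nPage: targetDoc.numPages - 1, // insert after the current last page
        cPath: sourcePath,
        nStart: firstPage
    };
    // nEnd is only needed for a multi-page range; with nStart alone,
    // insertPages takes just that one page.
    if (typeof lastPage === "number" && lastPage !== firstPage) {
        args.nEnd = lastPage;
    }
    return args;
}

// Stand-in for a freshly created document (app.newDoc() yields one blank page).
var fakeNewDoc = { numPages: 1 };
var single = buildInsertArgs(fakeNewDoc, "/c/docs/source.pdf", 3);
var range = buildInsertArgs(fakeNewDoc, "/c/docs/source.pdf", 0, 2);
```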

Hope this helps

Malcolm

Jul 06, 2018

Hello. Yes, the "NO" is a typo that I made when I wrote the forum post.

As for the second part, should that section of code end up being like this then?

for (var x=0; x<pageArray.length; x++){
    d.insertPages( {
        nPage:d.numPages-1,
        cPath: this.path,
        nStart: pageArray[x],
    } );
}

However, I ran the code with this corrected portion, and I didn't get a new pdf at all.

Jul 06, 2018

You mean after using

var d = app.newDoc();

you can't see the new document?

Jul 06, 2018

Yes, I just replaced that portion of the code from my original code. And no there was no new document.

Jul 06, 2018

It looks like app.newDoc() will never be called.

Jul 07, 2018

Hi,

Using the following code I am able to get 2 documents created.

PPNO: 0158K

PPNO: 9090V

are both created as separate files.

// Using the active document (I only have one document open, which made testing easier)
var curDoc = app.activeDocs[0];
var pageArray=[];
var repeat = 0;
var dataCode = "";
for (var p = 0; p < curDoc.numPages; p++)
{
    for(var n = 0; n< curDoc.getPageNumWords(p); n++)
    {
       if(curDoc.getPageNthWord(p,n)=="PPNO")
       {
            dataCode=curDoc.getPageNthWord(p,n+1) ;
            pageArray.push(p);
            break;
       }
    }
    for (var p2=p+1; p2 < curDoc.numPages; p2++)
    {
        for (var n2=0; n2<curDoc.getPageNumWords(p2); n2++)
        {
            if(curDoc.getPageNthWord(p2, n2)=="PPNO")
            {
                // This if is why we only get two files as a result,
                // because we can only get to the else if we don't match, but for the last
                // number in the document we will never have a page that doesn't match
                if(curDoc.getPageNthWord(p2, n2+1)==dataCode)
                {
                    repeat++;
                    break;
                }
                else
                {
                    if (pageArray.length > 0)
                    {
                        var d = app.newDoc();
                        for (var x=0; x<pageArray.length; x++)
                        {
                          d.insertPages(
                          {
                            nPage:d.numPages-1,
                            // changed to use the curDoc
                            cPath: curDoc.path,
                            // only nStart, as we are importing 1 page at a time.
                            nStart: pageArray[x],
                          });
                        }
                        d.deletePages(0);
                    }
                    // reset so we get only the new pages.
                    pageArray = [];
                }
            }
        }
        // only the page directly after p needs to be checked
        break;
    }
}

There are a couple of changes to the code. The main ones were the changes I mentioned; the other is to reset the page array, so that the pages found on the first run through the loop are not included again on the second.
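The inner word scan can also be pulled into a small helper so that each page is read only once. This is a sketch, assuming the code is always the word immediately after the marker. findCodeOnPage is a hypothetical name, and the mock object below only stands in for an Acrobat Doc so the logic can be tried outside Acrobat.

```javascript
// Hypothetical helper: return the word following `marker` on a page,
// or null if the marker is absent or is the last word on the page.
function findCodeOnPage(doc, pageNum, marker) {
    var wordCount = doc.getPageNumWords(pageNum);
    // Stop one word early so that n + 1 is always a valid index.
    for (var n = 0; n < wordCount - 1; n++) {
        if (doc.getPageNthWord(pageNum, n) === marker) {
            return doc.getPageNthWord(pageNum, n + 1);
        }
    }
    return null;
}

// Mock with the two Doc methods used above (not the real Acrobat object).
var mockDoc = {
    pages: [["Invoice", "PPNO", "0158K"], ["PPNO", "9090V"], ["no", "code", "here"]],
    getPageNumWords: function (p) { return this.pages[p].length; },
    getPageNthWord: function (p, n) { return this.pages[p][n]; }
};

var codes = [];
for (var p = 0; p < mockDoc.pages.length; p++) {
    codes.push(findCodeOnPage(mockDoc, p, "PPNO"));
}
// codes is now ["0158K", "9090V", null]
```

Returning null for a page without the marker leaves it to the caller to decide whether such a page starts a new group or stays with the previous one.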

Hope this helps.

Malcolm

Jul 10, 2018

Thanks, the code helped a lot. The only thing I'm still having a problem with is that the new PDFs aren't saved; they only show up as temporary documents, which is why their names have "temp" at the end. Also, I can't figure out how to customize the names of the new PDFs. The main reason I put cPath: dataCode + ".pdf" in my original post was that I wanted each new PDF to be named after its dataCode, so the new PDFs would be named 0158K and 9090V.

Jul 10, 2018

Hi,

You can just call

d.saveAs ( "/path/to/save/location/" + dataCode + ".pdf");

just after the d.deletePages(0); line
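The save location can also be derived from the source file, so the new files land next to it. A sketch with two hypothetical helpers (folderOf and buildSavePath); replacing unusual characters in the code is an assumption made here to keep the file name safe, not something Acrobat requires. Note that saveAs generally has to run from a privileged context such as the console or a batch sequence.

```javascript
// Hypothetical helper: strip the file name from a device-independent path,
// keeping the trailing slash of the folder.
function folderOf(path) {
    return path.replace(/[^\/]+$/, "");
}

// Hypothetical helper: build "<folder>/<code>.pdf".
// Characters other than letters, digits, "-" and "_" are replaced with "_"
// (an assumption, to keep the file name safe).
function buildSavePath(folder, dataCode) {
    var safeName = String(dataCode).replace(/[^A-Za-z0-9_\-]/g, "_");
    var sep = folder.charAt(folder.length - 1) === "/" ? "" : "/";
    return folder + sep + safeName + ".pdf";
}

// Example path only; in Acrobat this would be curDoc.path.
var target = buildSavePath(folderOf("/c/docs/source.pdf"), "0158K");
// In Acrobat: d.saveAs(target);
```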

Hope this helps

Malcolm

Jul 11, 2018

Okay thank you. And sorry I have one last question. I realized that this code would skip over the last few pages of a pdf if all of the dataCodes matched. I tried to add something at the end of the code...

curDoc.extractPages({
    nStart: finalpage,
    nEnd: curDoc.numPages - 1,
    cPath: dataCode + ".pdf"
});

in order to account for the last few pages. (I declared finalpage as a variable and set it equal to p during the first nested loop.) This is similar to the code from the thread "Split large pdf on repeated text pattern, and save new pdf with custom filename". However, I don't think that part of my code is even executed, because other than the new PDFs that were already being created, nothing else is produced.

Jul 12, 2018

Hi,

I have refactored the code a little to solve the problem, based on the sample document. Comments are in the code so you can see what I have done. As always, if you have any questions, just ask.

var curDoc = app.activeDocs[0];
var pageArray=[];
var dataCode = "";
// This part gets the code from each page of the document as before.
// It assumes every page contains "PPNO", so that pageArray[i] is the code on page i.
for (var p = 0; p < curDoc.numPages; p++)
{
    for(var n = 0; n< curDoc.getPageNumWords(p); n++)
    {
       if(curDoc.getPageNthWord(p,n)=="PPNO")
       {
            dataCode=curDoc.getPageNthWord(p,n+1) ;
            pageArray.push(dataCode);
            break;
       }
    }
}
// startPage has to be read after pageArray has been filled
var startPage = pageArray[0];
var startPageNumber = 0;
// This bit has been refactored to stop the need to go through all the pages again
// it also uses the ability of insertPages to insert more than one page at a time.
// The loop runs one step past the last page so that the final group is exported too.
for ( var i = 1; i <= pageArray.length; i++)
{
    // if we have a match, AND we are not past the last page, keep going
    if (( i < pageArray.length) && ( startPage === pageArray[i]))
    {
        continue;
    }
    // otherwise the current group ended on the previous page
    var endPageNumber = i - 1;
    var d = app.newDoc();
    // call insert pages once with the page range to insert.
    d.insertPages (
    {
        nPage: d.numPages -1,
        cPath: curDoc.path,
        nStart: startPageNumber,
        nEnd : endPageNumber
    });
    // remove initial page
    d.deletePages(0);
    // set up for the next run
    startPage = pageArray[i];
    startPageNumber = i;
}
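The run detection can also be kept separate from the Acrobat calls, which makes it easy to check outside Acrobat. This is a sketch under the same assumption that every page yields a code. groupConsecutive and exportGroups are hypothetical helpers, the output folder is only an example, and the mock object records calls instead of writing files. Calling extractPages with a cPath writes each range straight to a file named after its code, which needs a privileged context such as the console or a batch sequence.

```javascript
// Hypothetical helper: split per-page codes into runs of equal codes.
// codes[i] is assumed to be the code found on page i (0-based).
function groupConsecutive(codes) {
    var groups = [];
    var start = 0;
    // Loop one step past the end so the final run is emitted as well.
    for (var i = 1; i <= codes.length; i++) {
        if (i < codes.length && codes[i] === codes[start]) {
            continue; // still inside the current run
        }
        groups.push({ code: codes[start], nStart: start, nEnd: i - 1 });
        start = i;
    }
    return groups;
}

// Hypothetical helper: write each group to "<folder><code>.pdf".
function exportGroups(doc, groups, folder) {
    for (var g = 0; g < groups.length; g++) {
        doc.extractPages({
            nStart: groups[g].nStart,
            nEnd: groups[g].nEnd,
            cPath: folder + groups[g].code + ".pdf"
        });
    }
}

// Mock Doc that records the calls (not the real Acrobat object).
var calls = [];
var mockDoc = { extractPages: function (args) { calls.push(args); } };

var groups = groupConsecutive(["0158K", "0158K", "0158K", "9090V", "223M", "223M"]);
exportGroups(mockDoc, groups, "/c/output/");
// groups: pages 0-2 (0158K), page 3 (9090V), pages 4-5 (223M)
```

If the same code can reappear later in the document after a different code, the second file would overwrite the first, so the file name would need a suffix in that case.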

Hope this helps

Malcolm
