Does anyone know how to get the page count of a multi-page PDF file? I know you can open each page using PDFFileOptions (pageToOpen), but I need to know up front how many pages there are so I can loop through them and process each page.
Any ideas?
I have the same exact question. I need to know how many pages are in a multi-page PDF in order to load them and merge some images into them.
So far, I've tried using the interapplication messaging objects (i.e. BridgeTalk) and Adobe Bridge to extract the information, with no success.
Anyone know any way to get this information?
Only way I've found is to open the file in a text editor and look for /Count. The highest number is the total pages.
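The manual search described above could be scripted roughly like this. It is only a sketch: `largestCount` is a made-up helper name, it assumes the file content is already available as a plain text string, and it assumes the page tree is stored uncompressed. Outline (bookmark) dictionaries also carry a /Count entry, so the result is a best guess rather than a guaranteed page count.

```javascript
// Returns the largest /Count value found in the raw PDF text, or -1 if there is none.
// Hypothetical helper; pdfText is assumed to be the file content read as a string.
function largestCount(pdfText) {
    var pattern = /\/Count\s*(\d+)/g;
    var best = -1;
    var match;
    // exec() with the g flag walks through every occurrence in turn
    while ((match = pattern.exec(pdfText)) !== null) {
        var value = parseInt(match[1], 10);
        if (value > best) {
            best = value;
        }
    }
    return best;
}
```

Negative /Count values (which outline dictionaries may use) are skipped because the pattern only accepts digits.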
I know this is an old post, but I have found a way of getting the page count of a PDF in Windows and Mac using Bridge.
I was using Photoshop and BridgeTalk but the code should still work with Illustrator and BridgeTalk.
Code is here..
A solution that you can include in your script if you have Acrobat exchan.
Set AcroExchApp = CreateObject("AcroExch.App")
Set AcroExchPDDoc = CreateObject("AcroExch.PDDoc")
AcroExchPDDoc.Open ("c:\tmp\myPDF.pdf")
NbPages = AcroExchPDDoc.GetNumPages
AcroExchPDDoc.Close
Patrice
I wrote this script for Photoshop.
OK, so let's get it working on Windows and Mac without any shell script.
This solution is not the best, but it works.
I used try/catch to solve it.
The only catch is that you need to know roughly how many pages to expect in the PDF, for example if you are working with books or flyers.
Then set maxPagesCount to that number.
I work with PDFs of at most 20 pages, so I start by trying to open page 20, then 19, then 18, and as soon as one opens successfully I know the number of pages.
It's not the quickest way, but it's usable.
var maxPagesCount = 20; // upper bound on the number of pages you expect
var opts1 = new PDFOpenOptions();
opts1.usePageNumber = true;
opts1.antiAlias = true;
opts1.bitsPerChannel = BitsPerChannelType.EIGHT;
opts1.resolution = 10; // low resolution so the test page loads faster
opts1.suppressWarnings = true;
opts1.cropPage = CropToType.MEDIABOX;
// Try pages from lastPageID down to 1; the first page that opens is the last page.
function getPagesCount(checkFile, lastPageID) {
    for (var checkPage = lastPageID; checkPage > 0; checkPage--) {
        try {
            opts1.page = checkPage;
            var docRef = open(checkFile, opts1, false);
            docRef.close(SaveOptions.DONOTSAVECHANGES);
            return checkPage;
        } catch (exception) {
            // This page does not exist; try the previous one
        }
    }
    return 0; // no page could be opened
}
function myFunction() {
    var savedDialogs = app.displayDialogs;
    try {
        app.displayDialogs = DialogModes.NO;
        var fileList = openDialog();
        for (var i = 0; i < fileList.length; i++) {
            // Pass one file at a time, not the whole list
            alert(getPagesCount(fileList[i], maxPagesCount));
        }
    } catch (exception) {
        alert(exception);
    } finally {
        app.displayDialogs = savedDialogs; // restore the user's dialog setting
    }
}
myFunction();
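If opening up to 20 test pages is too slow, the same probing idea could be combined with a binary search, since page n opens exactly when n is no larger than the page count. This is a swapped-in technique, not what the script above does. `countPagesByProbing` and the `canOpenPage` callback are made-up names; in Photoshop the callback would wrap the open/close calls in try/catch and return true or false.

```javascript
// Binary search for the highest page number that can be opened.
// canOpenPage(n) is a hypothetical callback: true if page n opens, false otherwise.
// maxPagesCount is the same "expected maximum" as in the script above.
function countPagesByProbing(canOpenPage, maxPagesCount) {
    var low = 0;              // highest page number known to exist (0 = none yet)
    var high = maxPagesCount; // highest page number that might still exist
    while (low < high) {
        var mid = Math.ceil((low + high) / 2);
        if (canOpenPage(mid)) {
            low = mid;        // page mid exists, so the count is at least mid
        } else {
            high = mid - 1;   // page mid is missing, so the count is below mid
        }
    }
    return low;
}
```

With a maximum of 20 this needs about five test opens instead of up to twenty. As with the original approach, a PDF with more pages than maxPagesCount is reported as having maxPagesCount pages.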
So I came up with the following, but it does not work on all PDFs, at least not on ones that have been saved through Acrobat.
It reads the file as text and finds the place where the page count is usually listed (when it is listed there at all).
If it finds no digits it shows an alert, and the same if it finds more than 3 digits (over 999 pages), because in my work we really don't have anything with that many pages.
So far it's been working OK and has saved me some time! However, one thing I learned in the process is that if you store the file as a text string in a variable and do not reset the variable to empty, the ExtendScript Toolkit becomes unresponsive and has to be force-quit.
var fileToOpen = File.openDialog("Pick a multi-page PDF","*pdf");
if(fileToOpen){
fileToOpen = new File(fileToOpen.fsName.replace("file://","")); //OS Lion fix
var userAlerts = app.userInteractionLevel = UserInteractionLevel.DONTDISPLAYALERTS;
}
if(fileToOpen){
fileToOpen.open("r");
fileToOpen.seek(0,0);
var fileStr = fileToOpen.read();
var searchIndex1 = fileStr.search("/Kids");
var searchIndex2 = fileStr.search("<</Count");
var pageCountStr = fileStr.substring (searchIndex1, searchIndex2);
fileStr = "";
fileToOpen.close();
var pageCount = [];
for(var i=0; i<pageCountStr.length; i++){
if(/\d/.test(pageCountStr.charAt(i))){ // keep digits only; the isNaN() check also lets line breaks and tabs through
pageCount.push(pageCountStr.charAt(i));
}
}
if(pageCount.length < 1 || pageCount.length > 3){ // If you extracted pages out of a PDF using Acrobat, it won't have "<</Count" or "/Kids"
alert("Oops, either something went very wrong or the file you have chosen is invalid in some way as "+
"the page count number is "+pageCount.length+" number places! ...Stopped.");
} else {
pageCount = parseInt(pageCount.join(""), 10); // join("") leaves no commas; replace(",","") removed only the first one, so 3-digit counts were cut short
alert("Your PDF has "+pageCount+" pages.");
}
}
Very good.
"Kids" *may* appear more than once, though, and then the objects it points to may be "Kids" arrays of their own, each with its own internal "Count". Perhaps you could use the largest value of "Count" in the entire file (a drawback is that it would need to scan the entire file, I can't see at first glance if you are doing that anyway).
In case the in-depth details interest you: the proper way to read the entire structure is to parse the trailer. This will direct you to the first "Pages" object, the very first occurrence of "Kids". Theoretically, the Count value in it should be the entire document count, but I've found a few non-compliant files where this seemed not to be true. (Probably not created with Adobe software...)
To locate each object in the entire file you need to parse the xref table.
>... it does not work on all PDFs- at least not with ones which have been saved through Acrobat.
Yah ... too bad. Modern versions of Acrobat compress this all-important xref table, and even worse, it may also compress interesting sections of the file that *used to* be plain ASCII. To be able to glean useful information out of these files, you need to have at least Deflate code (Adobe's default compression), and LZW for some other PDF creating programs (Microsoft's, it seems).
xref and data compression are new in PDF 1.5 (or thereabouts), but you *cannot* rely on the version number in the file header! It's just a guide, and the actual content *may* (and does) differ. But you can always open your PDF with Acrobat Pro and re-save it as PDF 1.4 (again, from memory!), or save as "Optimized PDF" and deselect "compress data stream and document structure".
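Since the header version cannot be trusted, a script could instead look for the markers that PDF 1.5 style compression leaves behind: cross-reference streams are dictionaries with /Type /XRef, and object streams have /Type /ObjStm. A rough sketch, with `looksCompressed` as a made-up name and the file content assumed to be available as a text string:

```javascript
// Returns true if the raw PDF text contains a cross-reference stream or an
// object stream, which suggests that plain-text searches for /Count may fail.
// Hypothetical helper; this is a hint only, not a full structural check.
function looksCompressed(pdfText) {
    return /\/Type\s*\/(XRef|ObjStm)\b/.test(pdfText);
}
```

A true result only means that some of the structure may be hidden inside compressed streams, so a text-based page count should not be trusted; the script could then ask the user to re-save the file as described above.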
Recently I wrote a PDF reading code library, complete with Deflate and LZW code. Internally, it encodes all PDF objects into Javascript objects, and at that point getting a number of pages is *extremely* easy. Unfortunately, decompressing in Javascript is very slow at best, because Javascript cannot easily work with "raw data bytes". At worst, the compressed data may be *so* large -- more than a megabyte -- that Javascript simply cannot work with it *at all*, and crashes.
My hope lies in Adobe adding Byte Arrays to its variant of Javascript, because that just may be enough to make my library work at an appreciable speed. It would be the least they could do, after making life so very hard by allowing compression in their originally so elegant and eminently understandable and processable PDF format ...
Yes, the details do interest me, and all I can say is: wow, that's all very fascinating and mostly beyond me! Thanks!
-- Yes, it is reading the entire file, but I have found, at least with the PDFs I deal with, that the first instance of "<</Count" and "/Kids" contains the real PDF page count. So as long as no other important information is listed in the same syntax before that, it's working OK thus far, because the string search returns the index of the first occurrence.
What about something like this:
function getRootObjectReference(str) {
return str.match(/trailer\s+<{2}(.|\s)*?\/Root\s+(\d+)(.|\s)*?>{2}/)[2];
}
function getPagesObjectReference(str, ref) {
var pattern = new RegExp(ref + '\\s+\\d+\\s+obj\\s+<{2}(.|\\s)*?\\/Pages\\s+(\\d+)(.|\\s)*?>{2}');
return str.match(pattern)[2];
}
function getPagesCount(str, ref) {
var pattern = new RegExp(ref + '\\s+\\d+\\s+obj\\s+<{2}(.|\\s)*?\\/Count\\s+(\\d+)(.|\\s)*?>{2}');
return str.match(pattern)[2];
}
var source = File.openDialog();
source.open('r');
source.seek(0, 0);
var content = source.read();
var rootRef = getRootObjectReference(content);
var pagesRef = getPagesObjectReference(content, rootRef);
var pageCount = getPagesCount(content, pagesRef);
content = ''; // release the file text (no second var declaration needed)
source.close();
alert('Your PDF has ' + pageCount + ' pages');
Some PDF files are linearized, and in those you can read the page count from the Linearized dictionary.
Here is an example of the document structure:
%PDF-1.6
%âãÏÓ
3106 0 obj
<</Linearized 1/L 833417/O 3109/E 47747/N 65/T 771175/H [ 3756 1082]>>
endobj
xref
3106 173
0000000016 00000 n
0000004838 00000 n
0000005018 00000 n
0000005078 00000 n
Here /N 65 is the page count.
We can use a regular expression like the one below:
/<<\/Linearized\s.+\/N\s(\d+)\/T\s.+>>/
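A possible way to apply this in a script is sketched below. `linearizedPageCount` is a made-up name, and the pattern is a slightly relaxed variant of the one above: it does not assume that /T directly follows /N, since the order of keys in a dictionary is not guaranteed. It still assumes the file content is available as a text string and that the file was not modified after it was linearized.

```javascript
// Returns the /N value of the Linearized dictionary, or -1 if the text has
// no such dictionary. Hypothetical helper for illustration.
function linearizedPageCount(pdfText) {
    // Grab the whole dictionary first; it contains no nested ">" characters
    var dict = /<<\s*\/Linearized\b[^>]*>>/.exec(pdfText);
    if (dict === null) {
        return -1;
    }
    var n = /\/N\s*(\d+)/.exec(dict[0]);
    return n === null ? -1 : parseInt(n[1], 10);
}
```

A result of -1 simply means the file is not linearized, in which case one of the other approaches in this thread is needed.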
Thank you for sharing!