Participant
May 12, 2025
Question

Sort pages of an OCR scan pdf, by 2 fields that are the bottom of each page


I would like to sort the pages of a scanned PDF file by two text fields that appear at the bottom of each page ("Page x of x" and "Print Date: xx/xx/xxxx").

 

1 reply

Thom Parker
Community Expert
May 12, 2025

As long as the OCR is clean, a script can be written to acquire the page text using the "doc.getPageNthWord()" function. The words on each page should be concatenated into one large string, which can then be searched using regular expressions for the key elements needed for sorting. I would suggest creating an array of the search elements, with one entry for each page. Each entry could be an object containing the original page number and the search elements. The array can then be sorted, giving the new locations to which the pages are to be moved.
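The steps above could be sketched roughly as follows. This is only an illustration, not a tested production script: the function names (getPageText, parseFooter, compareEntries, sortPagesByFooter) are made up for this example, the regular expressions assume the footer text survives OCR exactly as "Page x of x" and "Print Date: MM/DD/YYYY", and the sort order (print date first, then page number) is an assumption that should be adjusted to the real grouping rule. The Acrobat calls used are doc.getPageNumWords(), doc.getPageNthWord(), and doc.movePage().

```javascript
// Sketch only: function names, regexes, and sort order are assumptions.

// Join every word on one page into a single searchable string.
// bStrip = false keeps punctuation such as ":" and "/".
function getPageText(doc, pageIndex) {
  var words = [];
  var count = doc.getPageNumWords(pageIndex);
  for (var i = 0; i < count; i++) {
    words.push(doc.getPageNthWord(pageIndex, i, false));
  }
  return words.join(" ");
}

// Pull the two footer fields out of the page text.
// Assumes the date is MM/DD/YYYY; pages with no match sort last.
function parseFooter(text, pageIndex) {
  var pageMatch = /Page\s*(\d+)\s*of\s*(\d+)/i.exec(text);
  var dateMatch =
    /Print\s*Date\s*:?\s*(\d{1,2})\s*\/\s*(\d{1,2})\s*\/\s*(\d{4})/i.exec(text);
  return {
    original: pageIndex,
    pageNum: pageMatch ? Number(pageMatch[1]) : Number.MAX_VALUE,
    dateKey: dateMatch
      ? Number(dateMatch[3]) * 10000 + Number(dateMatch[1]) * 100 + Number(dateMatch[2])
      : Number.MAX_VALUE
  };
}

// Assumed order: print date, then page number, then original position.
function compareEntries(a, b) {
  if (a.dateKey !== b.dateKey) return a.dateKey < b.dateKey ? -1 : 1;
  if (a.pageNum !== b.pageNum) return a.pageNum < b.pageNum ? -1 : 1;
  return a.original - b.original;
}

function sortPagesByFooter(doc) {
  var entries = [];
  for (var p = 0; p < doc.numPages; p++) {
    entries.push(parseFooter(getPageText(doc, p), p));
  }
  entries.sort(compareEntries);

  // current[i] holds the original index of the page now at position i.
  var current = entries.map(function (e, i) { return i; });
  for (var target = 0; target < entries.length; target++) {
    var from = current.indexOf(entries[target].original);
    if (from !== target) {
      // Move the page to just after position target - 1 (-1 means the front).
      doc.movePage(from, target - 1);
      var moved = current.splice(from, 1)[0];
      current.splice(target, 0, moved);
    }
  }
  return entries;
}
```

From the Acrobat JavaScript console this would be run as sortPagesByFooter(this), ideally on a copy of the file. Entries whose footer could not be read end up at the back, which makes OCR failures easy to spot.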

 

I would assume the "# of #" text is used to group pages, whereas the date is the real sorting element. Is this correct?

 

This is not a trivial operation. I've done this sort of thing many times. 

 

Thom Parker - Software Developer at PDFScripting
Use the Acrobat JavaScript Reference early and often