Participant
March 14, 2017
Question

Removing duplicate pages in PDFs?


I often need to convert emails to PDFs in my job, because some emails contain information that needs to be redacted, such as student information. I work for a university. When the requested emails are pulled from several different accounts, the same email often appears in each account. Is there a way to remove the duplicates before I start to review and redact, so I don't have to keep making the same redactions over and over?

Thanks for any help.


2 replies

Karl Heinz Kremer
Community Expert
March 15, 2017

This might be possible. Here is how I would approach this:

You cannot get access to the actual PDF content on a page; all you can do is iterate over the "words" on a page, and what Acrobat considers a word may not match your interpretation in all cases. You could create a "checksum" for every page in your document and then look for pages that produce the same checksum. Depending on how you create this checksum, two different pages might still produce the same value, so you may have to compare those pages word by word to make sure you are dealing with an exact duplicate. You would then mark each duplicate page as one that needs to be deleted and, in a final step, delete the marked pages working from the end of the document toward the front, so that the page numbers of the pages still waiting to be deleted do not shift.
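
The steps above could be sketched roughly as follows in Acrobat JavaScript. This is only an illustration under some assumptions, not a tested production script: the function names (pageKey, findDuplicatePages, removeDuplicatePages) are made up for this example, and instead of a short checksum it uses the full sequence of words as the key for each page, which trades memory for not needing a second word-by-word comparison. Pages on which Acrobat finds no words (for example scanned images) are skipped, since they cannot be compared this way. Only the Doc members numPages, getPageNumWords, getPageNthWord and deletePages are taken from the Acrobat JavaScript API.

```javascript
// Build a comparison key for one page from the words Acrobat reports on it.
// Returns null when the page has no extractable words.
function pageKey(doc, pageNum) {
  var count = doc.getPageNumWords(pageNum);
  if (count === 0) {
    return null;
  }
  var words = [];
  for (var i = 0; i < count; i++) {
    // The third argument (true) asks Acrobat to strip punctuation and whitespace.
    words.push(doc.getPageNthWord(pageNum, i, true));
  }
  return words.join("\n");
}

// Return the zero-based numbers of pages whose words match an earlier page.
function findDuplicatePages(doc) {
  var seen = {};
  var duplicates = [];
  for (var p = 0; p < doc.numPages; p++) {
    var key = pageKey(doc, p);
    if (key === null) {
      continue; // nothing to compare on this page
    }
    key = "k:" + key; // prefix avoids clashes with built-in property names
    if (Object.prototype.hasOwnProperty.call(seen, key)) {
      duplicates.push(p);
    } else {
      seen[key] = p;
    }
  }
  return duplicates;
}

// Delete the duplicates starting from the end of the document, so the
// page numbers of the remaining duplicates stay valid.
function removeDuplicatePages(doc) {
  var duplicates = findDuplicatePages(doc);
  for (var i = duplicates.length - 1; i >= 0; i--) {
    doc.deletePages({ nStart: duplicates[i], nEnd: duplicates[i] });
  }
  return duplicates.length;
}
```

From the JavaScript console you would call removeDuplicatePages(this) on the open document. Because this compares extracted words only, two pages with the same text but different images or layout would be treated as duplicates, so it is worth running findDuplicatePages(this) first and reviewing the reported page numbers on a copy of the file.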

If you need help with any of these steps, that's what I do for a living.

Inspiring
March 15, 2017

Moving your discussion to see if anyone in the JavaScript area knows of a way to do this.