Comparing two PDFs to find missing or extra pages
If you are not an expert in Acrobat Pro DC, stop reading at this point. I have been chatting online for 3 hours to no avail, but I have what I think should be a common problem. Please read carefully before I get to the questions.
Background:
I have used cloudHQ to convert Gmail labels to PDFs. There are many options, so I will focus on two.
Method 1. I convert all the emails in a label to a "single, combined" PDF. Let's say it is 580 pages long and originally consisted of 316 emails. I know it is 316 because (a) Gmail says the label had 316 emails, and (b) if you search for a unique header text like "Date Received:" you will find 316 occurrences in the PDF.
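For anyone who ends up scripting the header check in (b), here is a minimal sketch. It assumes the text of each page has already been extracted into a list of strings (for example with a library such as pypdf, via `PdfReader(path).pages` and `page.extract_text()`; that step is not shown and is my assumption, not something cloudHQ or Acrobat provides). The function name `find_email_starts` is made up for illustration.

```python
def find_email_starts(page_texts, header="Date Received:"):
    """Return the 1-based page number of every occurrence of the header.

    page_texts is a list with one string per PDF page (assumed to be
    extracted beforehand). A page that holds two short emails is listed
    twice, so len(result) equals the number of emails found.
    """
    starts = []
    for page_number, text in enumerate(page_texts, start=1):
        starts.extend([page_number] * text.count(header))
    return starts


# Tiny made-up example: two emails spread over three pages.
pages = [
    "Date Received: Mon 1 Jan\nHello",
    "continued...",
    "Date Received: Tue 2 Jan\nHi",
]
starts = find_email_starts(pages)
print(len(starts), starts)  # 2 [1, 3]
```

Run on the Method 1 PDF, the length of the returned list should match the 316 that Gmail reports, and the list itself gives the page on which each email begins.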
Method 2. I convert all the emails in a label to individual PDFs. And here is where the problem starts: I only end up with 310 PDFs, and if I merge them into a new single combined PDF, I get 539 pages.
So clearly the application is screwy, and I can't fix that. My problem is that I must find the 16 extra emails in the Method 1 PDF, which account for 41 extra pages compared to Method 2.
I don't really have time to write a script, so I was hoping Acrobat might have a workaround.
On the surface, and only on the surface, the PDFs look identical. If I search, I can eventually find the 16 emails, but I need to automate this process for some 50,000 emails.
When I use the Compare tool, it does not work. It ends up highlighting all sorts of things that are not perceptible to the eye. That is not a surprise: the two methods create PDFs that "look the same", but I presume there are slight spacing differences and so on. Acrobat picks up all of this, which makes the Compare tool not viable.
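One way around the spacing noise, if a script turns out to be unavoidable, is to compare the emails by their text alone instead of by layout. The sketch below is only an illustration under assumptions: each email's text has already been extracted into a string (one list for the Method 1 emails, one for the Method 2 files), and both methods produce the same characters apart from whitespace and letter case. The names `fingerprint` and `extra_emails` are hypothetical.

```python
import hashlib
import re
from collections import Counter


def fingerprint(text):
    """Hash the text with all whitespace removed and case folded,
    so that spacing differences between the two PDFs are ignored."""
    normalized = re.sub(r"\s+", "", text).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def extra_emails(combined_emails, individual_emails):
    """Return indexes of emails in combined_emails that have no
    matching text in individual_emails.

    A Counter is used instead of a set so that two genuinely identical
    emails are matched one-for-one.
    """
    remaining = Counter(fingerprint(t) for t in individual_emails)
    extras = []
    for index, text in enumerate(combined_emails):
        key = fingerprint(text)
        if remaining[key] > 0:
            remaining[key] -= 1
        else:
            extras.append(index)
    return extras


# Made-up example: the second email is missing from the individual files.
combined = [
    "Date Received: 1 Jan\nHello   world",
    "Date Received: 2 Jan\nSecond email",
]
individual = ["Date Received: 1 Jan\nHello world"]
print(extra_emails(combined, individual))  # [1]
```

If the two methods differ in more than spacing (for example different header wording or page footers), the normalization step would need to strip those as well; hashing only the first few header lines of each email is one possible variation.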
Suggestions?
1) Convert the PDFs to image PDFs and then use the OCR tool within Acrobat to create new PDFs?
2) Something else? I tried flattening, but no cigar.
3) Is there a way to tell Acrobat, in batch, to print pages 14-16, 52-52, 106-122, etc. to individual PDFs? That would mean that if I search by headers, I can write down the page numbers and then print those ranges to the 16 email PDFs. Not great, but a bit better.
4) Or I can determine the page number where each email begins, and then batch print the Method 1 PDF to individual PDFs, ending up with 316 PDFs. Maybe I can then put these in a folder and find the 16 emails...
5) Can Acrobat split the PDF on a query term to find the first page of each email?
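To make options 3) to 5) concrete: once the page on which each email starts is known, the page ranges to extract follow mechanically. This is a rough sketch, not an Acrobat feature. It assumes every email begins on a new page and that the start pages are already known and sorted; the helper names and the sample numbers are made up.

```python
def email_page_ranges(start_pages, total_pages):
    """Turn sorted 1-based start pages into (first, last) page ranges.

    Each email is assumed to run from its own start page up to the page
    before the next email starts; the last email runs to the end of the
    PDF.
    """
    ranges = []
    for i, first in enumerate(start_pages):
        if i + 1 < len(start_pages):
            # max() guards against two emails starting on the same page
            last = max(first, start_pages[i + 1] - 1)
        else:
            last = total_pages
        ranges.append((first, last))
    return ranges


def format_ranges(ranges):
    """Format ranges the way they were written above, e.g. '14-16, 52-52'."""
    return ", ".join(f"{first}-{last}" for first, last in ranges)


# Made-up example: four emails in a 60-page PDF.
ranges = email_page_ranges([14, 17, 52, 53], total_pages=60)
print(format_ranges(ranges))  # 14-16, 17-51, 52-52, 53-60
```

The resulting list could be pasted into whatever tool does the actual extraction, one range per output file.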
thanks
