Extract and Compare text between two PDFs

Report · Jul 05, 2024

Hi,

I need a tool or a JavaScript to extract and compare the text of two PDFs. In plain language, I want to find out which text is missing.

SourceCopy.pdf – contains the original/source text that should be available in the FinalArtwork.pdf
FinalArtwork.pdf – The final PDF that should hold all the copy that is available in the SourceCopy.pdf

The source and final might contain the same text in multiple places. For example, '10 years' might be available thrice in the SourceCopy.pdf, so it should find three instances in the FinalArtwork.pdf.

So, the script should create a new text file on the desktop containing the missing text. If nothing is missing, then the text file should say, 'Nothing is missing, good work!'

Comparing the files manually, I found that only one line of text is missing from FinalArtwork.pdf, i.e.

Missing Lines:
N/A from Data-File.ai

Comparison complete!

Can you please help me on this. Thanks in advance.

Report · Jul 05, 2024

This is not as simple as it might appear. As soon as a difference is found, it's very difficult to match the rest of the text. There are pre-existing tools that can do it much better than a script in Acrobat. I would extract the text manually (or even using a script) and then use one of those tools for the comparison. You can use Word or even the free Notepad++ for that.
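To illustrate the resynchronization problem mentioned above, here is a minimal sketch of a longest-common-subsequence (LCS) comparison, the general technique that diff-style tools build on. It is not the algorithm of any particular tool. The function name `missingByLcs` is hypothetical, the sample lines are made up (only partly taken from this thread), and the sketch assumes the text has already been extracted into arrays of lines that appear in the same order in both files.

```javascript
// Returns the entries of `source` that are not part of the longest common
// subsequence of `source` and `final`, i.e. what a diff would flag as deleted.
function missingByLcs(source, final) {
	var n = source.length, m = final.length;
	// table[i][j] = length of the LCS of source[i..] and final[j..]
	var table = [];
	for (var i = 0; i <= n; i++) {
		table.push([]);
		for (var j = 0; j <= m; j++) table[i].push(0);
	}
	for (var i = n - 1; i >= 0; i--) {
		for (var j = m - 1; j >= 0; j--) {
			table[i][j] = source[i] === final[j]
				? table[i + 1][j + 1] + 1
				: Math.max(table[i + 1][j], table[i][j + 1]);
		}
	}
	// Walk both lists from the start; source lines that get skipped are missing.
	var missing = [];
	var i = 0, j = 0;
	while (i < n && j < m) {
		if (source[i] === final[j]) { i++; j++; }
		else if (table[i + 1][j] >= table[i][j + 1]) { missing.push(source[i]); i++; }
		else { j++; }
	}
	while (i < n) missing.push(source[i++]);
	return missing;
}

// Illustrative data only.
var source = ["Color: Black", "10 years", "N/A from Data-File.ai", "Made in Poland"];
var final = ["Color: Black", "10 years", "Made in Poland", "Extra footer"];
var missing = missingByLcs(source, final);
// missing is ["N/A from Data-File.ai"]
```

Because the table realigns the two lists after each mismatch, the lines following the missing one still match. The approach remains order-dependent, so it only helps when both files list the text in the same sequence.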

Report · Jul 05, 2024

Have you tried the "Compare Files" tool in Acrobat?

Thom Parker - Software Developer at PDFScripting
Use the Acrobat JavaScript Reference early and often

Report · Jul 08, 2024

Hi Thom, I tried the Acrobat Compare tool, but the results are unsatisfactory. I also tried a couple of other things, but nothing worked.

Hi try67, yes, you're right, I'm trying to achieve this with the same approach. First I extract the text from both PDFs, then I try to compare them and find the missing pieces, but after many attempts nothing has worked.

Report · Jul 08, 2024

If you're interested I could write for you (for a fee) a script that will compare the texts and report the first instance of text that doesn't match, if any differences are found. You can contact me privately via PM to discuss it further.

Report · Jul 08, 2024

Have you tried saving the PDF as "text"? And then using the Windows file compare tool? It does a pretty good job.

Thom Parker - Software Developer at PDFScripting
Use the Acrobat JavaScript Reference early and often

Report · Jul 10, 2024

Hi,

I was looking for an interesting exercise when I came across your request, which seems to be just that.

Before I start on something next weekend, could you let me know a couple of things?

It seems that only the layout differs between the source and final files. Is the order of the text the same in both files when nothing is missing?

Should the script also report text that exists only in the final file, such as the text at the bottom of your final file?

In my opinion, it is possible to write a script that can extract the missing text. Maybe not easy... but doable.

I'll let you know on Monday...

@+

Report · Jul 10, 2024

Hi,
The FinalArtwork could contain any layout.
The text order in the SourceCopy and FinalArtwork will be different.
The Final might contain additional text compared to the text in SourceCopy.

The Final MUST contain all the text that is available in the SourceCopy. If any text/sentence is missing, the user should be notified.

The source might contain multiple occurrences of a text/sentence, and the same text/sentence should then occur the same number of times in the Final artwork. For example, if 'Color: Black' appears twice in the Source copy, the script should find two occurrences of 'Color: Black' in the Final file. If it finds only one occurrence, the second should be reported to the user.
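The requirements above (order may differ, extra text is allowed, repeated text must appear the same number of times) amount to a multiset comparison. The following is a minimal sketch of that idea in plain JavaScript, assuming the sentences have already been extracted into arrays. The names `countOccurrences`, `findMissing` and `report` are hypothetical, and the sample sentences are illustrative.

```javascript
// Count how many times each sentence occurs in a list.
function countOccurrences(sentences) {
	var counts = {};
	for (var i = 0; i < sentences.length; i++) {
		var key = sentences[i];
		if (Object.prototype.hasOwnProperty.call(counts, key)) counts[key]++;
		else counts[key] = 1;
	}
	return counts;
}

// Returns one entry per missing occurrence: a sentence present three times in
// the source and once in the final is reported twice. Order is ignored and
// extra sentences in the final are not reported.
function findMissing(sourceSentences, finalSentences) {
	var remaining = countOccurrences(finalSentences);
	var missing = [];
	for (var i = 0; i < sourceSentences.length; i++) {
		var s = sourceSentences[i];
		if (Object.prototype.hasOwnProperty.call(remaining, s) && remaining[s] > 0) {
			remaining[s]--;          // consume one matching occurrence
		} else {
			missing.push(s);
		}
	}
	return missing;
}

// Builds the report text in the format requested in the first post.
function report(missing) {
	if (missing.length === 0) return "Nothing is missing, good work!";
	return "Missing Lines:\n" + missing.join("\n") + "\n\nComparison complete!";
}

var sourceSentences = ["Color: Black", "10 years", "Color: Black"];
var finalSentences = ["10 years", "Extra text", "Color: Black"];
var missing = findMissing(sourceSentences, finalSentences);
// missing is ["Color: Black"]: the second occurrence has no counterpart
```

This only works on exact string matches, so in practice the sentences would first need the same normalization on both sides (whitespace, accents, control characters).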

Report · Jul 14, 2024

Hi,

I started writing a script for comparing your files and it's progressing quite well.

While doing my tests, I realized there was a problem with some words which use special characters from your alphabet...

As shown in this screenshot, when these words are extracted before comparison, they are not spelled the same way in your source and final files (although they look identical in both files).

For example, the word "Można" is extracted from the SourceCopy file and "Mozna" from the FinalArtwork file, so the two words can't match...

I don't know why in the FinalArtwork file some letters are not extracted correctly! Maybe because of the font...

I will try to replace these letters while comparing words and I'll let you know...
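One possible way to make "Można" and "Mozna" compare as equal is to fold accented letters to their base letters on both sides before comparing. The sketch below uses Unicode decomposition instead of a hand-written letter table. It assumes a JavaScript engine that provides `String.prototype.normalize` (modern browsers and Node.js do; Acrobat's engine may not, in which case an explicit lookup table is needed). The names `EXTRA_LETTERS` and `foldAccents` are hypothetical.

```javascript
// Letters with no canonical decomposition have to be mapped by hand.
var EXTRA_LETTERS = {
	"\u0142": "l",   // ł
	"\u0141": "L"    // Ł
};

function foldAccents(text) {
	return text
		.normalize("NFD")                      // e.g. "ż" becomes "z" + U+0307
		.replace(/[\u0300-\u036f]/g, "")       // drop the combining marks
		.replace(/[\u0141\u0142]/g, function (letter) {
			return EXTRA_LETTERS[letter];
		});
}

var fromSource = "Mo\u017Cna";   // "Można", as extracted from the source file
var fromFinal = "Mozna";         // as extracted from the final file
var sameWord = foldAccents(fromSource) === foldAccents(fromFinal);
// sameWord is true
```

Folding loses information: two genuinely different words that differ only by an accent will also compare as equal, which may or may not be acceptable for proofreading artwork.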

@+

Report · Jul 16, 2024

Hi,

Here is what I have so far. The final layout is still missing, but before I do it, could you check this script on a few files?

You must choose 2 files from the open files:

Then after clicking OK you will get the result in the console window.

The script is a bit more complicated than I thought because of what I tried to explain in my previous answer...

Here is the script you can run from the console window or an action wizard:

// Collect the file names of all open documents.
var lesDocs=[];
var openDocs=app.activeDocs;
for (var d=0; d<openDocs.length; d++) lesDocs.push(openDocs[d].documentFileName);
if (lesDocs.length<2) {
	app.alert("You need to open 2 files to be able to compare them.",3);
} else {
	// Build the popup list directly (no eval, so quotes in file names are harmless).
	// A positive value marks the entry selected by default ("- Select -"),
	// negative values are the unselected entries, one per open document.
	var laListe="- Select -";
	var listeDocuments={};
	listeDocuments[laListe]=lesDocs.length+1;
	for (var i=0; i<lesDocs.length; i++) listeDocuments[lesDocs[i]]=-1*(i+1);
	var bDialogue={
		initialize: function(bDialogue) {
			this.loadDefaults(bDialogue);
		},
		loadDefaults: function(bDialogue) {
			bDialogue.load({
				sour: listeDocuments,
				fina: listeDocuments,
			})
		},
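		// Runs when OK is clicked: read both popups and check the selection.
		validate: function(bDialogue) {
			var oRslt=bDialogue.store();
			var docSource=oRslt["sour"];
			var docFinal=oRslt["fina"];
			var testOK=true;
			// In each popup the selected entry is the one with a positive value.
			// nomSource and nomFinal stay global on purpose: they are used after the dialog closes.
			for (var i in docSource) {
				if (docSource[i]>0) {
					nomSource=i;
					valeurSource=listeDocuments[i];
				}
			}
			for (var i in docFinal) {
				if (docFinal[i]>0) {
					nomFinal=i;
					valeurFinal=listeDocuments[i];
				}
			}
			// A positive value means "- Select -" is still chosen; equal values mean the same file twice.
			if (valeurSource>0 || valeurFinal>0 || valeurSource==valeurFinal) testOK=false;
			if (!testOK) app.alert("Please select 2 different files to compare them.",3);
			return testOK;
		},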
		description: {
			name: "Files Comparison",
			elements: [
				{
					type: "view", //
					elements: [
						{
							type: "view",
							alignment: "align_top",
							elements: [
								{
									type: "static_text",
									name: "Source File",
									font: "dialog",
									bold: true,
								},
								{
									type: "popup",
									item_id: "sour",
									width: 150,
								},
								{
									type: "gap",
									height: 2
								},
								{
									type: "static_text",
									name: "Final File",
									font: "dialog",
									bold: true,
								},
								{
									type: "popup",
									item_id: "fina",
									width: 150,
								},
							]
						},
						{
							type: "gap",
							height: 10
						},
						{
							type: "ok_cancel",
						},
					]
				},
			]
		}
	};
	if("ok"==app.execDialog(bDialogue)){
		var separateur="#@&";
		var hauteurMP=0;
		function remplacementMots(leTexte) {
			return leTexte.replace(/ [^\S]+/g," ").replace(/^(\d+)$/,"$1 ").replace(/\u001E/g,"");
		}
		function remplacementSuites(leTexte) {
			return leTexte.replace(/^\s+|\s+$/g,"").replace(/•(\d|\w)/,"• $1").replace(/[ ]{2,}/g," ");
		}
		function suites(leDoc) {
			var lesSuites=[];
			var laPage=0;
			for (var p=0; p<leDoc.numPages; p++) {
				var aRect=leDoc.getPageBox("Crop",p);
				var basMot=aRect[1];
				for (var i=0; i<leDoc.getPageNumWords(p); i++) {
					var leMot=leDoc.getPageNthWord(p,i,false);
					var q=leDoc.getPageNthWordQuads(p,i);
					m=(new Matrix2D).fromRotated(leDoc,p);
					mInv=m.invert();
					r=mInv.transform(q);
					r=r.toString();
					r=r.split(",");
					var hauteurMot=Number(r[1])-Number(r[5]);
					if (!hauteurMP) var hauteurMP=hauteurMot;
					var deltaHM=(hauteurMot/hauteurMP).toFixed(2);
					var interligne=basMot-Number(r[5]);
					if (deltaHM!=1 || (laPage==p && deltaHM==1 && interligne/hauteurMot>1.2) || /^• /.test(leMot) || (laPage!=p && /^[\w\d]/.test(leMot) && leMot.charAt(0)==leMot.toUpperCase().charAt(0))) {
						lesSuites.push(remplacementMots(leMot));
					} else {
						lesSuites[lesSuites.length-1]+=remplacementMots(leMot);
					}
					if (hauteurMot!=hauteurMP) var hauteurMP=hauteurMot;
					basMot=Number(r[5]);
					var laPage=p;
				}
			}
			for (var i=0; i<lesSuites.length; i++) lesSuites[i]=lesSuites[i].replace(/([\d\w])\.([^pa]|p(?!df)|a(?!i))/ig,"$1."+separateur+"$2");
			for (var i=0; i<lesSuites.length; i++) {
				if (lesSuites[i].indexOf(separateur)>-1) lesSuites[i]=lesSuites[i].split(separateur)
			}
			var suitesDecomposees=[];
			for (var i=0; i<lesSuites.length; i++) {
				if (typeof lesSuites[i]!="object") suitesDecomposees.push(remplacementSuites(lesSuites[i]));
				else {
					for (var j=0; j<lesSuites[i].length; j++) suitesDecomposees.push(remplacementSuites(lesSuites[i][j]));
				}
			}
			return suitesDecomposees;
		}
		//
		for (var d=0; d<openDocs.length; d++) {
			if (openDocs[d].documentFileName==nomSource) suitesSource=suites(openDocs[d]);
			if (openDocs[d].documentFileName==nomFinal) suitesFinal=suites(openDocs[d]);
		}
		// Replace accented Polish and Romanian letters with their base letters,
		// because the two files do not extract them the same way.
		function aRemplacer(leTexte) {
			var lesLettres={
				"ą": "a",
				"ă": "a",
				"ć": "c",
				"ę": "e",
				"ł": "l",
				"ń": "n",
				"ó": "o",
				"ś": "s",
				"ș": "s",
				"ț": "t",
				"ż": "z",
				"ź": "z",
				"Ą": "A",
				"Ă": "A",
				"Ć": "C",
				"Ę": "E",
				"Ł": "L",
				"Ń": "N",
				"Ó": "O",
				"Ś": "S",
				"Ș": "S",
				"Ț": "T",
				"Ż": "Z",
				"Ź": "Z"
			};
			return leTexte.replace(/[ąăćęłńóśșțżźĄĂĆĘŁŃÓŚȘȚŻŹ]/g, function(laLettre) {return lesLettres[laLettre]});
		}
		// Comparison key: accents folded, then everything except letters, digits and "_" removed.
		function cle(leTexte) {
			return aRemplacer(leTexte).replace(/[^\w\d]/g,"");
		}
		// True when both strings have the same length and are equal at every position
		// where the final text does not hold a control character (ASCII < 32).
		function egalSansControle(laSource,leFinal) {
			if (laSource.length!=leFinal.length) return false;
			for (var k=0; k<laSource.length; k++) {
				if (leFinal.charCodeAt(k)>31 && laSource.charAt(k)!=leFinal.charAt(k)) return false;
			}
			return true;
		}
		// Drop source sequences that contain nothing comparable (bullets, punctuation only).
		for (var i=suitesSource.length-1; i>=0; i--) {
			if (cle(suitesSource[i]).length==0) suitesSource.splice(i,1);
		}
		var trouves=[];
		for (var i=0; i<suitesSource.length; i++) {
			var cleSource=cle(suitesSource[i]);
			for (var j=0; j<suitesFinal.length; j++) {
				var cleFinal=cle(suitesFinal[j]);
				// Match when the source sequence starts with the final one (an empty key never matches)
				// or when both are equal once control characters are ignored.
				if ((cleFinal.length>0 && cleSource.indexOf(cleFinal)==0) || egalSansControle(aRemplacer(suitesSource[i]),aRemplacer(suitesFinal[j]))) {
					trouves.push(suitesSource[i]);
					// Consume both entries so repeated text must be present the same number of times.
					suitesSource.splice(i,1);
					suitesFinal.splice(j,1);
					i--;
					break;
				}
			}
		}
		console.clear();
		if (suitesSource.length) {
			console.clear();
			console.show();
			console.println(suitesSource.length+" Missing information of the \""+nomSource+"\" file in the \""+nomFinal+"\" file:");
			for (var i=0; i<suitesSource.length; i++) console.println("\r"+(i+1)+": "+suitesSource[i]);
			console.println("\r\rComparison complete!");
			app.alert("Missing information of the \""+nomSource+"\" file in the \""+nomFinal+"\" file are shown in the console window.",3);
		} else {
			console.hide();
			app.alert("Nothing is missing, good work!",3);
		}
	}
}
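The sentence-splitting step in the script above is easy to misread, so here is a small standalone sketch that isolates the same regular expression and runs it on a few strings. The helper name `decouper` and the variable `finDePhrase` are hypothetical, and the second and third sample strings are made up for illustration. The sketch runs outside Acrobat because it uses no Acrobat objects.

```javascript
var separateur = "#@&";
// Same pattern as in the script: a period after a letter or digit ends a
// sequence, unless it is followed by "pdf" or "ai" (file extensions).
var finDePhrase = /([\d\w])\.([^pa]|p(?!df)|a(?!i))/ig;

function decouper(leTexte) {
	var morceaux = leTexte.replace(finDePhrase, "$1." + separateur + "$2").split(separateur);
	// Trim each piece, as the script does later in remplacementSuites.
	for (var i = 0; i < morceaux.length; i++) {
		morceaux[i] = morceaux[i].replace(/^\s+|\s+$/g, "");
	}
	return morceaux;
}

var a = decouper("N/A from Data-File.ai");     // ["N/A from Data-File.ai"], not split
var b = decouper("Keep dry. Do not bleach");   // ["Keep dry.", "Do not bleach"]
var c = decouper("Weight 3.5 kg");             // ["Weight 3.", "5 kg"]
```

The last case shows a limitation worth knowing about: decimal numbers are cut in two. As long as the source and final files are split by the same rule, the pieces should still pair up, but the reported missing text may then be a fragment rather than a full sentence.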

Let me know!

@+

Report · Jul 16, 2024

Complicating things for you is that your text encoding is different in both files, so a lot of your alternative languages using accents and extra glyphs will not match.
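A quick way to see this kind of mismatch is to dump the character codes of the two extracted strings and look at the first position where they differ. The sketch below is a diagnostic aid only, under the assumption that the two strings have already been extracted; `codeUnits` and `firstDifference` are hypothetical helper names, and the sample strings are the "Można" / "Mozna" pair reported earlier in this thread.

```javascript
// Format each UTF-16 code unit of a string as "U+XXXX".
function codeUnits(text) {
	var out = [];
	for (var i = 0; i < text.length; i++) {
		var hex = text.charCodeAt(i).toString(16).toUpperCase();
		while (hex.length < 4) hex = "0" + hex;
		out.push("U+" + hex);
	}
	return out;
}

// Index of the first position where two strings differ, or -1 if they are identical.
function firstDifference(a, b) {
	var n = Math.min(a.length, b.length);
	for (var i = 0; i < n; i++) {
		if (a.charAt(i) !== b.charAt(i)) return i;
	}
	return a.length === b.length ? -1 : n;
}

var fromSource = "Mo\u017Cna";   // "Można"
var fromFinal = "Mozna";
var at = firstDifference(fromSource, fromFinal);   // 2
var sourceCode = codeUnits(fromSource)[at];        // "U+017C"
var finalCode = codeUnits(fromFinal)[at];          // "U+007A"
```

Seeing the actual codes makes it clear whether the final file yields a plain base letter, a different precomposed letter, or a stray control character, which in turn tells you what kind of normalization the comparison needs.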

Adobe Community
