Extract and Compare text between two PDFs

Participant, Jul 05, 2024

Hi,

I need a tool or a JavaScript script to extract and compare the text of two PDFs. Put simply, I want to find out which text is missing.

 

SourceCopy.pdf – contains the original/source text that should be available in the FinalArtwork.pdf
FinalArtwork.pdf – The final PDF that should hold all the copy that is available in the SourceCopy.pdf


The source and final might contain the same text in multiple places. For example, '10 years' might be available thrice in the SourceCopy.pdf, so it should find three instances in the FinalArtwork.pdf.
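The occurrence rule above can be sketched as a small counting helper. This is only an illustration, assuming the text of each PDF has already been extracted into a plain string; `countOccurrences` and the sample strings are hypothetical and not part of any Acrobat API.

```javascript
// Count non-overlapping occurrences of a phrase in a block of extracted text.
function countOccurrences(text, phrase) {
	if (phrase.length === 0) return 0; // an empty phrase would otherwise never advance
	var count = 0;
	var pos = text.indexOf(phrase);
	while (pos !== -1) {
		count++;
		pos = text.indexOf(phrase, pos + phrase.length);
	}
	return count;
}

// Hypothetical sample strings standing in for the text of each PDF.
var sourceText = "Warranty: 10 years. Battery: 10 years. Frame: 10 years.";
var finalText = "Frame: 10 years. Warranty: 10 years.";

// Number of occurrences the final file is short of.
var missingCount = countOccurrences(sourceText, "10 years") - countOccurrences(finalText, "10 years");
```

A positive `missingCount` would mean the final file has fewer occurrences than the source.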

 

So, the script should create a new text file on the desktop containing the missing text. If nothing is missing, then the text file should say, 'Nothing is missing, good work!'

 

On comparing the files manually, I found that only one line of text is missing from FinalArtwork.pdf, i.e.:

Missing Lines:
N/A from Data-File.ai

Comparison complete!

 

Can you please help me with this? Thanks in advance.

TOPICS
JavaScript , PDF

Community guidelines
Be kind and respectful, give credit to the original source of content, and search for duplicates before posting. Learn more

Community Expert, Jul 05, 2024

This is not as simple as it might appear. As soon as a difference is found, it's very difficult to match the rest of the text. There are pre-existing tools that can do it much better than a script in Acrobat. I would extract the text manually (or even using a script) and then use one of those tools for the comparison. You can use Word or even the free Notepad++ for that.
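To illustrate why resynchronizing after a difference takes more than a position-by-position comparison, here is a rough sketch of what order-sensitive diff tools typically do: compute a longest common subsequence (LCS) of the two line lists and report the source lines left outside it. The function name `removedEntries` and the sample lines are assumptions for illustration; this is not the algorithm of any specific tool mentioned above.

```javascript
// Return the entries of `a` that are not part of a longest common subsequence with `b`.
function removedEntries(a, b) {
	var n = a.length, m = b.length, i, j;
	// lcs[i][j] holds the LCS length of a[i..] and b[j..].
	var lcs = [];
	for (i = 0; i <= n; i++) {
		lcs[i] = [];
		for (j = 0; j <= m; j++) lcs[i][j] = 0;
	}
	for (i = n - 1; i >= 0; i--) {
		for (j = m - 1; j >= 0; j--) {
			lcs[i][j] = a[i] === b[j] ? lcs[i + 1][j + 1] + 1 : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
		}
	}
	// Walk both lists; entries of `a` skipped on the way are the removed ones.
	var removed = [];
	i = 0;
	j = 0;
	while (i < n && j < m) {
		if (a[i] === b[j]) { i++; j++; }
		else if (lcs[i + 1][j] >= lcs[i][j + 1]) { removed.push(a[i]); i++; }
		else { j++; }
	}
	while (i < n) removed.push(a[i++]);
	return removed;
}

// Hypothetical extracted lines.
var sourceLines = ["Title", "Color: Black", "N/A from Data-File.ai", "10 years"];
var finalLines = ["Title", "Color: Black", "Extra line", "10 years"];
var removedLines = removedEntries(sourceLines, finalLines);
```

This approach only helps when both files keep the text in the same order.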

Community Expert, Jul 05, 2024

Have you tried the "Compare Files" tool in Acrobat? 

 

Thom Parker - Software Developer at PDFScripting
Use the Acrobat JavaScript Reference early and often

Participant, Jul 08, 2024

Hi Thom, I tried the Acrobat Compare tool, but the results are unsatisfactory. I also tried a couple of other things, but nothing worked.

 

Hi try67, yes, you're right, I'm trying to achieve this with the same approach. First I extract the text from both PDFs, then I try to compare them and find the missing text, but after many attempts I still have no usable result.
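For the extraction step, a minimal sketch could look like the following. It relies on the Acrobat `Doc` members `numPages`, `getPageNumWords`, and `getPageNthWord`; the helper name `extractWords` is hypothetical, and the function is written so that any object exposing those three members can be passed in.

```javascript
// Collect every word of a document, page by page.
// `doc` is expected to expose numPages, getPageNumWords, and getPageNthWord.
function extractWords(doc) {
	var words = [];
	for (var p = 0; p < doc.numPages; p++) {
		var count = doc.getPageNumWords(p);
		for (var i = 0; i < count; i++) {
			// Third argument true asks for punctuation and white space to be stripped from the word.
			words.push(doc.getPageNthWord(p, i, true));
		}
	}
	return words;
}
```

In the Acrobat console this would be called as `extractWords(this)` on the active document. Note that words come back in the order stored in the file, which is not necessarily the visual reading order.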

Community Expert, Jul 08, 2024

If you're interested I could write for you (for a fee) a script that will compare the texts and report the first instance of text that doesn't match, if any differences are found. You can contact me privately via PM to discuss it further.

Community Expert, Jul 08, 2024

Have you tried saving the PDF as "text" and then using the Windows file compare tool? It does a pretty good job.

 

Thom Parker - Software Developer at PDFScripting
Use the Acrobat JavaScript Reference early and often
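Saving as text can also be scripted. The sketch below assumes the `Doc.saveAs` method with the plain-text conversion ID `com.adobe.acrobat.plain-text`; the helper name `saveAsPlainText` and the example path are hypothetical. `saveAs` is a restricted method, so this is expected to work only from a privileged context such as the console or a batch action.

```javascript
// Save a document as plain text through the Doc.saveAs conversion.
// `txtPath` is a device-independent path, for example "/c/temp/SourceCopy.txt" (hypothetical).
function saveAsPlainText(doc, txtPath) {
	doc.saveAs({
		cPath: txtPath,
		cConvID: "com.adobe.acrobat.plain-text"
	});
	return txtPath;
}
```

The two resulting text files can then be handed to any file compare tool.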

Community Expert, Jul 10, 2024

Hi,

I was looking for an interesting exercise when I came across your request, which seems to be one.

Before I start on something next weekend, could you let me know a couple of things?

It seems that only the layout differs between the source and final files. If nothing is missing, is the order of the text the same in both files?

Should the script also report text that appears only in the final file, such as the text at the bottom of your final file?

[Screenshot: Capture d’écran 2024-07-10 à 21.33.49.png]

In my opinion, it is possible to write a script that can extract the missing text. Maybe not easy... but doable.

I'll let you know on Monday...

@+

Participant, Jul 10, 2024

Hi,
The FinalArtwork could have any layout.
The text order in the SourceCopy and the FinalArtwork will be different.
The FinalArtwork might contain additional text compared to the SourceCopy.

The FinalArtwork MUST contain all the text that is in the SourceCopy. If any text or sentence is missing, the user should be notified.

The SourceCopy might contain multiple occurrences of a text or sentence, and the same text or sentence should then occur the same number of times in the FinalArtwork. For example, if a phrase, say 'Color: Black', appears twice in the SourceCopy, the script should find two occurrences of 'Color: Black' in the FinalArtwork. If it finds only one occurrence, the user should be notified about the missing second one.
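These rules (order ignored, extra text allowed, occurrence counts respected) amount to a multiset difference. Here is a minimal sketch, assuming both files have already been split into comparable entries such as sentences; `findMissing` and `buildReport` are hypothetical names, and the report wording simply reuses the messages quoted earlier in this thread.

```javascript
// Return every source entry that has no remaining counterpart in the final list.
// Order is ignored; each final entry can satisfy only one source entry.
function findMissing(sourceEntries, finalEntries) {
	var available = {};
	var i, key;
	for (i = 0; i < finalEntries.length; i++) {
		key = "$" + finalEntries[i]; // prefix avoids clashes with built-in property names
		available[key] = (available[key] || 0) + 1;
	}
	var missing = [];
	for (i = 0; i < sourceEntries.length; i++) {
		key = "$" + sourceEntries[i];
		if (available[key] > 0) available[key]--;
		else missing.push(sourceEntries[i]);
	}
	return missing;
}

// Build the text of the report described in the original request.
function buildReport(missing) {
	if (missing.length === 0) return "Nothing is missing, good work!";
	return "Missing Lines:\n" + missing.join("\n") + "\n\nComparison complete!";
}
```

With a source of two 'Color: Black' entries and a final file holding only one, `findMissing` would return the second 'Color: Black' as missing, while additional entries in the final file are ignored.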

Community Expert, Jul 14, 2024

Hi,

I started writing a script for comparing your files and it's progressing quite well.

[Screenshot: Capture d’écran 2024-07-14 à 16.32.15.png]

While running my tests, I realized there is a problem with some words that use special characters from your alphabet...

[Screenshot: Capture_d’écran_2024-07-14_à_16_47_08.png]

As the screenshot shows, these words are not extracted in the same way from your source and final files, even though they look identical in both files.

For example, the word "Można" is extracted from the SourceCopy file and "Mozna" from the FinalArtwork file, so the two words can't match...

I don't know why some letters are not extracted correctly from the FinalArtwork file! Maybe because of the font...

I will try to replace these letters while comparing words and I'll let you know...

@+
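One possible way to make "Można" and "Mozna" comparable is to fold accented letters to their base letters on both sides before comparing. The sketch below assumes a JavaScript engine that provides `String.prototype.normalize` (ES2015), which may not be the case in Acrobat's engine; there, an explicit letter map would be needed for every character. The names `foldDiacritics` and `extraLetters` are hypothetical.

```javascript
// Letters that Unicode decomposition (NFD) leaves unchanged and that therefore need an explicit mapping.
var extraLetters = { "ł": "l", "Ł": "L" };

function foldDiacritics(text) {
	// Split accented letters into base letter + combining mark, then drop the marks (U+0300 to U+036F).
	var folded = text.normalize("NFD").replace(/[\u0300-\u036f]/g, "");
	// Map the remaining special letters by hand.
	return folded.replace(/[łŁ]/g, function (letter) { return extraLetters[letter]; });
}
```

Comparing `foldDiacritics(a)` with `foldDiacritics(b)` treats the two extractions of the same word as equal, at the cost of no longer detecting a genuinely missing accent.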

Community Expert, Jul 16, 2024

Hi,

Here is what I have so far. The final layout is still missing, but before I do it, could you check this script on a few files?

You must choose 2 files from the open files:

[Screenshot: Capture d’écran 2024-07-16 à 21.15.55.png]

[Screenshot: Capture d’écran 2024-07-16 à 21.12.53.png]

Then, after clicking OK, you will get the result in the console window.

[Screenshot: Capture d’écran 2024-07-16 à 21.13.12.png]

The script is a bit more complicated than I thought because of what I tried to explain in my previous answer...

Here is the script, which you can run from the console window or an Action Wizard action:

lesDocs=[];
openDocs=app.activeDocs;
for (var d=0; d<openDocs.length; d++) lesDocs.push(openDocs[d].documentFileName);
if (lesDocs.length<2) {
	app.alert("You need to open 2 files to be able to compare them.",3);
} else {
	var laListe="- Select -";
	// Build the popup item map directly instead of assembling source text for eval().
	// The positive value marks the initially selected item; document names get negative values.
	var listeDocuments={};
	listeDocuments[laListe]=lesDocs.length+1;
	for (var i=0; i<lesDocs.length; i++) {
		listeDocuments[lesDocs[i]]=-1*(i+1);
	}
	var bDialogue={
		initialize: function(bDialogue) {
			this.loadDefaults(bDialogue);
		},
		loadDefaults: function(bDialogue) {
			bDialogue.load({
				sour: listeDocuments,
				fina: listeDocuments,
			})
		},
		validate: function(bDialogue) {
			// In each stored popup map, the selected entry is the one with a positive value.
			var oRslt=bDialogue.store();
			var docSource=oRslt["sour"];
			var docFinal=oRslt["fina"];
			var testOK=true;
			for (var i in docSource) {
				if (docSource[i]>0) {
					nomSource=i;
					valeurSource=listeDocuments[i];
				}
			}
			for (var i in docFinal) {
				if (docFinal[i]>0) {
					nomFinal=i;
					valeurFinal=listeDocuments[i];
				}
			}
			// A positive original value means "- Select -" is still chosen; equal values mean the same file was picked twice.
			if (valeurSource>0 || valeurFinal>0 || valeurSource==valeurFinal) testOK=false;
			if (!testOK) app.alert("Please select 2 different files to compare them.",3);
			return testOK;
		},
		description: {
			name: "Files Comparison",
			elements: [
				{
					type: "view", //
					elements: [
						{
							type: "view",
							alignment: "align_top",
							elements: [
								{
									type: "static_text",
									name: "Source File",
									font: "dialog",
									bold: true,
								},
								{
									type: "popup",
									item_id: "sour",
									width: 150,
								},
								{
									type: "gap",
									height: 2
								},
								{
									type: "static_text",
									name: "Final File",
									font: "dialog",
									bold: true,
								},
								{
									type: "popup",
									item_id: "fina",
									width: 150,
								},
							]
						},
						{
							type: "gap",
							height: 10
						},
						{
							type: "ok_cancel",
						},
					]
				},
			]
		}
	};
	if("ok"==app.execDialog(bDialogue)){
		var separateur="#@&";
		var hauteurMP=0;
		function remplacementMots(leTexte) {
			return leTexte.replace(/ [^\S]+/g," ").replace(/^(\d+)$/,"$1 ").replace(/\u001E/g,"");
		}
		function remplacementSuites(leTexte) {
			return leTexte.replace(/^\s+|\s+$/g,"").replace(/•(\d|\w)/,"• $1").replace(/[ ]{2,}/g," ");
		}
		function suites(leDoc) {
			var lesSuites=[];
			var laPage=0;
			for (var p=0; p<leDoc.numPages; p++) {
				var aRect=leDoc.getPageBox("Crop",p);
				var basMot=aRect[1];
				for (var i=0; i<leDoc.getPageNumWords(p); i++) {
					var leMot=leDoc.getPageNthWord(p,i,false);
					var q=leDoc.getPageNthWordQuads(p,i);
					m=(new Matrix2D).fromRotated(leDoc,p);
					mInv=m.invert();
					r=mInv.transform(q);
					r=r.toString();
					r=r.split(",");
					var hauteurMot=Number(r[1])-Number(r[5]);
					if (!hauteurMP) var hauteurMP=hauteurMot;
					var deltaHM=(hauteurMot/hauteurMP).toFixed(2);
					var interligne=basMot-Number(r[5]);
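					// Start a new sequence for the first word, a change in word height, a jump in line spacing,
					// a bullet, or a capitalized word on a new page; otherwise append to the current sequence.
					if (lesSuites.length==0 || deltaHM!=1 || (laPage==p && deltaHM==1 && interligne/hauteurMot>1.2) || /^• /.test(leMot) || (laPage!=p && /^[\w\d]/.test(leMot) && leMot.charAt(0)==leMot.toUpperCase().charAt(0))) {
						lesSuites.push(remplacementMots(leMot));
					} else {
						lesSuites[lesSuites.length-1]+=remplacementMots(leMot);
					}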
					if (hauteurMot!=hauteurMP) var hauteurMP=hauteurMot;
					basMot=Number(r[5]);
					var laPage=p;
				}
			}
			for (var i=0; i<lesSuites.length; i++) lesSuites[i]=lesSuites[i].replace(/([\d\w])\.([^pa]|p(?!df)|a(?!i))/ig,"$1."+separateur+"$2");
			for (var i=0; i<lesSuites.length; i++) {
				if (lesSuites[i].indexOf(separateur)>-1) lesSuites[i]=lesSuites[i].split(separateur)
			}
			var suitesDecomposees=[];
			for (var i=0; i<lesSuites.length; i++) {
				if (typeof lesSuites[i]!="object") suitesDecomposees.push(remplacementSuites(lesSuites[i]));
				else {
					for (var j=0; j<lesSuites[i].length; j++) suitesDecomposees.push(remplacementSuites(lesSuites[i][j]));
				}
			}
			return suitesDecomposees;
		}
		//
		for (var d=0; d<openDocs.length; d++) {
			if (openDocs[d].documentFileName==nomSource) suitesSource=suites(openDocs[d]);
			if (openDocs[d].documentFileName==nomFinal) suitesFinal=suites(openDocs[d]);
		}
		// Replace accented letters with their base letters so both extractions can be compared.
		// Every character in the regular expression below must have an entry in lesLettres.
		function aRemplacer(leTexte) {
			var lesLettres={
				"ą": "a",
				"ă": "a",
				"ć": "c",
				"ę": "e",
				"ł": "l",
				"ń": "n",
				"ó": "o",
				"ś": "s",
				"ș": "s",
				"ț": "t",
				"ż": "z",
				"ź": "z",
				"Ą": "A",
				"Ć": "C",
				"Ę": "E",
				"Ł": "L",
				"Ń": "N",
				"Ó": "O",
				"Ś": "S",
				"Ż": "Z",
				"Ź": "Z"
			};
			return leTexte.replace(/[ąăćęłńóśșțżźĄĆĘŁŃÓŚŻŹ]/g, function(laLettre) {return lesLettres[laLettre]});
		}
		var trouves=[];
		// Keep only ASCII letters, digits, and underscores so punctuation and spacing differences are ignored.
		function normaliser(leTexte) {
			return aRemplacer(leTexte).replace(/[^\w\d]/g,"");
		}
		for (var i=0; i<suitesSource.length; i++) {
			var cleSource=normaliser(suitesSource[i]);
			// Discard source sequences with nothing left to compare (punctuation only).
			if (cleSource.length==0) {
				suitesSource.splice(i,1);
				i--;
				continue;
			}
			var texteSource=aRemplacer(suitesSource[i]);
			for (var j=0; j<suitesFinal.length; j++) {
				var cleFinal=normaliser(suitesFinal[j]);
				var texteFinal=aRemplacer(suitesFinal[j]);
				// An empty final key is skipped: indexOf("") is always 0 and would match any source sequence.
				var correspond=cleFinal.length>0 && cleSource.indexOf(cleFinal)==0;
				if (!correspond && texteSource.length==texteFinal.length) {
					// Second check: compare position by position, ignoring positions where the final text holds a control character (code < 32).
					correspond=true;
					for (var k=0; k<texteFinal.length; k++) {
						if (texteFinal.charCodeAt(k)>31 && texteSource.charAt(k)!=texteFinal.charAt(k)) {
							correspond=false;
							break;
						}
					}
				}
				if (correspond) {
					trouves.push(suitesSource[i]);
					suitesSource.splice(i,1);
					suitesFinal.splice(j,1);
					i--;
					break;
				}
			}
		}
		console.clear();
		if (suitesSource.length) {
			console.show();
			console.println(suitesSource.length+" item(s) from the \""+nomSource+"\" file missing in the \""+nomFinal+"\" file:");
			for (var i=0; i<suitesSource.length; i++) console.println("\r"+(i+1)+": "+suitesSource[i]);
			console.println("\r\rComparison complete!");
			app.alert("The items from the \""+nomSource+"\" file that are missing in the \""+nomFinal+"\" file are shown in the console window.",3);
		} else {
			console.hide();
			app.alert("Nothing is missing, good work!",3);
		}
	}
}

Let me know!

@+

Community Expert, Jul 16, 2024

Complicating things for you is that your text encoding is different in both files, so a lot of your alternative languages using accents and extra glyphs will not match.
