Inspiring
July 5, 2024
Question

Extract and Compare text between two PDFs


Hi,

I need a tool or JavaScript to extract and compare the text of two PDFs. In plain language, I want to find out which text is missing.

 

SourceCopy.pdf – contains the original/source text that should be available in the FinalArtwork.pdf
FinalArtwork.pdf – The final PDF that should hold all the copy that is available in the SourceCopy.pdf


The source and final might contain the same text in multiple places. For example, '10 years' might appear three times in the SourceCopy.pdf, so the script should find three instances in the FinalArtwork.pdf.

 

So, the script should create a new text file on the desktop containing the missing text. If nothing is missing, then the text file should say, 'Nothing is missing, good work!'

 

On comparing the files manually, I figured out that only one line of text is missing in the FinalArtwork.pdf, i.e.

Missing Lines:
N/A from Data-File.ai

Comparison complete!

 

Can you please help me with this? Thanks in advance.


4 replies

bebarth
Community Expert
July 10, 2024

Hi,

I was looking for an interesting exercise when I came across your request, which seems to fit the bill.

Before I start on something next weekend, could you let me know a couple of things?

It seems only the layout differs between the source and final files. Is the order of the text the same in both files when nothing is missing?

Should we report text that appears only in the final file, such as the text at the bottom of your final file?

In my opinion, it is possible to write a script that can extract the missing text. Maybe not easy... but doable.

I'll let you know on Monday...

@+

Inspiring
July 10, 2024

Hi,
The FinalArtwork could contain any layout.
The text order in the SourceCopy and FinalArtwork will be different.
The Final might contain additional text compared to the text in SourceCopy.

The Final MUST contain all the text that is available in the SourceCopy. If any text/sentence is missing, then the user should be notified.

The source might contain multiple occurrences of a text/sentence, so the same text/sentence should occur the same number of times in the Final artwork. For example, if a phrase, say 'Color: Black', appears twice in the Source copy, then the script should find two occurrences of 'Color: Black' in the Final file. If it finds only one occurrence, then the second should be reported to the user as missing.
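
The occurrence rule above amounts to a multiset comparison, which could be sketched roughly as follows. This is only a sketch in plain JavaScript, not tested inside Acrobat: `findMissing` and `buildReport` are hypothetical helper names, and it assumes the text of both PDFs has already been extracted into arrays of lines or phrases.

```javascript
// Sketch only: compares two arrays of text lines as multisets.
// findMissing and buildReport are hypothetical names, not Acrobat APIs.
function findMissing(sourceLines, finalLines) {
  // Count how many times each (trimmed, non-empty) line occurs in the final file.
  var available = Object.create(null);
  finalLines.forEach(function (line) {
    var key = line.trim();
    if (key) available[key] = (available[key] || 0) + 1;
  });

  // Each source line "uses up" one matching occurrence in the final file.
  // A source line with no occurrence left is reported as missing.
  var missing = [];
  sourceLines.forEach(function (line) {
    var key = line.trim();
    if (!key) return;
    if (available[key] > 0) {
      available[key] -= 1;
    } else {
      missing.push(key);
    }
  });
  return missing;
}

// Builds the report text using the wording requested in the original post.
function buildReport(missing) {
  if (missing.length === 0) return "Nothing is missing, good work!";
  return "Missing Lines:\n" + missing.join("\n");
}
```

Because the order of the final lines is never consulted, a different layout or text order in the Final file does not matter, and extra text in the Final file is simply ignored.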

bebarth
Community Expert
July 14, 2024

Hi,

I started writing a script for comparing your files and it's progressing quite well.

While doing my tests, I realized there was a problem with some words which use special characters from your alphabet...

As shown in this screenshot, these words are not extracted in the same way from your source and final files (even though they are written identically in both files).

For example, the word "Można" is extracted from the SourceCopy file and "Mozna" from the FinalArtwork file. So these two words can't match...

I don't know why in the FinalArtwork file some letters are not extracted correctly! Maybe because of the font...

I will try to replace these letters while comparing words and I'll let you know...
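
One possible way to replace such letters before comparing is to fold accented characters down to their base letters on both sides. The sketch below is plain JavaScript and is not the script mentioned in this reply: `foldDiacritics` is a hypothetical helper name, and it assumes `String.prototype.normalize` is available in the runtime, which I have not checked for Acrobat's engine.

```javascript
// Sketch only: foldDiacritics is a hypothetical helper, not an Acrobat API.
// Letters such as "ł" have no Unicode decomposition, so they need an explicit map.
var EXTRA_FOLDS = { "\u0142": "l", "\u0141": "L" }; // ł, Ł

function foldDiacritics(text) {
  return text
    // Split accented letters into base letter + combining mark (e.g. ż -> z + U+0307).
    .normalize("NFD")
    // Drop the combining marks, leaving only the base letters.
    .replace(/[\u0300-\u036f]/g, "")
    // Handle letters that NFD leaves untouched.
    .replace(/[\u0141\u0142]/g, function (ch) { return EXTRA_FOLDS[ch]; });
}
```

Applying the same folding to the text from both files makes "Można" and "Mozna" compare as equal, at the cost of no longer detecting a genuinely wrong accent in the artwork.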

@+

Inspiring
July 8, 2024

Hi Thom, I tried the Acrobat Compare tool, but the results are unsatisfactory. I also tried a couple of other things, but nothing worked.

 

Hi try67, Yes, you're right, I'm trying to achieve this with the same approach. First, I extract the text from both PDFs, and then I try to compare them and find the missing text, but after many attempts, nothing has worked.

Thom Parker
Community Expert
July 8, 2024

Have you tried saving the PDF as "text" and then using the Windows file compare tool? It does a pretty good job.

 

Thom Parker - Software Developer at PDFScripting
Use the Acrobat JavaScript Reference early and often
Thom Parker
Community Expert
July 5, 2024

Have you tried the "Compare Files" tool in Acrobat? 

 

try67
Community Expert
July 5, 2024

This is not as simple as it might appear. As soon as a difference is found, it's very difficult to match the rest of the text. There are pre-existing tools that can do it much better than a script in Acrobat. I would extract the text manually (or even using a script) and then use one of those tools for the comparison. You can use Word or even the free Notepad++ for that.
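
For the "extract the text using a script" step, a minimal sketch might look like this. It relies on the Acrobat JavaScript `Doc` members `numPages`, `getPageNumWords` and `getPageNthWord`; the function name `extractPageWords` is my own, and passing the document in as a parameter (rather than using `this`) is just an assumption to keep the function easy to test.

```javascript
// Sketch only: extractPageWords is a hypothetical helper name.
// `doc` is expected to be an Acrobat Doc object (e.g. `this` in a document script).
function extractPageWords(doc) {
  var pages = [];
  for (var p = 0; p < doc.numPages; p++) {
    var words = [];
    var count = doc.getPageNumWords(p);
    for (var i = 0; i < count; i++) {
      // Third argument (bStrip) = true removes punctuation and white space
      // around each word; pass false to keep them.
      words.push(doc.getPageNthWord(p, i, true));
    }
    // One string of space-separated words per page.
    pages.push(words.join(" "));
  }
  return pages;
}
```

The resulting text could then be saved or pasted into separate files for each PDF and handed to an external comparison tool, as suggested above.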