Inspiring
March 7, 2024
Answered

Detect duplicate GIF files without OCR

  • March 7, 2024
  • 3 replies
  • 2728 views

I have a large number of GIF files in a folder; each GIF contains a year number like the following: 

 

But among the GIFs in that folder, each year may be repeated hundreds of times. 

Now I want to detect duplicate GIF files with Photoshop, without OCR. 

Are there any ideas on how to do this? 

For example: open each GIF file in Photoshop, select the whole year number, save specific characteristics or coordinates of the selected area to a text file, then compare all of those characteristics or coordinates and find the duplicate GIFs. 

 

Please note: don't ask why I want to do this, what the source of these GIF files is, or why I don't want to use OCR. 

OCR is very weak for roughly 13,000 GIF files; it may not be accurate and may leave some GIFs unscanned! 

 

I've attached 10 GIF files to this post for testing. 

 

This topic has been closed for replies.
Correct answer Pubg32486011zfgs

The trouble is the line in your images. They won't create the same hash.


Now I've written the following script, which compares each pair of adjacent GIF files in a folder very quickly! 

import cv2
import os
from concurrent.futures import ProcessPoolExecutor, as_completed
import keyboard  # third-party; may require elevated privileges on Linux

def calculate_similarity_percentage(file1, file2):
    # Read the first frame of each GIF, then release the captures
    gif1 = cv2.VideoCapture(file1)
    gif2 = cv2.VideoCapture(file2)
    ok1, frame1 = gif1.read()
    ok2, frame2 = gif2.read()
    gif1.release()
    gif2.release()

    # Treat an unreadable file as completely different
    if not ok1 or not ok2 or frame1 is None or frame2 is None:
        return 0.0

    # Resize frames to a common size
    frame1 = cv2.resize(frame1, (640, 480))
    frame2 = cv2.resize(frame2, (640, 480))

    # Convert frames to grayscale
    gray1 = cv2.cvtColor(frame1, cv2.COLOR_BGR2GRAY)
    gray2 = cv2.cvtColor(frame2, cv2.COLOR_BGR2GRAY)

    # Apply Gaussian blur to suppress minor pixel noise
    blurred1 = cv2.GaussianBlur(gray1, (15, 15), 0)
    blurred2 = cv2.GaussianBlur(gray2, (15, 15), 0)

    # Find the absolute difference between the two blurred frames
    diff = cv2.absdiff(blurred1, blurred2)

    # Threshold the difference image
    _, thresholded = cv2.threshold(diff, 30, 255, cv2.THRESH_BINARY)

    # Calculate the percentage of pixels that differ
    total_pixels = thresholded.size
    non_zero_pixels = cv2.countNonZero(thresholded)
    difference_percentage = (non_zero_pixels / total_pixels) * 100

    # Similarity is the complement of the difference
    return 100 - difference_percentage

def compare_files_chunk(files_chunk, directory):
    results = []
    for i in range(1, len(files_chunk)):
        file_path1 = os.path.join(directory, files_chunk[i - 1])
        file_path2 = os.path.join(directory, files_chunk[i])
        similarity_percentage = calculate_similarity_percentage(file_path1, file_path2)
        results.append(f"Comparison between {files_chunk[i - 1]} and {files_chunk[i]}: {similarity_percentage:.2f}%")

    return results

def compare_all_files(directory, output_file):
    files = sorted([f for f in os.listdir(directory) if f.lower().endswith('.gif')])
    num_files = len(files)
    chunk_size = min(num_files, os.cpu_count() * 4)  # Adjust based on your system's capabilities

    with open(output_file, 'w') as f_out, ProcessPoolExecutor() as executor:
        futures = []

        for i in range(0, num_files, chunk_size):
            # Overlap chunks by one file so the pair that spans a
            # chunk boundary is not skipped
            files_chunk = files[i:i + chunk_size + 1]
            future = executor.submit(compare_files_chunk, files_chunk, directory)
            futures.append(future)

        for future in as_completed(futures):
            results = future.result()
            f_out.write('\n'.join(results) + '\n')

            # Check for 'F6' key press to stop the process
            if keyboard.is_pressed('F6'):
                print("Process stopped by user.")
                return

def main():
    directory = r'E:\Desktop\Armies\L1816_2'
    output_file = r'E:\Desktop\Armies\comparison_results.txt'

    try:
        compare_all_files(directory, output_file)
        print("Comparison completed. Results saved in:", output_file)

    except Exception as e:
        print(f"Error: {str(e)}")

if __name__ == "__main__":
    main()

3 replies

Abambo
Community Expert
March 7, 2024

How many different years are there? If you have only several years, you could create a mask for each year, you mask out the lines, save the resulting file as a PNG. Those with the same size are probably the same.

ABAMBO | Hard- and Software Engineer | Photographer
Inspiring
March 7, 2024

from 1816 to 2023

Abambo
Community Expert
March 8, 2024

It would be faster to recreate those files. 

ABAMBO | Hard- and Software Engineer | Photographer
Abambo
Community Expert
March 7, 2024
quote

why I don't want to use OCR

OCR is very weak for roughly 13,000 GIF files; it may not be accurate and may leave some GIFs unscanned!

By @Pubg32486011zfgs

Well, if you want a really accurate way to do this without OCR, I would say you will need to check each file and note the year in an Excel file. That is the only accurate method. I assume your file names are sequential, and your files are not always byte-identical (the line varies). That rules out the very practical option of calculating a checksum.

 

With 13,000 files and a minute per file (which would be a lot), you could do it in about a month (less if you have more people working for you).

 

 

ABAMBO | Hard- and Software Engineer | Photographer
Stephen Marsh
Community Expert
Community Expert
March 7, 2024

Is there only one duplicate number? 1817?

 

If that is the target number, then perhaps a script could layer each file over a single target master file in Difference blend mode and then compare the histogram mean or standard-deviation value; that may identify significant differences between numbers such as 1819 and 1817. Duplicate images would have a smaller value than non-duplicates.


A script could then log the value to a .csv file with the filename for examination in a spreadsheet.

 

If there are multiple duplicate sets, then I don't think that this would work.

Inspiring
March 7, 2024

There are multiple duplicate sets.

Stephen Marsh
Community Expert
March 7, 2024

Have you looked into 3rd party software designed to find duplicate files (I realise that this will likely use OCR or ML/AI etc)?