
Detect duplicate GIF files without OCR

Community Beginner, Mar 07, 2024

I have a high number of GIF files in a folder, and each GIF includes a year number, like the following:

00001 copy.gif, 00003 copy.gif, 00063 copy.gif, 00123 copy.gif

But among the GIFs in that folder, any given year may be repeated hundreds of times.

Now I want to detect duplicate GIF files with Photoshop, without OCR.

Does anyone have an idea for how to do this?

For example: open each GIF file in Photoshop, select the whole year number, save specific characteristics or coordinates of the selected section to a TXT file, and then compare all of those characteristics or coordinates to find the duplicate GIFs.

 

Note: please don't ask why I want to do this, what the source of these GIF files is, or why I don't want to use OCR.

OCR is very weak for about 13,000 GIF files, and it may not be accurate or may leave some GIFs unscanned!

 

I have attached 10 GIF files to this post for testing.
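The idea described above, comparing a fixed region that contains the year number, can be sketched in Python without Photoshop. This is a rough illustration on synthetic arrays; the crop coordinates and threshold are hypothetical and would need tuning for the real GIFs:

```python
import numpy as np

def region_signature(frame, box=(10, 40, 5, 60)):
    """Return a signature of the (hypothetical) year-number region.

    Identical regions yield identical signatures, so duplicates can be
    found by grouping files on this value.
    """
    top, bottom, left, right = box
    region = frame[top:bottom, left:right]
    # Binarize so tiny brightness differences do not change the signature
    return (region > 127).tobytes()

# Two synthetic 100x100 grayscale "frames" with the same year region
a = np.zeros((100, 100), dtype=np.uint8)
b = np.zeros((100, 100), dtype=np.uint8)
a[15:30, 10:50] = 255
b[15:30, 10:50] = 255
b[80:90, 80:90] = 200  # a difference outside the year region

print(region_signature(a) == region_signature(b))  # True: same year region
```

Signatures for all 13,000 files could then be collected into a dict keyed by the signature bytes; any key that accumulates more than one filename is a duplicate set.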

 

TOPICS
Windows
Community guidelines
Be kind and respectful, give credit to the original source of content, and search for duplicates before posting. Learn more
community guidelines

1 Correct answer

Community Beginner, Mar 08, 2024

Now I wrote the following script, which checks each pair of GIF files in a folder at very high speed (the full script appears in the final reply below).
Community Expert, Mar 07, 2024

Is there only one duplicate number? 1817?

 

If that is the target number, then perhaps a script that layers each file over a single target master file in Difference blend mode and then compares the histogram mean or standard-deviation value could identify significant differences between numbers such as 1819 and 1817. Duplicate images would have a smaller value than non-duplicates.


A script could then log the value to a .csv file with the filename for examination in a spreadsheet.

 

If there are multiple duplicate sets, then I don't think that this would work.
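As a quick sanity check of the suggestion above: Photoshop's Difference blend mode is an absolute per-pixel difference, so the mean/std.-dev. comparison can be tried in Python with numpy before committing to a Photoshop script. Synthetic arrays stand in for the real files here:

```python
import numpy as np

def difference_stats(master, candidate):
    """Emulate Difference blend mode (absolute per-pixel difference) and
    return the mean and standard deviation of the result."""
    diff = np.abs(master.astype(np.int16) - candidate.astype(np.int16))
    return float(diff.mean()), float(diff.std())

master = np.full((50, 50), 128, dtype=np.uint8)
duplicate = master.copy()
different = master.copy()
different[10:20, 10:40] = 255  # a changed digit, e.g. 1819 vs. 1817

print(difference_stats(master, duplicate))   # (0.0, 0.0) for an exact duplicate
print(difference_stats(master, different))   # clearly larger mean and std. dev.
```

Logging these two numbers per file to a .csv, as suggested, would make the duplicates stand out as the rows closest to zero.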

Community Beginner, Mar 07, 2024

There are multiple duplicate sets.

Community Expert, Mar 07, 2024

Have you looked into third-party software designed to find duplicate files? (I realise that this will likely use OCR or ML/AI, etc.)

Community Beginner, Mar 07, 2024

No, but recently I read about the Python ImageHash library, which can detect duplicate images. Maybe I should try it.
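For what it's worth, ImageHash's simplest algorithm (average hash) can be sketched in a few lines. This is a minimal illustration of the idea, not the library's exact implementation; real use should go through the library itself:

```python
import numpy as np

def average_hash(gray, hash_size=8):
    """Minimal average hash: shrink to hash_size x hash_size by block
    averaging, then set one bit per cell that is above the overall mean."""
    h, w = gray.shape
    ys = np.linspace(0, h, hash_size + 1, dtype=int)
    xs = np.linspace(0, w, hash_size + 1, dtype=int)
    small = np.array([[gray[ys[i]:ys[i + 1], xs[j]:xs[j + 1]].mean()
                       for j in range(hash_size)] for i in range(hash_size)])
    bits = (small > small.mean()).flatten()
    return int("".join("1" if b else "0" for b in bits), 2)

def hamming(h1, h2):
    """Number of differing bits between two hashes (0 = likely duplicates)."""
    return bin(h1 ^ h2).count("1")

img = np.zeros((64, 64), dtype=np.uint8)
img[:, 32:] = 255
print(hamming(average_hash(img), average_hash(img)))  # 0 for identical frames
```

Near-duplicate frames produce hashes with a small Hamming distance, so the 13,000 files could be bucketed by hash and only close pairs inspected by eye.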

Community Expert, Mar 08, 2024

The trouble is the line in your images. They won't create the same hash.

ABAMBO | Hard- and Software Engineer | Photographer

Community Beginner, Mar 08, 2024

I wrote the following Python script, and it works well for my job!

 

import cv2

def calculate_similarity_percentage(file1, file2):
    # Open the GIF files (OpenCV decodes GIFs through VideoCapture)
    gif1 = cv2.VideoCapture(file1)
    gif2 = cv2.VideoCapture(file2)

    # Read the first frame of each GIF, then release the captures
    ok1, frame1 = gif1.read()
    ok2, frame2 = gif2.read()
    gif1.release()
    gif2.release()
    if not (ok1 and ok2):
        raise ValueError(f"Could not read a frame from {file1} or {file2}")

    # Resize frames to a common size
    frame1 = cv2.resize(frame1, (640, 480))
    frame2 = cv2.resize(frame2, (640, 480))

    # Convert frames to grayscale
    gray1 = cv2.cvtColor(frame1, cv2.COLOR_BGR2GRAY)
    gray2 = cv2.cvtColor(frame2, cv2.COLOR_BGR2GRAY)

    # Apply Gaussian blur to the frames
    blurred1 = cv2.GaussianBlur(gray1, (15, 15), 0)
    blurred2 = cv2.GaussianBlur(gray2, (15, 15), 0)

    # Find the absolute difference between the two blurred frames
    diff = cv2.absdiff(blurred1, blurred2)

    # Threshold the difference image
    _, thresholded = cv2.threshold(diff, 30, 255, cv2.THRESH_BINARY)

    # Calculate the difference percentage
    total_pixels = thresholded.size
    non_zero_pixels = cv2.countNonZero(thresholded)
    difference_percentage = (non_zero_pixels / total_pixels) * 100

    # Calculate the similarity percentage
    similarity_percentage = 100 - difference_percentage

    return similarity_percentage

# Provide the paths to your GIF files
file_path1 = r'E:\Desktop\Armies\New folder (2)\00004 copy.gif'
file_path2 = r'E:\Desktop\Armies\New folder (2)\00063 copy.gif'

# Calculate and print the similarity percentage between the two GIF files
similarity_percentage = calculate_similarity_percentage(file_path1, file_path2)
print(f"Similarity Percentage: {similarity_percentage:.2f}%")

 

 

I wrote it with ChatGPT's help, and I will modify it to work with my large number of files!
It is very quick!

The provided script uses image processing techniques to compare two images but does not involve Optical Character Recognition (OCR). Instead, it calculates the similarity percentage based on the differences between two frames from the provided GIF files. Here is a breakdown of the steps:

  1. Reading the GIF files:

    • It reads the GIF files using OpenCV's VideoCapture function.
  2. Resizing and Converting to Grayscale:

    • It resizes the first frame of each GIF to a common size (640x480 pixels).
    • Converts the resized frames to grayscale using cv2.cvtColor().
  3. Blurring:

    • Applies Gaussian blur to the grayscale frames using cv2.GaussianBlur().
  4. Calculating Absolute Difference:

    • Finds the absolute difference between the two blurred frames using cv2.absdiff().
  5. Thresholding:

    • Thresholds the difference image to create a binary image using cv2.threshold().
  6. Calculating Difference Percentage:

    • Counts the non-zero pixels in the thresholded image using cv2.countNonZero().
    • Calculates the difference percentage based on the total and non-zero pixels.
  7. Calculating Similarity Percentage:

    • Calculates the similarity percentage as 100 minus the difference percentage.
  8. Output:

    • Prints the calculated similarity percentage between the two GIF files.

The script essentially compares the visual content of two frames by measuring the difference in pixel values after resizing, blurring, and thresholding. Keep in mind that this method is sensitive to changes in pixel values and may not be suitable for all types of image comparisons, especially when dealing with images that have undergone various transformations or contain text. For more advanced comparisons involving text, OCR or other techniques might be necessary.

Community Beginner, Mar 08, 2024

Now I have written the following script, which checks each pair of GIF files in a folder at very high speed!

import cv2
import os
from concurrent.futures import ProcessPoolExecutor, as_completed
import keyboard  # third-party package, used for the F6 stop key

def calculate_similarity_percentage(file1, file2):
    # Open the GIF files (OpenCV decodes GIFs through VideoCapture)
    gif1 = cv2.VideoCapture(file1)
    gif2 = cv2.VideoCapture(file2)

    # Read the first frame of each GIF, then release the captures
    ok1, frame1 = gif1.read()
    ok2, frame2 = gif2.read()
    gif1.release()
    gif2.release()
    if not (ok1 and ok2):
        raise ValueError(f"Could not read a frame from {file1} or {file2}")

    # Resize frames to a common size
    frame1 = cv2.resize(frame1, (640, 480))
    frame2 = cv2.resize(frame2, (640, 480))

    # Convert frames to grayscale
    gray1 = cv2.cvtColor(frame1, cv2.COLOR_BGR2GRAY)
    gray2 = cv2.cvtColor(frame2, cv2.COLOR_BGR2GRAY)

    # Apply Gaussian blur to the frames
    blurred1 = cv2.GaussianBlur(gray1, (15, 15), 0)
    blurred2 = cv2.GaussianBlur(gray2, (15, 15), 0)

    # Find the absolute difference between the two blurred frames
    diff = cv2.absdiff(blurred1, blurred2)

    # Threshold the difference image
    _, thresholded = cv2.threshold(diff, 30, 255, cv2.THRESH_BINARY)

    # Calculate the difference percentage
    total_pixels = thresholded.size
    non_zero_pixels = cv2.countNonZero(thresholded)
    difference_percentage = (non_zero_pixels / total_pixels) * 100

    # Calculate the similarity percentage
    similarity_percentage = 100 - difference_percentage

    return similarity_percentage

def compare_files_chunk(files_chunk, directory):
    results = []
    for i in range(1, len(files_chunk)):
        file_path1 = os.path.join(directory, files_chunk[i - 1])
        file_path2 = os.path.join(directory, files_chunk[i])
        similarity_percentage = calculate_similarity_percentage(file_path1, file_path2)
        results.append(f"Comparison between {files_chunk[i - 1]} and {files_chunk[i]}: {similarity_percentage:.2f}%")

    return results

def compare_all_files(directory, output_file):
    files = sorted([f for f in os.listdir(directory) if f.lower().endswith('.gif')])
    num_files = len(files)
    chunk_size = min(num_files, os.cpu_count() * 4)  # Adjust the chunk size based on your system's capabilities

    with open(output_file, 'w') as f_out, ProcessPoolExecutor() as executor:
        futures = []

        for i in range(0, num_files, chunk_size):
            # Overlap chunks by one file so the pair that straddles a chunk
            # boundary is still compared
            files_chunk = files[max(i - 1, 0):i + chunk_size]
            future = executor.submit(compare_files_chunk, files_chunk, directory)
            futures.append(future)

        for future in as_completed(futures):
            results = future.result()
            f_out.write('\n'.join(results) + '\n')

            # Check for 'F6' key press to stop the process
            if keyboard.is_pressed('F6'):
                print("Process stopped by user.")
                return

def main():
    directory = r'E:\Desktop\Armies\L1816_2'
    output_file = r'E:\Desktop\Armies\comparison_results.txt'

    try:
        compare_all_files(directory, output_file)
        print("Comparison completed. Results saved in:", output_file)

    except Exception as e:
        print(f"Error: {str(e)}")

if __name__ == "__main__":
    main()

Community Expert, Mar 07, 2024

quote

why I don't want use OCR
OCR is very weak for about 13000 number of GIF files and it may not be accurate or leave some GIFs unscanned!

By @Pubg32486011zfgs

Well, if you want a truly accurate way to do that without OCR, I would say you will need to check each file and note the year in an Excel file. That is the only accurate method. I assume your file names are sequential and your files are not always identical (because of the line), which excludes the very practical option of calculating a checksum.

 

With 13,000 files at a minute per file (which would be a lot), you could do it in about a month (less if you have more people working for you).
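For completeness, the checksum approach that the line differences rule out would look like this in Python; it only groups byte-identical files, which is exactly why it fails when visually identical GIFs differ by the line:

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def find_exact_duplicates(folder):
    """Group GIFs by the MD5 of their raw bytes; only byte-identical
    files end up in the same group."""
    groups = defaultdict(list)
    for path in Path(folder).glob("*.gif"):
        groups[hashlib.md5(path.read_bytes()).hexdigest()].append(path.name)
    return [names for names in groups.values() if len(names) > 1]
```

Since the GIFs here differ by the line even when the year is the same, this would report almost no duplicates; it is shown only to make the exclusion concrete.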

 

 

ABAMBO | Hard- and Software Engineer | Photographer

Community Expert, Mar 07, 2024

How many different years are there? If there are only a few, you could create a mask for each year, mask out the lines, and save the resulting file as a PNG. Files with the same size are probably the same.
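A quick way to test this idea before scripting Photoshop: blank out the (hypothetical) line region and compare the zlib-compressed size of what remains. PNG compression is zlib-based, and identical masked content compresses to the identical size:

```python
import zlib
import numpy as np

def masked_size(frame, line_rows=(40, 45)):
    """Blank out the (hypothetical) line region, then use the zlib-compressed
    byte length as a stand-in for the PNG file size the masked image would have."""
    masked = frame.copy()
    masked[line_rows[0]:line_rows[1], :] = 0
    return len(zlib.compress(masked.tobytes()))

# Frames that differ only on the masked line compress to the same size
a = np.zeros((60, 60), dtype=np.uint8)
a[10:20, 10:50] = 255
b = a.copy()
b[42, :] = 99  # change falls inside rows 40:45, which get masked out
print(masked_size(a) == masked_size(b))  # True
```

Note that equal size only suggests equal content; for 13,000 files, equal-sized groups would still need a pixel-level check to confirm.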

ABAMBO | Hard- and Software Engineer | Photographer

Community Beginner, Mar 07, 2024

From 1816 to 2023.

Community Expert, Mar 08, 2024

It would be faster to recreate those files. 

ABAMBO | Hard- and Software Engineer | Photographer
