I have a large number of GIF files in a folder, and each GIF includes a year number like the following:
Among the GIFs in that folder, each year may be repeated hundreds of times.
Now I want to detect duplicate GIF files with Photoshop, without OCR.
Does anyone have an idea how to do this?
For example: open each GIF file in Photoshop, select the whole year number, save specific characteristics or coordinates of the selected section to a txt file, then compare all of those characteristics or coordinates from the txt file and find the duplicate GIFs.
Note: please don't ask why I want to do this, what the source of these GIF files is, or why I don't want to use OCR.
OCR is too weak for about 13,000 GIF files; it may not be accurate and may leave some GIFs unscanned!
I have attached 10 GIF files to this post for testing.
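Roughly what I have in mind, sketched in Python with OpenCV rather than Photoshop (this is only an illustration; the crop coordinates below are placeholders and would have to be adjusted to wherever the year actually sits in these GIFs):
import os
import hashlib
import cv2

def year_region_signature(path, region=(0, 0, 120, 40)):
    """Crop a fixed region (assumed to contain the year) from the first
    frame of a GIF and return a hash of its thresholded pixels."""
    x, y, w, h = region
    cap = cv2.VideoCapture(path)
    ok, frame = cap.read()
    cap.release()
    if not ok:
        return None
    crop = cv2.cvtColor(frame[y:y + h, x:x + w], cv2.COLOR_BGR2GRAY)
    _, crop = cv2.threshold(crop, 128, 255, cv2.THRESH_BINARY)
    return hashlib.md5(crop.tobytes()).hexdigest()

folder = r'E:\Desktop\Armies\L1816_2'   # placeholder folder of GIFs
groups = {}
for name in sorted(os.listdir(folder)):
    if name.lower().endswith('.gif'):
        sig = year_region_signature(os.path.join(folder, name))
        groups.setdefault(sig, []).append(name)

# Write the signatures to a txt file; entries sharing a signature are duplicate candidates
with open('signatures.txt', 'w') as f:
    for sig, names in groups.items():
        f.write(f"{sig}: {', '.join(names)}\n")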
Is there only one duplicate number? 1817?
If that is the target number, then perhaps a script layering each file over a single target master file in difference blend mode, and then comparing the histogram mean or std. dev. value, may identify significant differences between numbers such as 1819 compared to 1817. Duplicate images would have a smaller value than non-duplicates.
A script could then log the value to a .csv file with the filename for examination in a spreadsheet.
If there are multiple duplicate sets, then I don't think that this would work.
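A rough Python equivalent of that measurement, for comparison (the suggestion above is a Photoshop script using difference blend mode; this sketch just uses OpenCV's absdiff against an assumed master file and logs the mean and std. dev. to a CSV):
import csv
import os
import cv2

master_path = r'E:\Desktop\Armies\master_1817.gif'   # hypothetical master file
folder = r'E:\Desktop\Armies\L1816_2'                # placeholder folder of GIFs

cap = cv2.VideoCapture(master_path)
_, master = cap.read()
cap.release()
master = cv2.cvtColor(cv2.resize(master, (640, 480)), cv2.COLOR_BGR2GRAY)

with open('difference_log.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['file', 'mean_diff', 'std_diff'])
    for name in sorted(os.listdir(folder)):
        if not name.lower().endswith('.gif'):
            continue
        cap = cv2.VideoCapture(os.path.join(folder, name))
        ok, frame = cap.read()
        cap.release()
        if not ok:
            continue
        gray = cv2.cvtColor(cv2.resize(frame, (640, 480)), cv2.COLOR_BGR2GRAY)
        diff = cv2.absdiff(master, gray)   # same idea as difference blend mode
        # Files matching the master year should show a noticeably lower mean/std
        writer.writerow([name, f"{diff.mean():.2f}", f"{diff.std():.2f}"])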
There are multiple duplicate sets.
Have you looked into third-party software designed to find duplicate files (I realise that this will likely use OCR or ML/AI, etc.)?
No, but yesterday I read something about the Python ImageHash library, which can detect duplicate images! Maybe I should try it.
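For reference, a minimal ImageHash sketch (a perceptual hash on the first frame, grouping files whose hashes are within a small Hamming distance, since near-identical images rarely hash to exactly the same value; the folder path and threshold are only placeholders):
import os
from PIL import Image
import imagehash

folder = r'E:\Desktop\Armies\L1816_2'   # placeholder folder of GIFs
hashes = []
for name in sorted(os.listdir(folder)):
    if name.lower().endswith('.gif'):
        with Image.open(os.path.join(folder, name)) as im:
            im.seek(0)   # first frame of the GIF
            hashes.append((name, imagehash.phash(im.convert('L'))))

# Report pairs whose perceptual hashes are within a small Hamming distance
threshold = 5
for i in range(len(hashes)):
    for j in range(i + 1, len(hashes)):
        if hashes[i][1] - hashes[j][1] <= threshold:
            print(f"Possible duplicate: {hashes[i][0]} <-> {hashes[j][0]}")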
The trouble is the line in your images. They won't create the same hash.
I wrote the following Python script for it, and it works well for my job!
import cv2
import numpy as np

def calculate_similarity_percentage(file1, file2):
    # Read the GIF files
    gif1 = cv2.VideoCapture(file1)
    gif2 = cv2.VideoCapture(file2)
    # Read the first frame of each GIF
    _, frame1 = gif1.read()
    _, frame2 = gif2.read()
    # Resize frames to a common size
    frame1 = cv2.resize(frame1, (640, 480))
    frame2 = cv2.resize(frame2, (640, 480))
    # Convert frames to grayscale
    gray1 = cv2.cvtColor(frame1, cv2.COLOR_BGR2GRAY)
    gray2 = cv2.cvtColor(frame2, cv2.COLOR_BGR2GRAY)
    # Apply Gaussian blur to the frames
    blurred1 = cv2.GaussianBlur(gray1, (15, 15), 0)
    blurred2 = cv2.GaussianBlur(gray2, (15, 15), 0)
    # Find the absolute difference between the two blurred frames
    diff = cv2.absdiff(blurred1, blurred2)
    # Threshold the difference image
    _, thresholded = cv2.threshold(diff, 30, 255, cv2.THRESH_BINARY)
    # Calculate the difference percentage
    total_pixels = thresholded.size
    non_zero_pixels = cv2.countNonZero(thresholded)
    difference_percentage = (non_zero_pixels / total_pixels) * 100
    # Calculate the similarity percentage
    similarity_percentage = 100 - difference_percentage
    return similarity_percentage

# Provide the paths to your GIF files
file_path1 = r'E:\Desktop\Armies\New folder (2)\00004 copy.gif'
file_path2 = r'E:\Desktop\Armies\New folder (2)\00063 copy.gif'

# Calculate and print the similarity percentage between the two GIF files
similarity_percentage = calculate_similarity_percentage(file_path1, file_path2)
print(f"Similarity Percentage: {similarity_percentage:.2f}%")
I wrote it using ChatGPT, and I will modify it to work with my large number of files!
It is very quick!
The provided script uses image processing techniques to compare two images but does not involve Optical Character Recognition (OCR). Instead, it calculates the similarity percentage based on the differences between two frames from the provided GIF files. Here is a breakdown of the steps:
Reading the GIF files: the first frame of each GIF is loaded with cv2.VideoCapture.
Resizing and converting to grayscale: both frames are resized to 640x480 and converted to grayscale.
Blurring: a 15x15 Gaussian blur smooths out noise before the comparison.
Calculating the absolute difference: cv2.absdiff produces a per-pixel difference image.
Thresholding: differences above 30 are kept as changed pixels.
Calculating the difference percentage: the share of non-zero pixels in the thresholded image.
Calculating the similarity percentage: 100 minus the difference percentage.
Output: the similarity percentage is printed with two decimal places.
The script essentially compares the visual content of two frames by measuring the difference in pixel values after resizing, blurring, and thresholding. Keep in mind that this method is sensitive to changes in pixel values and may not be suitable for all types of image comparisons, especially when dealing with images that have undergone various transformations or contain text. For more advanced comparisons involving text, OCR or other techniques might be necessary.
Now I have written the following script that checks pairs of GIF files in a folder at very high speed!
import cv2
import numpy as np
import os
from concurrent.futures import ProcessPoolExecutor, as_completed
import keyboard

def calculate_similarity_percentage(file1, file2):
    # Read the GIF files
    gif1 = cv2.VideoCapture(file1)
    gif2 = cv2.VideoCapture(file2)
    # Read the first frame of each GIF
    _, frame1 = gif1.read()
    _, frame2 = gif2.read()
    # Resize frames to a common size
    frame1 = cv2.resize(frame1, (640, 480))
    frame2 = cv2.resize(frame2, (640, 480))
    # Convert frames to grayscale
    gray1 = cv2.cvtColor(frame1, cv2.COLOR_BGR2GRAY)
    gray2 = cv2.cvtColor(frame2, cv2.COLOR_BGR2GRAY)
    # Apply Gaussian blur to the frames
    blurred1 = cv2.GaussianBlur(gray1, (15, 15), 0)
    blurred2 = cv2.GaussianBlur(gray2, (15, 15), 0)
    # Find the absolute difference between the two blurred frames
    diff = cv2.absdiff(blurred1, blurred2)
    # Threshold the difference image
    _, thresholded = cv2.threshold(diff, 30, 255, cv2.THRESH_BINARY)
    # Calculate the difference percentage
    total_pixels = thresholded.size
    non_zero_pixels = cv2.countNonZero(thresholded)
    difference_percentage = (non_zero_pixels / total_pixels) * 100
    # Calculate the similarity percentage
    similarity_percentage = 100 - difference_percentage
    return similarity_percentage

def compare_files_chunk(files_chunk, directory):
    results = []
    for i in range(1, len(files_chunk)):
        file_path1 = os.path.join(directory, files_chunk[i - 1])
        file_path2 = os.path.join(directory, files_chunk[i])
        similarity_percentage = calculate_similarity_percentage(file_path1, file_path2)
        results.append(f"Comparison between {files_chunk[i - 1]} and {files_chunk[i]}: {similarity_percentage:.2f}%")
    return results

def compare_all_files(directory, output_file):
    files = sorted([f for f in os.listdir(directory) if f.lower().endswith('.gif')])
    num_files = len(files)
    chunk_size = min(num_files, os.cpu_count() * 4)  # Adjust the chunk size based on your system's capabilities
    with open(output_file, 'w') as f_out, ProcessPoolExecutor() as executor:
        futures = []
        for i in range(0, num_files, chunk_size):
            files_chunk = files[i:i + chunk_size]
            future = executor.submit(compare_files_chunk, files_chunk, directory)
            futures.append(future)
        for future in as_completed(futures):
            results = future.result()
            f_out.write('\n'.join(results) + '\n')
            # Check for 'F6' key press to stop the process
            if keyboard.is_pressed('F6'):
                print("Process stopped by user.")
                return

def main():
    directory = r'E:\Desktop\Armies\L1816_2'
    output_file = r'E:\Desktop\Armies\comparison_results.txt'
    try:
        compare_all_files(directory, output_file)
        print("Comparison completed. Results saved in:", output_file)
    except Exception as e:
        print(f"Error: {str(e)}")

if __name__ == "__main__":
    main()
why I don't want to use OCR
OCR is too weak for about 13,000 GIF files; it may not be accurate and may leave some GIFs unscanned!
By @Pubg32486011zfgs
Well, if you want a really accurate way to do that without OCR, I would say you will need to check each file and note the year in an Excel file. That will be the only accurate method. I suppose your file names are continuous and your files are not always the same (the line), which excludes the very practical option of calculating a checksum.
With 13,000 files and a minute per file (which would be a lot), you could do it in a month's time (less if you have more people working for you).
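(For completeness: if any of the files did happen to be byte-identical copies, a plain checksum pass would still catch those cheaply, for example with Python's hashlib as sketched below; it just won't help when duplicates differ by the line. The folder path is a placeholder.)
import hashlib
import os
from collections import defaultdict

folder = r'E:\Desktop\Armies\L1816_2'   # placeholder folder of GIFs
by_digest = defaultdict(list)
for name in sorted(os.listdir(folder)):
    if name.lower().endswith('.gif'):
        with open(os.path.join(folder, name), 'rb') as f:
            by_digest[hashlib.md5(f.read()).hexdigest()].append(name)

# Any digest shared by more than one file marks byte-identical copies
for digest, names in by_digest.items():
    if len(names) > 1:
        print("Byte-identical:", ", ".join(names))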
How many different years are there? If you have only a few years, you could create a mask for each year, mask out the lines, and save the resulting file as a PNG. Files with the same size are probably the same.
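A rough Python illustration of that mask-and-compare-size idea (the masked rectangle below is only a placeholder for whatever region hides the varying lines; files whose masked PNGs come out the same size would then be checked by hand):
import os
from collections import defaultdict
import cv2

folder = r'E:\Desktop\Armies\L1816_2'        # placeholder folder of GIFs
out_dir = r'E:\Desktop\Armies\masked_pngs'   # placeholder output folder
os.makedirs(out_dir, exist_ok=True)

by_size = defaultdict(list)
for name in sorted(os.listdir(folder)):
    if not name.lower().endswith('.gif'):
        continue
    cap = cv2.VideoCapture(os.path.join(folder, name))
    ok, frame = cap.read()
    cap.release()
    if not ok:
        continue
    frame[100:200, 0:frame.shape[1]] = 0   # placeholder mask over the line area
    out_path = os.path.join(out_dir, name + '.png')
    cv2.imwrite(out_path, frame)
    by_size[os.path.getsize(out_path)].append(name)

# Files producing masked PNGs of identical size are duplicate candidates
for size, names in by_size.items():
    if len(names) > 1:
        print(f"{size} bytes: {', '.join(names)}")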
From 1816 to 2023.
It would be faster to recreate those files.