Hash an image without antialiasing

I am hashing images using MD5 in order to be able to identify duplicate images being uploaded. On the whole this has been working quite well; however, I have recently noticed duplicate images with different hashes.

I have two images which appear on the surface to be identical, but when compared using an online tool they seem to be different. In colour the images are 0.4% different, in black & white they are 0.1% different, and when “Ignore antialiasing” is selected they are identical.

So what I am wanting to know is: how can I generate an image hash without this “antialiasing”?

I have installed ImageMagick on the server and am now hashing images like this:

// Read the uploaded file and compute ImageMagick's pixel-based signature.
$imagick_type = new Imagick();
$file_to_grab = "60a4ogfrd565no2dfzctfy4gpwaqib1pcct0s9s6ezb2nef3es.jpg";
// Open read-only in binary mode; 'a+' (append) is not needed just to read the file.
$file_handle_for_viewing_image_file = fopen($file_to_grab, 'rb');
$imagick_type->readImageFile($file_handle_for_viewing_image_file);
// getImageSignature() hashes the decoded pixel data, not the raw file bytes.
$imagick_type_signature = $imagick_type->getImageSignature();
print($imagick_type_signature);

The 2 images in question though are still returning different hashes.

The purpose of a hash is to tell whether there has been a change made. Even the smallest change can result in a completely different hash. Completely different files may produce the same hash (although this is very unlikely).

To ignore certain types of differences in files you would need to ensure that the process that created the difference is applied to the other copy before generating the hash.

That’s kind of a pointless answer since I don’t have control over the images before they are uploaded. When hashing the image files using MD5 I find some identical images match and some don’t. I have narrowed this down to the antialiasing within the image. Therefore I need a way to hash the image while ignoring the antialiasing. Clearly it’s possible because the online image comparison tool I’ve used can do it.

Antialiasing is modified pixels; how will you ignore them?

Are you comparing uploaded images to other images in a folder so there could be hundreds of images or just comparing one or two?

You can do comparisons with Imagemagick but I do not know if the features are included in Imagick: http://www.imagemagick.org/Usage/compare/

Edit: A section on duplicate images on the above page: http://www.imagemagick.org/Usage/compare/#doubles
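
If Imagick’s compareImages() is available on your build, a minimal sketch might look like this (the file names are placeholders and the 0.001 threshold is just a guess to tune for your own images):

// Minimal sketch: compare two images with Imagick's built-in compareImages().
// The file names are placeholders and the 0.001 threshold is a made-up number.
$a = new Imagick('upload_one.jpg');
$b = new Imagick('upload_two.jpg');

// Both images must share dimensions for the comparison to make sense,
// so scale the second one to match the first.
$b->resizeImage($a->getImageWidth(), $a->getImageHeight(), Imagick::FILTER_LANCZOS, 1);

// Returns array(difference image, numeric distortion metric).
list($diffImage, $distortion) = $a->compareImages($b, Imagick::METRIC_MEANSQUAREERROR);

echo $distortion < 0.001 ? "Probably duplicates\n" : "Different images\n";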


I would be comparing hundreds of thousands of images (increasing daily), all in different folders. So really I need to compare them by hash, not by file.

This is the site I used to compare the images, and with “Ignore antialiasing”, the 2 sample images I used were identical.

https://huddle.github.io/Resemble.js/

If that code does what you want, why not use it? Or can you reverse-engineer it for what you want to do?

Also they are only comparing two simple images; how good is their code on thousands of images?

Perhaps you need a bit of lateral thought here; what about creating a simple intermediate image of, say, just the edges and a reduced number of colours? Although if you have two images that are the same but one has been resized, it would slip past the check, so you would need to resize them all to a standard size first. But if somebody has cropped one etc…

Basically what I am doing is when an image is uploaded, I want to hash that image and store the hash with SQL. When another image is uploaded, I hash that image too and store that hash within SQL. What I want to be able to do is run a query to find duplicate images based on their hash.

The MD5 has been working okay, but it’s file based not image based, which means the hash might change even though the image doesn’t. I want a more image-based hash which works regardless of EXIF or AA.

I think Rubble hit on the approach you will need if you are to have any chance of success.

eg.

file uploaded
make (temporary?) “standardized” copy

  • base color palette (grey scale?)
  • base dimension (all have same width or height)
  • no EXIF
  • other?

generate hash of the copy (rough sketch below)
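
A minimal sketch of that pipeline, assuming Imagick and JPEG uploads (the 64x64 size, greyscale and PNG re-encode are illustrative choices, not requirements):

// Sketch of a "standardized copy" hash; all settings below are illustrative guesses.
function standardized_hash($path)
{
    $img = new Imagick($path);

    // Drop EXIF and other profiles so metadata never affects the hash.
    $img->stripImage();

    // Reduce to greyscale so colour differences do not matter.
    $img->transformImageColorspace(Imagick::COLORSPACE_GRAY);

    // Force a common size so resized copies of the same image line up.
    $img->resizeImage(64, 64, Imagick::FILTER_LANCZOS, 1);

    // Re-encode to one fixed format and hash the resulting bytes.
    $img->setImageFormat('png');
    return md5($img->getImageBlob());
}

echo standardized_hash('upload.jpg');

Bear in mind that two saves of the “same” JPEG can still decode to slightly different pixels, so even this standardized copy may not always hash identically.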

I would agree, but give me a clue…

At a guess, GD or ImageMagick?

Andy, anti-aliasing is part of the image. It’s not some metadata you can just throw away. If you look at how Resemble.js does it, it turns out they go through the image pixel by pixel, and if any has a high contrasting neighbor, then it decides to ignore that pixel in its comparison. (Which means, by the way, that it will ignore a lot more than just anti-aliasing, as well as miss some real anti-aliasing.)
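
To make that concrete, here is a rough sketch of that kind of neighbour-contrast test (the threshold is an arbitrary illustration, not the value Resemble.js actually uses):

// Rough sketch of the "skip pixels with a high-contrast neighbour" idea.
// The threshold of 64 is an arbitrary illustration.
function contrast_mask(Imagick $img, $threshold = 64)
{
    $img = clone $img;
    $img->transformImageColorspace(Imagick::COLORSPACE_GRAY);
    $w = $img->getImageWidth();
    $h = $img->getImageHeight();

    // One grey value (0-255) per pixel, row by row.
    $gray = $img->exportImagePixels(0, 0, $w, $h, 'I', Imagick::PIXEL_CHAR);

    $ignore = array();
    for ($y = 1; $y < $h - 1; $y++) {
        for ($x = 1; $x < $w - 1; $x++) {
            $v = $gray[$y * $w + $x];
            // Compare against the four direct neighbours.
            $neighbours = array(
                $gray[($y - 1) * $w + $x],
                $gray[($y + 1) * $w + $x],
                $gray[$y * $w + $x - 1],
                $gray[$y * $w + $x + 1],
            );
            foreach ($neighbours as $n) {
                if (abs($v - $n) > $threshold) {
                    $ignore["$x,$y"] = true; // likely an edge / anti-aliased pixel
                    break;
                }
            }
        }
    }
    return $ignore; // pixels a comparison would skip
}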

So what you’re actually asking for is for images that are not exactly the same to nonetheless compare as if they were exactly the same. And, to be frank, this is a huge can of worms. There are many ways an image might look the same to the human eye but nonetheless be technically different. If just one pixel is different by just one bit, our eyes would never notice, but it would technically be a different image with a different hash. And as you constantly try to chase down minute differences, you also run the risk of false positives. That is, you might end up ignoring part of an image that shouldn’t have been ignored.


So I either need to come up with a method similar to Resemble.js, which is too big a task I think, or go back to the drawing board?

Actually this had me thinking: what if I just took the first line of pixels and hashed those? I could use ImageMagick’s getImagePixelColor function to grab the colour of pixel 1x1, then 1x2, then 1x3, all the way across the image, which will give me a massive list of RGB colours. I stick all those into a string and hash that string. It’s not going to be accurate, but I guess it’s more accurate than the MD5 method. Plus if I really wanted to get fancy I could go down either side of the image and across the bottom. In theory this would work, right?
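
Something along these lines, perhaps (an untested sketch; getImagePixelColor() and getColor() are real Imagick/ImagickPixel methods, the rest is just one way to wire it up):

// Rough sketch of hashing just the top row of pixels.
// Untested, and still vulnerable to compression noise changing the values.
function top_row_hash($path)
{
    $img = new Imagick($path);
    $width = $img->getImageWidth();

    $colours = '';
    for ($x = 0; $x < $width; $x++) {
        // getImagePixelColor() returns an ImagickPixel for that coordinate.
        $pixel = $img->getImagePixelColor($x, 0);
        $rgb = $pixel->getColor(); // array('r' => ..., 'g' => ..., 'b' => ..., 'a' => ...)
        $colours .= $rgb['r'] . ',' . $rgb['g'] . ',' . $rgb['b'] . ';';
    }

    // Hash the long RGB string instead of the raw file bytes.
    return md5($colours);
}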

AndyUK, you’ve been misunderstood here because the term “antialiasing” is being used in the wrong way. Only after looking at a live sample of how “ignore antialiasing” works does it become clear that it is not really about antialiasing. A better name for this comparison would be “ignore image artefacts” or “ignore JPEG artefacts”, because that is what is actually happening: the algorithm ignores small differences introduced by compression noise, so that two copies of the same image saved at different JPEG compression levels are treated as identical.

I imagine doing this kind of comparison would be very difficult or even impossible to achieve with a hash. The images to be hashed would first need to be downgraded to some simpler form (reduced colours, size, etc.), which is what Mittineague suggested above. However, because you are not comparing anything at the moment of upload, you can never be sure that a different but very similar image will produce the same hash: there is always some threshold of difference beyond which the simplified image, and therefore the hash, changes, and you never know where that threshold lies. This would be unreliable because you could not control how much difference to allow; the effective tolerance would vary from image to image.

However, when you actually have 2 images to compare, then you can use some clever algorithms that will calculate the difference factor for you. For example, you compare pixel by pixel, compute the exact colour difference of each, and then sum up all the pixel differences to get an overall difference factor. To this you can add a rule that pixels close enough in colour are treated as equal, which in effect gives a certain leeway for ignoring compression artefacts, and you could achieve exactly what you are after. But of course, this would be much slower than just a hash.
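
Something like this, for instance (an untested sketch; the tolerance is an arbitrary number and both images are assumed to already have the same dimensions):

// Untested sketch of a pixel-by-pixel "difference factor" with a tolerance.
// Assumes both images already share dimensions; the tolerance of 10 is arbitrary.
function image_difference_factor($pathA, $pathB, $tolerance = 10)
{
    $a = new Imagick($pathA);
    $b = new Imagick($pathB);
    $w = $a->getImageWidth();
    $h = $a->getImageHeight();

    // Export the raw RGB bytes for every pixel in one call.
    $pa = $a->exportImagePixels(0, 0, $w, $h, 'RGB', Imagick::PIXEL_CHAR);
    $pb = $b->exportImagePixels(0, 0, $w, $h, 'RGB', Imagick::PIXEL_CHAR);

    $total = 0;
    for ($i = 0, $n = count($pa); $i < $n; $i++) {
        $diff = abs($pa[$i] - $pb[$i]);
        // Channel values within the tolerance are treated as equal,
        // which is what lets small compression artefacts be ignored.
        if ($diff > $tolerance) {
            $total += $diff;
        }
    }
    return $total; // 0 (or close to it) means "effectively the same image"
}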

Another idea would be to use numbers instead of hashes. In other words, come up with an algorithm that reduces every image to a number calculated from its pixels. The goal of this algorithm would be to produce numbers that describe the images, so that the more two images differ, the bigger the difference between their numbers: same images would have the same number, very similar images would have numbers very close to each other, and different images would have totally different numbers. The numbers could be huge, like BIGINT. Then you could simply compare the numbers in the database like hashes, but numerically; for example, if the difference is less than 1000 you treat the images as identical. This idea is untested, but I’m sure it would work well provided you were able to find or write a good algorithm for turning images into numbers.

It would also be interesting to learn how Google implemented their image search - their algorithm certainly ignores JPEG artefacts and can recognize images of different sizes and I think it can recognize crops, too. I don’t know if this information is freely available but I think they must have some clever mechanisms for that.

Likewise, it would be interesting to know how TinEye does it.

For example, looking for felgall’s avatar gives 5 results: all the “same” image, but some with different dimensions and one in grey scale.

https://tineye.com/search/30cfb34a4cdda0eff54e6fe8f332167f72e0b087/?pluginver=firefox-1.2.1

Mine gives 3 results: 1 blue and 2 green, one of which is slightly smaller.

https://tineye.com/search/358bf7125a52c0c9642a7b701afcc97b15eeb62a/?pluginver=firefox-1.2.1

I have just short of a million images right now, and this is growing daily, so using an application to compare images directly wouldn’t work. I would have to take image 1, compare it against 1 million+ images, and then repeat this for every image. That would make 1,000,000,000,000 checks just based on what I have now.

Today I have tried working with the images on a pixel-by-pixel basis. I basically tried taking the colour of every pixel within a 10x10 square in the top left of the image and made a calculation based on the colours, but I quickly discovered even the 0x0 pixel was different.

Is there any way to find out from the JPG if and what compression is used?

Sometimes - if the generating program writes it to the metadata. But even if you knew the compression level it would be useless in this case because different programs can produce different results for the same compression level because they may be using different algorithms.

BTW, read this article: Detecting similar and identical images using perceptual hashes - this should get you on the right path in your thinking!
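
If it helps, the “average hash” (aHash) described in articles like that one is small enough to sketch with Imagick. Treat this as a starting point rather than a finished implementation; the 8x8 size is the usual choice for aHash and the distance threshold in the comment is only a ballpark:

// Sketch of an "average hash" (aHash) style perceptual hash.
// Similar images produce hashes that differ in only a few bits,
// so duplicates are found by Hamming distance rather than exact equality.
function average_hash($path)
{
    $img = new Imagick($path);
    $img->transformImageColorspace(Imagick::COLORSPACE_GRAY);
    $img->resizeImage(8, 8, Imagick::FILTER_LANCZOS, 1);

    // 64 grey values, one per pixel of the 8x8 thumbnail.
    $pixels = $img->exportImagePixels(0, 0, 8, 8, 'I', Imagick::PIXEL_CHAR);
    $average = array_sum($pixels) / count($pixels);

    // One bit per pixel: brighter than average or not.
    $bits = '';
    foreach ($pixels as $p) {
        $bits .= ($p >= $average) ? '1' : '0';
    }
    return $bits; // store this 64-character string in SQL
}

// Hamming distance: how many bits differ between two hashes.
function hash_distance($hashA, $hashB)
{
    $distance = 0;
    for ($i = 0; $i < strlen($hashA); $i++) {
        if ($hashA[$i] !== $hashB[$i]) {
            $distance++;
        }
    }
    return $distance; // e.g. treat a small distance as "probably the same image"
}

The catch is that near-duplicates are then found by counting differing bits rather than by an exact string match in SQL.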


An interesting article @Lemon_Juice thanks for the link.

As mentioned, if @AndyUK is saving a fingerprint of an image, a hash is not helpful, as you cannot use a “tolerance” on what is basically a string.

Perfect!

Thanks guys
