
ImageNet contains naturally occurring Apple NeuralHash collisions

criticaltinker 2021-08-19 17:00:41 +0000 UTC [ - ]

> In order to test things, I decided to search the publicly available ImageNet dataset for collisions between semantically different images. I generated NeuralHashes for all 1.43 million images and searched for organic collisions. By taking advantage of the birthday paradox, and a collision search algorithm that let me search in n(log n) time instead of the naive n^2, I was able to compare the NeuralHashes of over 2 trillion image pairs in just a few hours.

> This is a false-positive rate of 2 in 2 trillion image pairs (1,431,168^2). Assuming the NCMEC database has more than 20,000 images, this represents a slightly higher rate than Apple had previously reported. But, assuming there are less than a million images in the dataset, it's probably in the right ballpark.
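
The post doesn't show the search code, but a sort-and-scan pass along these lines would give the quoted n log n behaviour (a minimal sketch; `hashes` stands in for a precomputed list of (hex_hash, filename) tuples):

    def find_collisions(hashes):
        """Sort the (hash, filename) pairs, then compare adjacent entries.

        Sorting is O(n log n) and the scan is O(n), so all ~n^2 image
        pairs are covered without ever being compared explicitly.
        """
        ordered = sorted(hashes)
        collisions = []
        for (h1, f1), (h2, f2) in zip(ordered, ordered[1:]):
            if h1 == h2 and f1 != f2:  # same NeuralHash, different image
                collisions.append((h1, f1, f2))
        return collisions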

It's great to see the ingenuity and attention this whole debacle is receiving from the community. Maybe it will lead to advances in perceptual hashing (and also advances in consumer awareness of tech-related privacy issues).

bawolff 2021-08-19 17:06:49 +0000 UTC [ - ]

ImageNet is a very well-known dataset. Are we sure Apple didn't test on it when designing this algorithm?

endisneigh 2021-08-19 17:06:02 +0000 UTC [ - ]

It doesn’t matter if there are collisions if the two images don’t actually look the same. Do people honestly believe a single CSAM flag from an “innocent” image is going to result in someone going to prison in America?

PhotoDNA has existed for over a decade doing the same thing, with no such instances that I have heard of.

If some corrupt government wants to get you, they don't need this. They can just unilaterally say you've done something bad without evidence and imprison you. It happens all the time. It's even happened in America. Just look up DNA exonerations - people have had DNA evidence at the scene that literally proves their innocence, and they're still locked up.

bawolff 2021-08-19 17:08:46 +0000 UTC [ - ]

If the two images looked the same, then the expected behaviour is a collision, so if collisions matter at all, it would only be for pictures that look different.

minitoar 2021-08-19 17:08:04 +0000 UTC [ - ]

The damage is already done by the time it gets to the point of devices being confiscated.

endisneigh 2021-08-19 17:09:00 +0000 UTC [ - ]

Can you point to a single instance of that happening where it was due to a false positive?

alfalfasprout 2021-08-19 16:59:21 +0000 UTC [ - ]

There's been a lot of focus on the likelihood of collisions and whether someone could, e.g., plant an image with a matching hash on your device to "set you up", etc. But what's still extremely concerning is that there is still no guarantee that the hash list used can't be co-opted for another purpose (e.g., politically sensitive content).

criticaltinker 2021-08-19 17:03:32 +0000 UTC [ - ]

The OP mentions that two countries have to agree to add a file to the list, but your concern is definitely valid:

> Perhaps the most concerning part of the whole scheme is the database itself. Since the original images are (understandably) not available for inspection, it's not obvious how we can trust that a rogue actor (like a foreign government) couldn't add non-CSAM hashes to the list to root out human rights advocates or political rivals. Apple has tried to mitigate this by requiring two countries to agree to add a file to the list, but the process for this seems opaque and ripe for abuse.

cat199 2021-08-19 17:05:51 +0000 UTC [ - ]

this also punts the debate to the checking process, rather than to the fact that there even is a process to begin with...

version_five 2021-08-19 17:08:37 +0000 UTC [ - ]

Yes, I was about to say the same thing. Hash collisions are an extra concern about what Apple is doing, but even if the hashes were as collision-free as cryptographic hashes, that would not make the invasion of privacy OK. The technical discussion is something that Apple can easily parry, and it's the wrong framing of this debate.

matsemann 2021-08-19 17:03:37 +0000 UTC [ - ]

> By taking advantage of the birthday paradox, and a collision search algorithm that let me search in n(log n) time instead of the naive n^2

Someone got more details on that? How does the birthday paradox come into play here?
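
Roughly, as I read the quoted post: the number of image pairs grows quadratically with the number of images, so even a vanishingly small per-pair collision probability yields a few expected hits once you have ~2 trillion pairs - that's the birthday-paradox effect. And the search never enumerates those pairs; sorting the hashes and comparing neighbours covers them all, which is where the n log n comes from. A back-of-the-envelope check using the post's own numbers:

    N = 1_431_168         # images hashed, per the quoted post
    pairs = N * N         # ~2.05e12 ordered pairs, the post's "2 trillion"
    per_pair = 2 / pairs  # 2 organic collisions were found
    print(f"{pairs:.3g} pairs, ~{per_pair:.1g} collisions per pair")
    # -> 2.05e+12 pairs, ~1e-12 collisions per pair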

JimBlackwood 2021-08-19 17:04:16 +0000 UTC [ - ]

It's actually really cool to see that those small changes still generate the same hash. At least that part is working well.
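
NeuralHash itself isn't shipped as a library, but that robustness to small changes is easy to poke at with any perceptual hash. A stand-in sketch using the imagehash package's pHash (not Apple's algorithm; photo.jpg is a placeholder path):

    import io
    import imagehash
    from PIL import Image

    original = Image.open("photo.jpg").convert("RGB")  # placeholder path

    # A "small change": re-encode the image at a lower JPEG quality.
    buf = io.BytesIO()
    original.save(buf, format="JPEG", quality=70)
    buf.seek(0)
    recompressed = Image.open(buf)

    h1 = imagehash.phash(original)
    h2 = imagehash.phash(recompressed)
    print(h1, h2, h1 - h2)  # subtracting gives the Hamming distance; typically 0 or near 0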