Amazon face recognition falsely matches 105 US and UK politicians with police mugshots, but can you trust accuracy claims?

In July 2018, the American Civil Liberties Union conducted a test using Amazon’s face recognition tool, “Rekognition”, to match photos of US Congress members against mugshots of people arrested for a crime. The ACLU found 28 false matches, highlighting the shortcomings of face recognition technology that’s being peddled to law enforcement agencies nationwide.

So, has it gotten any better?

Not much, according to our latest experiment.

Curious whether, and how quickly, face recognition is improving, Comparitech decided to conduct a similar study almost two years later. We also added UK politicians into the mix, for a total of 1,959 lawmakers.

Results

We split the results between US and UK politicians. But before we discuss results, let’s first review the fulcrum on which all of these tests pivot: confidence thresholds.

Confidence thresholds

When Rekognition compares two images, it doesn't simply return a yes or no answer. Instead, it reports a similarity score as a percentage. The higher the percentage, the more confident Rekognition is that the two images show the same person.
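For the curious, here is a minimal sketch of what a single comparison looks like through the Rekognition API via boto3. The file names are placeholders, and this is an illustration rather than our exact test harness:

    import boto3

    rekognition = boto3.client("rekognition")

    # Compare a politician's portrait against a single arrest photo.
    # The file names here are hypothetical.
    with open("politician.jpg", "rb") as source, open("mugshot.jpg", "rb") as target:
        response = rekognition.compare_faces(
            SourceImage={"Bytes": source.read()},
            TargetImage={"Bytes": target.read()},
            SimilarityThreshold=80,  # Rekognition's default confidence threshold
        )

    # Each reported match carries a Similarity score from 0 to 100.
    for match in response["FaceMatches"]:
        print(f"Similarity: {match['Similarity']:.1f}%")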

The ACLU used Rekognition’s default settings, which set the confidence threshold at 80 percent.

Amazon disputed the ACLU's findings, saying the threshold was too low. An Amazon spokesperson told GCN it should be set to at least 95 percent for law enforcement purposes, and a blog post on the Amazon Web Services website stated it should be 99 percent. However, a report by Gizmodo found that it's up to police discretion to set those thresholds, and they don't always follow Amazon's recommendations.

Raising the confidence threshold inevitably leads to fewer false positives (incorrectly matching two photos of different people), but also more false negatives (failure to match two photos of the same person). Unfortunately, we can’t measure the latter in this experiment. More on that later.
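To make the trade-off concrete, here is a toy illustration with made-up similarity scores; the numbers are invented purely to show how raising the threshold shifts errors from one type to the other:

    # Each pair is (similarity score, whether the two photos really show
    # the same person). These scores are fabricated for illustration.
    scored_pairs = [
        (99.2, True), (96.5, True), (91.0, True), (84.3, True),
        (88.7, False), (82.1, False), (79.5, False), (62.0, False),
    ]

    for threshold in (80, 95, 99):
        false_positives = sum(1 for score, same in scored_pairs
                              if score >= threshold and not same)
        false_negatives = sum(1 for score, same in scored_pairs
                              if score < threshold and same)
        print(f"At {threshold}%: {false_positives} false positives, "
              f"{false_negatives} false negatives")

On this toy data, moving from 80 to 95 percent eliminates both false positives but introduces two false negatives: exactly the trade-off we couldn't measure against real mugshots.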

We contacted both the ACLU and Amazon for comment and will update this article if we receive a response on the record.

US

The US data set consisted of photos of 430 Representatives and 100 Senators, for 530 members of Congress in total.

At an 80 percent confidence threshold, Rekognition incorrectly matched an average of 32 US members of Congress to mugshots in the arrest database. That's four more false matches than the ACLU found two years ago.

By those standards, Amazon's face recognition hasn't improved; if anything, it performed worse than in the ACLU's test two years ago.

When we increased the threshold to what Amazon recommends for law enforcement, however, we found no incorrect matches at or above 95 percent confidence. The ACLU did not report results at this threshold back in 2018, so we have no earlier results to compare against.

UK

Our UK data set consisted of 1,429 politicians: 632 Members of Parliament and 797 Members of the House of Lords. We matched them against the same arrest photos as the US politicians.

At an 80 percent confidence threshold, Rekognition incorrectly matched an average of 73 UK politicians to mugshots in the arrest database.

The rate of false positives was lower for UK politicians (73 of 1,429, or about 5 percent) than for US ones (32 of 530, or about 6 percent), which might suggest UK politicians look slightly less like the people in our arrest database, at least according to Rekognition.

When we raised the confidence threshold to 95 percent, there were no incorrect matches.

Racial bias

The ACLU alleged that, at an 80 percent confidence threshold, Amazon's face recognition technology was racially biased, misidentifying people of color at a higher rate than white people.

Our results support this finding. Of the 12 politicians misidentified at a confidence threshold of 90 percent or higher, six were not white (as shown in the image at the top of this article). That means half of the misidentified politicians were people of color, even though people of color make up only about one-fifth of the US Congress and one-tenth of the UK parliament.

Methodology

We used publicly available photos of 430 US Representatives, 100 US Senators, 632 members of UK Parliament, and 797 members of the House of Lords.

These were matched against four sets of 25,000 randomly chosen arrest photos from Jailbase.com using Amazon Rekognition. We ran the experiment once for each set and averaged the results. Because the ACLU did not publish its test data, we could not use the exact same database of arrest photos.

In some instances, a single politician was matched against multiple mugshots. We counted each such case as a single false positive.
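For illustration, here is a rough sketch of that procedure using Rekognition's collection APIs via boto3. The client calls are real, but the file lists and names are placeholders standing in for our actual harness:

    import boto3

    rekognition = boto3.client("rekognition")

    # Placeholders: the real test used 25,000 mugshots per set
    # and photos of 1,959 politicians.
    mugshot_paths = ["mugshot_0001.jpg"]
    politician_photos = [("Jane Doe MP", "jane_doe.jpg")]

    # Index one set of arrest photos into a Rekognition collection.
    rekognition.create_collection(CollectionId="mugshots")
    for path in mugshot_paths:
        with open(path, "rb") as f:
            rekognition.index_faces(CollectionId="mugshots",
                                    Image={"Bytes": f.read()})

    # Search each politician's photo against the collection. A politician
    # matched to several mugshots still counts as one false positive.
    misidentified = set()
    for name, path in politician_photos:
        with open(path, "rb") as f:
            result = rekognition.search_faces_by_image(
                CollectionId="mugshots",
                Image={"Bytes": f.read()},
                FaceMatchThreshold=80,
            )
        if result["FaceMatches"]:
            misidentified.add(name)

    print(len(misidentified), "politicians matched at or above 80% confidence")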

This spreadsheet contains all of the politicians who matched at or above 70 percent confidence, their photos, and the confidence at which Rekognition matched them.

Why you shouldn’t trust face recognition accuracy statistics

Be skeptical any time a company invested in face recognition peddles metrics about how well it works. The statistics are often opaque and sometimes downright misleading.

Here's an example of how statistics about face recognition accuracy can be twisted. In the UK, London's Metropolitan Police claimed its face recognition technology makes a mistake in only one of every 1,000 cases. The force reached this number by dividing the number of incorrect matches by the total number of people whose faces were scanned. That inflates the accuracy rating by counting true negatives: the vast majority of faces that were never matched to anything.

In contrast, independent researchers at the University of Essex found the technology had an error rate of 81 percent when they divided the number of incorrect matches by the total number of reported matches. The University’s report is much more in line with how most people would reasonably judge accuracy, disregarding true negatives and focusing on the rate at which reported matches are correct.

A later report found the Met police used live face recognition to scan 8,600 people’s faces without consent in London. The results were in line with the University of Essex’s findings: one correct match leading to an arrest, and seven false positives.
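Plugging that deployment's numbers into both formulas shows how far apart the two ways of measuring can land:

    # Figures from the reported Met deployment: 8,600 faces scanned,
    # 1 correct match, 7 false positives.
    faces_scanned = 8_600
    true_matches = 1
    false_matches = 7

    # Met-style: errors divided by everyone scanned, so the huge number
    # of true negatives swamps the mistakes.
    met_error_rate = false_matches / faces_scanned

    # Essex-style: errors divided by reported matches only.
    essex_error_rate = false_matches / (true_matches + false_matches)

    print(f"Met-style error rate:   {met_error_rate:.2%}")    # about 0.08%
    print(f"Essex-style error rate: {essex_error_rate:.2%}")  # 87.50%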

False negatives

Even more seldom reported is the rate of false negatives: two images of the same person that should have been matched, but weren’t. As a hypothetical example of this error in practice, a face recognition-equipped camera at an airport would fail to trigger an alert upon seeing a person it should have recognized. Another form of false negative would be failing to recognize that a face exists in an image at all.

In order to measure the rate of false negatives, we would have to populate our mugshot database with some real—but not identical—photos of the politicians. Because our aim was to recreate the ACLU’s test, this was beyond the scope of our experiment.

Real world use cases

Let's also consider what we're comparing: two sets of headshots. One contains police mugshots and the other polished official portraits, but both offer clear views of each person's face at eye level, facing the camera.

Real-world use cases are much different. Take CCTV surveillance, for example: police want to scan faces at an intersection and match them against a criminal mugshot database. Here are just a few of the factors that muddy claims about how well face recognition performs in such a setting:

  • How far away is the camera from the subject?
  • At what angle is the camera pointed at the subject?
  • What direction is the subject facing?
  • Is the subject obscured by other humans, objects, or weather?
  • Is the subject wearing makeup, a hat, or glasses, or have they recently shaved?
  • How good is the camera and lens? Is it clean?
  • How fast is the subject moving? Are they blurry?

All of these factors and more affect face recognition accuracy and performance. Even the most advanced face recognition software available can’t make up for poor quality or obscured images.

Putting too much faith in face recognition can lead to false arrests. In April 2019, for example, a student sued Apple after the company’s face recognition software falsely linked him to thefts at several Apple stores, leading to his arrest.

Using a threshold higher than 80 percent certainly improves results. But whether you agree with police use of face recognition or not, one thing is certain: it isn't ready to be used for identification without human oversight. As Amazon states in its blog post, "In real-world public safety and law enforcement scenarios, Amazon Rekognition is almost exclusively used to help narrow the field and allow humans to expeditiously review and consider options using their judgment (and not to make fully autonomous decisions)."