When the UK first set out to find an alternative to school leaving qualifications, the premise seemed perfectly reasonable. Covid-19 had derailed any opportunity for students to take the exams in person, but the government still wanted a way to assess them for university admission decisions.
Chief among its concerns was an issue of fairness. Teachers had already made predictions of their students’ exam scores, but previous studies had shown that these could be biased on the basis of age, gender, and ethnicity. After a series of expert panels and consultations, Ofqual, the Office of Qualifications and Examinations Regulation, turned to an algorithm. From there, things went horribly wrong.
Nearly 40% of students ended up receiving exam scores downgraded from their teachers’ predictions, threatening to cost them their university spots. Analysis of the algorithm also revealed that it had disproportionately hurt students from working-class and disadvantaged communities and inflated the scores of students from private schools. On August 16, hundreds chanted “Fuck the algorithm” in front of the UK’s Department for Education building in London to protest the results. By the next day, Ofqual had reversed its decision. Students will now be awarded either their teacher’s predicted scores or the algorithm’s, whichever is higher.
The debacle feels like a textbook example of algorithmic discrimination. Those who have since dissected the algorithm have pointed out how predictable it was that things would go awry: it was trained, in part, on the past entrance-exam performance of each student’s school, not just on the student’s own academic record. That approach could only punish outstanding outliers in favor of a consistent average.
But the root of the problem runs deeper than bad data or poor algorithmic design. The more fundamental errors were made before Ofqual even chose to pursue an algorithm. At bottom, the regulator lost sight of the ultimate goal: to help students transition into university during anxiety-ridden times. In this unprecedented situation, the exam system should have been completely rethought.
“There was just a spectacular failure of imagination,” says Hye Jung Han, a researcher at Human Rights Watch in the US, who focuses on children’s rights and technology. “They just didn’t question the very premise of so many of their processes even when they should have.”
At a basic level, Ofqual faced two potential objectives after exams were canceled. The first was to avoid grade inflation and standardize the scores; the second was to assess students as accurately as possible in a way useful for university admissions. Under a directive from the secretary of state, it prioritized the first goal. “I think really that’s the moment that was the problem,” says Hannah Fry, a senior lecturer at University College London and author of Hello World: How to Be Human in the Age of the Machine. “They were optimizing for the wrong thing. Then it basically doesn’t matter what the algorithm is—it was never going to be perfect.”
The objective completely shaped the way Ofqual went about pursuing the problem. The need for standardization overruled everything else. The regulator then logically chose one of the best standardization tools, a statistical model, for predicting a distribution of entrance-exam scores for 2020 that would match the distribution from 2019.
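To see why matching a past distribution squeezes out outliers, consider a minimal sketch. It is not Ofqual's actual model. It assumes, purely for illustration, that students are placed in rank order and that grades are then handed out so the cohort reproduces the school's historical grade shares. The function name, grade labels, student names, and numbers are all hypothetical.

```python
from typing import Dict, List

# Grade labels from best to worst (illustrative; not taken from the article).
GRADES = ["A*", "A", "B", "C", "D", "E", "U"]


def fit_to_history(ranked_students: List[str],
                   historical_share: Dict[str, float]) -> Dict[str, str]:
    """Hand out grades in rank order so that the cohort reproduces the
    school's historical grade shares (hypothetical, simplified model)."""
    if abs(sum(historical_share.values()) - 1.0) > 1e-9:
        raise ValueError("historical shares must sum to 1")
    n = len(ranked_students)
    assigned: Dict[str, str] = {}
    cumulative = 0.0
    next_student = 0
    for grade in GRADES:
        # Cumulative share of students who historically got this grade or better.
        cumulative += historical_share.get(grade, 0.0)
        cutoff = min(n, round(cumulative * n))
        while next_student < cutoff:
            assigned[ranked_students[next_student]] = grade
            next_student += 1
    return assigned


# A made-up school whose past cohorts never scored above a B.
ranked = ["Asha", "Ben", "Chloe", "Dev", "Ellie"]  # best to worst
teacher_prediction = {"Asha": "A*", "Ben": "B", "Chloe": "C",
                      "Dev": "C", "Ellie": "D"}
history = {"B": 0.2, "C": 0.6, "D": 0.2}

standardized = fit_to_history(ranked, history)
downgraded = [s for s in ranked
              if GRADES.index(standardized[s]) > GRADES.index(teacher_prediction[s])]
```

In this toy cohort the top-ranked student is predicted an A* but cannot receive anything above a B, because the school's history contains no higher grade. The cap comes from the objective of reproducing the old distribution, not from anything the student did.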
Had Ofqual chosen the other objective, things would have gone quite differently. It likely would have scrapped the algorithm and worked with universities to change how the exam grades are weighted in their admissions processes. “If they just looked one step past their immediate problem and looked at what are the purpose of grades (to go to university, to be able to get jobs), they could have flexibly worked with universities and with workplaces to say, ‘Hey, this year grades are going to look different, which means that any important decisions that traditionally were made based off of grades also need to be flexible and need to be changed,’” says Han.
In fixating on the perceived fairness of an algorithmic solution, Ofqual blinded itself to the glaring inequities of the overall system. “There’s an inherent unfairness in defining the problem to predict student grades as if a pandemic hadn’t happened,” Han says. “It actually ignores what we already know, which is that the pandemic exposed all of these digital divides in education.”
Ofqual’s failures are not unique. In a report published last week by the Oxford Internet Institute, researchers found that one of the most common traps organizations fall into when implementing algorithms is the belief that they will fix really complex structural issues. These projects “lend themselves to a kind of magical thinking,” says Gina Neff, an associate professor at the institute, who coauthored the report. “Somehow the algorithm will simply wash away any teacher bias, wash away any attempt at cheating or gaming the system.”
But the truth is, algorithms cannot fix broken systems. They inherit the flaws of the systems in which they’re placed. In this case, the students and their futures ultimately bore the brunt of the harm. “I think it’s the first time that an entire nation has felt the injustice of an algorithm simultaneously,” says Fry.
Fry, Neff, and Han all worry that this won’t be the end of algorithmic gaffes. Despite the new public awareness of the problems, designing and implementing fair and beneficial algorithms is frankly really hard.
Nonetheless, they urge organizations to make the most of the lessons learned from this experience. First, return to the objective and critically think about whether it’s the right one. Second, evaluate the structural issues that need to be fixed in order to achieve the objective. (“When the government canceled the exam in March, that should have been the signal to come up with another strategy to allow a much larger ecology of decision makers to fairly assess student performance,” Neff says.)
Finally, pick a solution that’s easy to understand, implement, and contest, especially in times of uncertainty. In this case, says Fry, that means forgoing the algorithm in favor of teacher-predicted scores: “I’m not saying that’s perfect,” she says, “but it’s at least a simple and transparent system.”