I’ve often been told, “The data does not lie.” However, that has never been my experience. For me, the data nearly always lies. Google Image search results for “healthy skin” show only light-skinned women, and a query on “Black girls” still returns pornography. The CelebA face data set has labels of “big nose” and “big lips” that are disproportionately assigned to darker-skinned female faces like mine. ImageNet-trained models label me a “bad person,” a “drug addict,” or a “failure.” Data sets for detecting skin cancer are missing samples of darker skin types.
White supremacy often appears violently—in gunshots at a crowded Walmart or church service, in the sharp remark of a hate-fueled accusation or a rough shove on the street—but sometimes it takes a more subtle form, like these lies. When those of us building AI systems continue to allow the blatant lie of white supremacy to be embedded in everything from how we collect data to how we define data sets and how we choose to use them, it signifies a disturbing tolerance.
Non-white people are not outliers. Globally, we are the norm, and this doesn’t seem to be changing anytime soon. Data sets so specifically built in and for white spaces represent the constructed reality, not the natural one. To have accuracy calculated in the absence of my lived experience not only offends me, but also puts me in real danger.
In a research paper titled “Dirty Data, Bad Predictions,” lead author Rashida Richardson describes an alarming scenario: police precincts suspected or confirmed to have engaged in “corrupt, racially biased, or otherwise illegal” practices continue to contribute their data to the development of new automated systems meant to help officers make policing decisions.
The goal of predictive policing tools is to send officers to the scene of a crime before it happens. The assumption is that locations where individuals have previously been arrested correlate with a likelihood of future illegal activity. What Richardson points out is that this assumption remains unquestioned even when those initial arrests were racially motivated or illegal, sometimes involving “systemic data manipulation, police corruption, falsifying police reports, and violence, including robbing residents, planting evidence, extortion, unconstitutional searches, and other corrupt practices.” Even data from the worst-behaving police departments is still being used to inform predictive policing tools.
As the Tampa Bay Times reports, this approach can provide algorithmic justification for further police harassment of minority and low-income communities. Using such flawed data to train new systems embeds the police department’s documented misconduct in the algorithm and perpetuates practices already known to be terrorizing those most vulnerable to that abuse.
This may appear to describe a handful of tragic situations. However, it is really the norm in machine learning: this is the typical quality of the data we currently accept as our unquestioned “ground truth.”
One day GPT-2, an earlier publicly available version of the automated language generation model developed by the research organization OpenAI, started talking to me openly about “white rights.” Given simple prompts like “a white man is” or “a Black woman is,” the text the model generated would launch into discussions of “white Aryan nations” and “foreign and non-white invaders.”
Not only did these diatribes include horrific slurs like “bitch,” “slut,” “nigger,” “chink,” and “slanteye,” but the generated text embodied a specific American white nationalist rhetoric, describing “demographic threats” and veering into anti-Semitic asides against “Jews” and “Communists.”
Whether dealing with wayward cops or wayward users, technologists choose to allow this particular oppressive worldview to solidify in data sets and define the nature of the models that we develop. OpenAI itself acknowledged the limitations of sourcing data from Reddit, noting that “many malicious groups use those discussion forums to organize.” Yet the organization continues to make use of the Reddit-derived data set, even in subsequent versions of its language model. The dangerously flawed nature of data sources is effectively dismissed for the sake of convenience, despite the consequences. Malicious intent isn’t necessary for this to happen, though a certain unthinking passivity and neglect is.
Little white lies
White supremacy is the false belief that white individuals are superior to those of other races. It is not a simple misconception but an ideology rooted in deception. Race is the first myth, superiority the next. Proponents of this ideology stubbornly cling to an invention that privileges them.
I hear how this lie softens language from a “war on drugs” to an “opioid epidemic,” and blames “mental health” or “video games” for the actions of white assailants even as it attributes “laziness” and “criminality” to non-white victims. I notice how it erases those who look like me, and I watch it play out in an endless parade of pale faces that I can’t seem to escape—in film, on magazine covers, and at awards shows.
Tell me: what is the difference between overpolicing in minority neighborhoods and the bias of the algorithm that sent officers there? What is the difference between a segregated school system and a discriminatory grading algorithm? Between a doctor who doesn’t listen and an algorithm that denies you a hospital bed? There is no systemic racism separate from our algorithmic contributions, from the hidden network of algorithmic deployments that regularly collapse on those who are already most vulnerable.
Resisting technological determinism
Technology is not independent of us; it’s created by us, and we have complete control over it. Data is not just arbitrarily “political”—there are specific toxic and misinformed politics that data scientists carelessly allow to infiltrate our data sets. White supremacy is one of them.
We’ve already inserted ourselves and our decisions into the outcome—there is no neutral approach. There is no future version of data that is magically unbiased. Data will always be a subjective interpretation of someone’s reality, a specific presentation of the goals and perspectives we choose to prioritize in this moment. That’s a power held by those of us responsible for sourcing, selecting, and designing this data and developing the models that interpret the information. Essentially, there is no exchange of “fairness” for “accuracy”—that’s a mythical sacrifice, an excuse not to own up to our role in defining performance at the exclusion of others in the first place.
Distracted by corporate condolences, abstract technical solutions, and articulate social theories, I’ve watched peers congratulate themselves on invisible progress. Ultimately, I envy them, because they have a choice in the same world where I, like every other Black person, cannot opt out of caring about this.
As Black people now die in a cacophony of natural and unnatural disasters, many of my colleagues are still more galvanized by the latest product or space launch than the jarring horror of a reality that chokes the breath out of me.
For years, I’ve watched this issue extolled as important, but it’s clear that dealing with it is still seen as a non-priority, “nice to have” supplementary action—secondary always to some definition of model functionality that doesn’t include me.
Models clearly still struggling to address these bias challenges get celebrated as breakthroughs, while people brave enough to speak up about the risk get silenced, or worse. There’s a clear cultural complacency with things as usual, and although disappointing, that’s not particularly surprising in a field where the vast majority just don’t understand the stakes.
The machine-learning community continues to accept a certain level of dysfunction as long as only certain groups are affected. This needs conscious change, and that will take as much effort as any other fight against systemic oppression. After all, the lies embedded in our data are not much different from any other lie white supremacy has told. They will thus require just as much energy and investment to counteract.
Deborah Raji is a Mozilla fellow interested in algorithmic auditing and evaluation. She has worked on several award-winning projects to highlight cases of bias in computer vision and improve documentation practices in machine learning.