IBM’s photo-scraping scandal shows what a weird bubble AI researchers live in

Karen Hao

March 15, 2019

On Tuesday, NBC published a story with a gripping headline: “Facial recognition’s ‘dirty little secret’: Millions of online photos scraped without consent.” I linked to it in our last Algorithm issue, but it’s worth a revisit today.

The story highlights a recent data set released by IBM with 1 million pictures of faces, intended to help develop fairer face recognition algorithms. (I wrote about the news at the time too.) It turns out, NBC found, that those faces were scraped directly from the online photo-hosting site Flickr, without the permission of the subjects or photographers.

For some of you, this practice will immediately feel creepy and weird. For others, it will seem perfectly normal. What this story exposed was not so much a “dirty little secret” as the cultural gulf between the public and the AI community.

Really, for industry insiders, IBM did nothing out of the ordinary. AI researchers hoover up data from various corners of the internet all the time to feed the ever-hungry machine-learning algorithms that require massive amounts of it to train. Instagram photos, for example, are a common source of image data; the hashtags often conveniently correspond to the content of the photos, making it extra easy to generate labeled data. New York Times and Wall Street Journal articles are also a common source of data for well-written, copy-edited sentences. Even better that they are categorized by topic: technology, business, sports.

In fact, scraping data from publicly available sources is so standard in the industry that it’s taught as a foundational skill (sans ethics) in most data science and machine-learning training. Meanwhile, most tech platforms are designed to invite such scraping by offering APIs with direct access to their data. Until recently, this was done without a second thought. (Hello, Facebook.)

None of this is to say that scraping data is right or wrong. There are some totally benign or legitimate ways in which the practice can be used (see “We analyzed 16,625 papers to figure out where AI is headed next”); it really depends on the context. Rather, this story highlights the need for the tech industry to adapt its cultural norms and standard practices to keep pace with the rapid evolution of the technology itself, as well as the public’s awareness of how their data is used.

“There are ways to use our data today that we were not aware of five, 10 years ago,” says Rumman Chowdhury, the global lead for responsible AI at Accenture Applied Intelligence. “How could we [the public] possibly have agreed to a capability that did not exist?”

In other words, it may once have been a viable practice to blithely scrape people’s data, and public availability may once have been consent enough for that data to be used, but the advent of AI and the unprecedented scale of Silicon Valley’s data monopolization and monetization have changed the equation. Technologists bear the responsibility of changing with it and making sure there is a broad, informed societal consensus for their practices.

Chowdhury’s tip to those struggling to navigate the gray areas of data privacy? Think about whether the way you’re using data is in the spirit in which it was originally generated and shared. If you are using it in a completely tangential way, it’s time to pause and reconsider.

This story originally appeared in our AI newsletter The Algorithm.
