Data Can Lie: Here's A Guide To Calling Out B.S.

Big data and machine learning are making it easier to B.S. with data, so two scientists made a free syllabus to combat it.

We live in a time awash in bullshit. There's B.S. of the political kind, a type that's risen to the forefront of the national conversation over the last year through fake news. But there are also more insidious forms, particularly in the world of big data and machine learning.

According to University of Washington professors Carl T. Bergstrom and Jevin West, it's time someone did something about it. Their answer? The Bullshit Syllabus, a free, structured course of readings and case studies aimed at giving students (and anyone else who's interested) the tools to look critically at scientific claims driven by data and machine learning. Over the past six months, the two scientists created the syllabus and published it online in the hopes that the UW administration would take notice and turn it into a real class (it's currently winding its way through the approval process, and might be offered as soon as this spring).

The two have been frustrated for years with the way statistical findings are treated in the media and in the classroom. West, a professor in the Information School and the director of UW's Data Lab, believes that the amount of bullshit has grown with the emergence of big data and the increasing availability of tools that let more people work with it. With so much data out there, there is simply more potential for data scientists and designers to shape it to fit their own conclusions, or even to intentionally mislead their audience.

While Bergstrom, an evolutionary biologist, believes that "bullshit has been around forever" and is reluctant to say that levels have risen dramatically, he agrees that it's incredibly easy these days for bullshit to be taken out of context and go viral. Most people don't take the time to fact-check graphs and data visualizations before sharing them online, like this example that looks at voter turnout but misrepresents the data. Beyond that, he thinks big data might be particularly susceptible to this kind of bullshit. Before big data became a primary research tool, testing a nonsensical hypothesis on a small dataset wouldn't necessarily lead you anywhere. But in an enormous dataset, he says, there will always be some kind of pattern. "It's much easier for people to accidentally or deliberately dredge pattern out of all the data," he says. "I think that's a bit of a new risk."
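
To see why, here's a minimal sketch of that dredging risk in Python (the dataset, sizes, and seed are invented for illustration; this isn't material from the syllabus). Every column is pure noise, yet hunting across enough pairs always turns up a seemingly strong correlation:

    import numpy as np

    # 1,000 columns of pure noise: no variable is related to any other.
    rng = np.random.default_rng(seed=0)
    n_samples, n_variables = 50, 1000
    data = rng.normal(size=(n_samples, n_variables))

    corr = np.corrcoef(data, rowvar=False)  # all pairwise correlations
    np.fill_diagonal(corr, 0.0)             # ignore trivial self-correlations

    i, j = np.unravel_index(np.abs(corr).argmax(), corr.shape)
    print(f"Strongest 'pattern': variables {i} and {j}, r = {corr[i, j]:.2f}")
    # With roughly half a million pairs to search, an |r| comfortably above
    # 0.5 appears by chance alone.

The more columns you add, the more impressive the meaningless "finding" becomes, and nothing in the output distinguishes accidental dredging from the deliberate kind.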

That's also the case with researchers using machine learning algorithms. An algorithm might give very strong results, Bergstrom says, but for users, it can be hard to know exactly what data the algorithm is drawing on and whether to trust it. For the people writing the algorithms, applying a healthy dose of skepticism to their output is the most responsible way to use them, especially because algorithms are being trained to make decisions and identify people. Can an algorithm really look at a person's facial features and determine their propensity for criminality? Yeah, maybe not. But that was the argument of a paper published just a few months ago.

“If you look deeper, you find some spurious things driven mostly by how that person was dressed, if they’re frowning or not,” West says. “There’s manipulation of the way the results are reported.” Not to mention that human bias and existing structural inequalities can make algorithms just as flawed as the humans that make them. But it remains the responsibility of designers, as well as scientists and journalists, to think critically about the data they use, especially since the machine-learning designer is one of the most important design jobs of the future.

The root of the problem is a lack of skepticism–something that could have big consequences as designers and developers increasingly use data and algorithms to inform their work. But it also impacts anyone examining a piece of scientific evidence; West and Bergstrom believe this course would be helpful for all UW undergraduates, and hope to bring it to a much wider audience through MOOCs and by partnering with teachers at other universities (and high schools, albeit with a slightly more age-appropriate name).

So how do you combat the bullshit? West and Bergstrom propose a simple set of questions to ask any time you're looking at a set of results. Think about the source of the information. Who's telling you this? How does it advance their interests? Find out where they got the information and look at the original source yourself. Is it a credible source? What were the methods used to arrive at the end result? For example, the duo points to a 2004 paper published in Nature that claimed women would run the 100-meter dash faster than men by 2156. The problem: the conclusion was reached by fitting a linear regression to historical winning times and extrapolating it forward, and the same line predicts that by 2636 the times for running the race would be negative. It's a classic case of over-extrapolation: just because female sprinters have made large speed gains over the last 100 years doesn't mean they will continue to do so indefinitely.
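
The absurdity is easy to reproduce in a few lines of Python (the winning times below are invented to mimic the historical trend, not the paper's actual data): fit a straight line to steadily improving sprint times, then push it centuries past the data.

    import numpy as np

    # Made-up women's 100m winning times, improving ~0.018 s/year since 1928.
    rng = np.random.default_rng(seed=2)
    years = np.arange(1928, 2021, 4)
    times = 12.2 - 0.018 * (years - 1928) + rng.normal(0, 0.03, years.size)

    slope, intercept = np.polyfit(years, times, 1)  # ordinary linear fit
    for future in (2156, 2636):
        print(f"{future}: predicted time {slope * future + intercept:.2f} s")
    # 2156 merely looks optimistic; 2636 comes out negative, because a
    # straight line has no idea that sprint times can't fall forever.

The fit describes the past reasonably well; the bullshit enters only when the line is extrapolated far beyond the data that produced it.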

West and Bergstrom themselves aren't immune, and they've included their own material in the course's case studies to show that this isn't a failing unique to particular individuals. "Sometimes we can't even trust ourselves," West says. "Humans are fallible creatures."

Now that’s a bullshit antidote. Check out the syllabus for yourself here.

About the author

Katharine Schwab is a contributing writer at Co.Design based in New York who covers technology, design, and culture. Follow her on Twitter @kschwabable.
