Data Scientists and Their Secret Stew

People misuse the word “literally” all the time, and that’s literally one of my biggest pet peeves. But I’ll make an exception when it comes to describing what we’re doing with data. We’re literally drowning in data. By 2025, some estimate we’ll have a global store of 175 zettabytes of the stuff. That’s about 175 billion one-terabyte hard drives, roughly the largest you’ll typically find in a computer today.
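If you want to check that figure, the arithmetic fits in a few lines of Python (assuming a one-terabyte drive, since that’s the size the comparison implies):

```python
# Back-of-the-envelope check of the 175-zettabyte claim,
# assuming a "large" consumer hard drive holds 1 terabyte.
ZETTABYTE = 10**21  # bytes
TERABYTE = 10**12   # bytes

global_datasphere = 175 * ZETTABYTE         # projected global data store, 2025
drive_capacity = 1 * TERABYTE               # assumed drive size

print(global_datasphere // drive_capacity)  # 175000000000 -> 175 billion drives
```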

Data scientists work to turn this data into information. Information is data in context. Once data has a context, we can extract its meaning. To give information meaning, data scientists look for patterns in the data. They try to classify and characterize it. When they do, they end up with knowledge, the Holy Grail of modern data science. Much of today’s practice of data science aims at turning information into knowledge efficiently and accurately.

What data scientists lack – and where machines still stumble – is the ability to explain why the knowledge they have uncovered is true. In other words, computers can’t readily understand what they know; they can only know it. And, because they lack such understanding, they can’t exhibit wisdom: the ability to apply what they know to make appropriate decisions. For example, we might ask Alexa to cure cancer, or to reduce health care costs, or to halt global warming. Alexa might know that humans contract cancer, the treatment for which drives up health care costs, and that health care facilities and drug production both consume fuels that pollute the environment and contribute to global warming. What more efficient way to fix all three problems than to get rid of humans? A mobilized Alexa army could very well do that one day. But would that be the wisest choice for Alexa to make? After all, who would then be left to purchase Amazon Prime subscriptions? Eradicating mankind is a solution, but clearly not a wise one.

Machines fall short in the understanding and wisdom departments because data scientists thus far have been unable to come up with models comprehensive enough to address the essential nuances of any but the simplest of problems. If we could account for all variables and address all outliers, then there would, by definition, be no outliers, and the knowledge we extract from the contextualized data could, in all cases, be explained and acted upon. Instead, the data models we devise end up being nothing more than opinions expressed in math. Like all opinions, they are informed partially by fact, but they also reflect the influences of other opinions, stereotypes, preconceived notions, confirmation bias, and news, both real and fake. We can include all of these influences as factors to consider and weigh them against other factors to create the mathematical stew we call our model. The ingredients in that stew include a subset of the 175 zettabytes of data we mentioned before, combined in just the right mix that we – the chefs – might be the only ones to know. This is our stew, after all. What emerges from this culinary exercise is a nourishing mixture with enough caloric content to sustain us through what we need to do.
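To make the stew metaphor concrete, here is a minimal sketch of such a model: a score that is nothing more than a weighted sum of hand-picked factors. Every feature name and weight below is hypothetical; the point is that each one is a choice the chef made.

```python
# A minimal sketch of a model as "opinions expressed in math":
# a weighted sum of hand-picked factors. Every feature name and
# weight here is hypothetical -- each is a choice the modeler made.
factors = {
    "payment_history": 0.8,   # the mostly factual ingredient
    "zip_code_risk":   0.5,   # a stereotype dressed up as a number
    "news_sentiment":  0.3,   # real or fake, it still seasons the score
}

def score(person: dict) -> float:
    """Weigh each ingredient and stir; the recipe is the chef's alone."""
    return sum(weight * person.get(name, 0.0)
               for name, weight in factors.items())

print(score({"payment_history": 0.9, "zip_code_risk": 0.2, "news_sentiment": 0.5}))
```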

Sometimes, though, we’re missing some of the ingredients, and we use substitutes. These proxy ingredients help stretch the recipe so we can still make the stew, but they tend to lead to an inferior product, one that lacks freshness, has a peculiar taste, or offends some people’s palates. The proxies might even trigger allergies and result in a mixture that makes some of us fall ill. It still gives us the fuel we need to get through, but some of us might suffer afterward because of the adverse effects of consuming ingredients that weren’t entirely appropriate. Not to be gross, but some of us might really suffer. But the stew was all we had to eat, so we keep coming back for more, even if we know the aftereffects might not be so desirable.

Modern data science is a lot like this. Although so much data exists, the vast majority of it hasn’t been analyzed and organized yet. Instead, it sits in the cloud as an amorphous mass of unexplored electrons. And even some of the data we have collected and organized can’t be used, perhaps because it includes personally identifiable information, or racial and demographic attributes that, by law, cannot inform decisions that affect people. Still, our recipe – our model – requires some understanding of these missing data, and so we employ substitutes. In other words, we are often forced to use Spam in place of real meat. Everyone knows it’s not real meat, especially the cooks, and it might not even taste remotely like it. But it allows us to put the stew on the table. It gives us a model we can use to make the decisions we need to make.
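Here is what that substitution can look like in code: a hypothetical sketch in which a field we can’t use (a person’s income) is replaced by the average for their ZIP code, a proxy that quietly encodes neighborhood demographics. All names and numbers are made up.

```python
# A hypothetical proxy ingredient: when income is missing or off-limits,
# substitute the ZIP-code average. All names and numbers are made up.
ZIP_AVG_INCOME = {"60446": 54_000, "60616": 38_000}   # hypothetical lookup
DEFAULT_INCOME = 45_000                               # fallback when ZIP unknown

def income_feature(record: dict) -> float:
    # Real meat if we have it...
    if record.get("income") is not None:
        return record["income"]
    # ...Spam if we don't: the neighborhood average stands in for the person.
    return ZIP_AVG_INCOME.get(record.get("zip", ""), DEFAULT_INCOME)

print(income_feature({"zip": "60616"}))   # 38000 -- the proxy, not the person
```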

In practice, there are many examples of inappropriate models shaping decisions on critically important matters. In Florida, for example, drivers with a clean driving record but a poor credit score might pay nearly $1,600 more for car insurance than those with a good credit score and a drunk driving conviction. The mathematical model used to price insurance apparently treats financial difficulty as an indicator of overall recklessness and sets rates accordingly. For another example, consider recidivism, the tendency of people convicted of crimes to reoffend and return to prison. Models have been employed that consider the neighborhood where the defendant lives along with survey results in which the person reveals the first time he or she interacted with law enforcement. People from more heavily policed communities, or ones in which there is a lot of friction between citizens and law enforcement, are more likely to interact with law enforcement earlier and more often, and so the model penalizes them based purely on where they live. Unfortunately, the stakes are high: a person who fares poorly by this model receives a longer prison sentence, because the model judges them likely to return to jail eventually. Such models penalize the poor, the disenfranchised, and ethnic and racial minorities, even though it is illegal to use race and demographics in sentencing decisions. Indeed, it seems algorithms can be racist, that math can discriminate.
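One small, concrete defense is to audit models for exactly this kind of disparity. The sketch below applies the “four-fifths rule” commonly used in disparate-impact analysis to a handful of made-up records; the records and field names are hypothetical, but the check itself is a standard one.

```python
# A minimal disparate-impact audit (the "four-fifths rule"): compare the
# rate of favorable outcomes across groups. Records here are made up.
from collections import defaultdict

def disparate_impact(records, group_key, favorable_key):
    """Return (lowest rate / highest rate, per-group favorable rates)."""
    totals, favorable = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r[group_key]] += 1
        favorable[r[group_key]] += r[favorable_key]
    rates = {g: favorable[g] / totals[g] for g in totals}
    return min(rates.values()) / max(rates.values()), rates

records = [
    {"neighborhood": "heavily_policed", "scored_low_risk": 0},
    {"neighborhood": "heavily_policed", "scored_low_risk": 0},
    {"neighborhood": "heavily_policed", "scored_low_risk": 1},
    {"neighborhood": "lightly_policed", "scored_low_risk": 1},
    {"neighborhood": "lightly_policed", "scored_low_risk": 1},
    {"neighborhood": "lightly_policed", "scored_low_risk": 0},
]

ratio, rates = disparate_impact(records, "neighborhood", "scored_low_risk")
print(rates)   # favorable-outcome rates by neighborhood
print(ratio)   # a ratio below ~0.8 is a common red flag
```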

Data scientists must stop contenting themselves with inferior ingredients, because the models that result have tremendous capacity to harm real people. When poor models fail to capture the nuances of a problem, to reflect the diversity of stakeholders affected by it, or to address the situations of people who don’t quite fit the assumed mold, they spread injustice, racism, and discrimination at a scale made all the more fearsome by the Internet’s ubiquity. Just because these ills might have the power of math behind them doesn’t make them any less toxic. In fact, by lending them fake legitimacy, math makes them more so.

Data scientists must openly share their models, clarify their shortcomings and substitutes, and doggedly seek to improve them so as not to leave some people repeatedly harmed. This is a responsibility as critical as a doctor’s to her patients, for the potential consequences are just as dire. In fact, some have argued for a Hippocratic Oath for Data Scientists: a series of promises that commit them to share their models and methods, to be open about the data they are using, to limit themselves to the data people have expressly granted permission to use, to keep the data secure, private, and limited in scope, and to adjust the model continuously as outliers emerge so that it may consider them fully and fairly as humans worthy of justice. Given the immense power and responsibility data scientists hold in today’s data-obsessed world, this seems like a logical and even necessary development.
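What might those promises look like in practice? One lightweight possibility, purely a sketch, is to publish a “model card” alongside every model, naming its ingredients, its substitutes, and its known gaps up front. Everything below is hypothetical.

```python
# A hypothetical "model card" -- one lightweight way to keep the pledge:
# name the ingredients, the substitutes, and the known gaps up front.
model_card = {
    "model": "auto_insurance_pricing_v3",     # hypothetical model name
    "inputs": ["payment_history", "zip_code_risk"],
    "proxies": {"zip_code_risk": "stands in for income data we cannot use"},
    "known_gaps": ["penalizes thin credit files", "untested on rural ZIPs"],
    "consent": "only fields customers expressly agreed to share",
    "review": "re-audited quarterly as outliers emerge",
}

for field, value in model_card.items():
    print(f"{field}: {value}")
```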

A great way for data scientists to declare their commitment to greater openness, fairness, and accuracy is to sign the Global Data Ethics Pledge. It formalizes a commitment to the FORTS Framework: Fairness, Openness, Reliability, Trust, and Social Benefit. One of the most important components of the pledge is this: “I place people before data and am responsible for maximizing social benefit and minimizing harm.” In other words, I’m going to use only the best, most nutritious ingredients in this stew I’m making, and I’ll let everyone know exactly what’s in it before they grab a spoon. That way, no one will get sick from it, and everyone will feel stronger because of it. Data scientists must strive to cook up nothing less.

About Ray Klump

Professor and Chair of Mathematics and Computer Science and Director of the Master of Science in Information Security at Lewis University. http://online.lewisu.edu/ms-information-security.asp, http://online.lewisu.edu/resource/engineering-technology/articles.asp, http://cs.lewisu.edu. You can find him on Google+.
