In my previous post, “Challenging Evidence and Conclusions in Data Science,” I encouraged data science teams to be skeptical of claims and of the evidence offered to support them, and I provided several techniques for challenging both.
However, missing data can be just as misleading as wrong data, if not more so. One of the big problems with missing data is that people can’t see what’s not there. When you have data, you can check for errors and validate it. With missing data, you have nothing to check. You may not even think to ask about it or look for it.
For example, suppose you see the following graph with the headline: “Major Heat Wave in Atlanta!”
Your initial reaction might be that temperatures are rising precipitously in Atlanta and something must be done to reverse this dangerous trend. What’s missing from this graph? The months along the horizontal axis: January through July. Of course monthly temperatures are going to rise dramatically over the spring and summer months!
I once worked for an organization that was trying to figure out why more men than women were participating in its medication trials. A report from the company’s labs showed that 60 percent of its study participants were men and only 40 percent were women. The data science team was assigned the job of finding out why men were more likely than women to participate in the company’s medication studies.
When team members received this report, they asked, “What significant information are we missing? What does it mean that men are more likely than women to participate?” Does it mean that more men applied? Does it mean that equal numbers of men and women applied but a greater number of men were accepted? Or does it mean that equal numbers of men and women applied and were accepted but more men actually participated?
This additional data would shift the team’s exploration in different directions. If more men applied, the next question would be “Why are men more likely than women to apply for our medication studies?” If equal numbers of men and women applied but more men were accepted, the next question would be “Why are more men being accepted?” or “Why are more women being rejected?” If equal numbers of men and women applied and were accepted but more men actually participated, the next question would be “Why are men more likely to follow through?” As you can see, the missing data has a significant impact on where the team directs its future exploration.
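To make the three scenarios concrete, here is a minimal sketch of how a team might check where the imbalance first appears once it has the stage-by-stage counts. The counts, the stage names, and the `first_skewed_stage` helper are all hypothetical; the report described above contained only the final 60/40 split.

```python
# Hypothetical counts for each stage of the study pipeline.
# The original report supplied only the last row; the first two are
# exactly the "missing data" the team had to ask for.
funnel = {
    "applied":      {"men": 500, "women": 500},
    "accepted":     {"men": 300, "women": 200},
    "participated": {"men": 240, "women": 160},
}


def share_of_men(counts):
    """Fraction of people at a stage who are men."""
    return counts["men"] / (counts["men"] + counts["women"])


def first_skewed_stage(funnel, tolerance=0.05):
    """Return the first stage whose share of men differs from 50 percent
    by more than `tolerance`, or None if no stage does."""
    for stage, counts in funnel.items():
        if abs(share_of_men(counts) - 0.5) > tolerance:
            return stage
    return None


shares = {stage: share_of_men(counts) for stage, counts in funnel.items()}
skew_stage = first_skewed_stage(funnel)

for stage, value in shares.items():
    print(f"{stage}: {value:.0%} men")
print("Imbalance first appears at:", skew_stage)
```

With these made-up numbers the split is even among applicants and becomes 60/40 at acceptance, so the next question would be about the acceptance step rather than about who applies or who follows through.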
When you encounter a scenario like this, consider both what data might be missing and why it might be missing:
This last question turned out to be significant. The benefit of having more women participate in the company’s studies is that young women are more likely to be on prescription medication, so the studies would be more comprehensive and could test for a greater number of drug interactions. The flip side is that many women couldn’t participate because they were taking a prescription medication that disqualified them from the study. The statistic could then be rephrased as “60 percent of those who are allowed to participate in our medication studies are men.” This tells an entirely different story.
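The rephrased statistic is really a statement about the denominator. As a rough illustration, the sketch below uses invented counts (the variable names and every number are assumptions, not figures from the company) to show how the same 60 percent can come from an evenly split applicant pool once medication-based exclusions are counted.

```python
# Hypothetical counts; none of these figures come from the actual report.
applicants = {"men": 500, "women": 500}
excluded_for_medication = {"men": 50, "women": 200}

# People who remain eligible after the medication-based exclusion.
eligible = {
    group: applicants[group] - excluded_for_medication[group]
    for group in applicants
}


def share(counts, group):
    """Fraction of the total that belongs to `group`."""
    return counts[group] / sum(counts.values())


men_share_applicants = share(applicants, "men")  # everyone who applied
men_share_eligible = share(eligible, "men")      # only those allowed to participate

# Exclusion rate per group: the rows a participants-only report never shows.
exclusion_rate = {
    group: excluded_for_medication[group] / applicants[group]
    for group in applicants
}

print(f"Men among applicants: {men_share_applicants:.0%}")
print(f"Men among eligible:   {men_share_eligible:.0%}")
print("Exclusion rate by group:", exclusion_rate)
```

In this made-up case men are half of the applicants but 60 percent of the eligible pool, and the difference is explained entirely by the excluded rows that a participants-only report leaves out.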
Data science teams need to remain vigilant regarding missing information. If a claim seems too good or too bad to be true, the team needs to question it and ask, “What’s the rest of the story? What’s missing? What’s been omitted, intentionally or not?” The team should also keep asking, “Do we have all the relevant data?”