Simpson's paradox

April 5, 2018 | Data science

Data science is a very widely used term these days, and one that is generally misunderstood by non-practitioners.

When working on a data science problem, the data is obviously important as the “fuel” for the problem, but the science is often where the value is. There are some perhaps obvious exceptions to this, such as visualisation, but even with visualisations, summarising data or extracting its underlying structure so that it can be communicated is a complex skill.

As with any scientific process, the foundations of solving a data science problem include:

  • formulating a hypothesis to test and a way to test it
  • checking data quality
  • finding hidden patterns in the data
  • understanding the limitations of the tools being used

Take, for example, “finding hidden patterns”. Analysis completed by those who lack a scientific understanding can often fail to consider hidden patterns and so present misleading results.

Incorrect results, presented in an engaging or professional way, can be very dangerous.

A classic example of this is Simpson's paradox. This is a phenomenon in which a trend appears in different groups of data, but disappears or reverses when the groups are combined. A concerning example of this is included in Categorical Data Analysis [1].

In this example, an analysis of death-penalty sentences over a period of time in Florida showed that Caucasian defendants were more likely to receive a death penalty than African-American defendants, which appeared to indicate that there was no racial bias against African-American defendants in sentencing.

Defendant's race | Death | No death | Percent death

However, when the race of the victim was added as an additional data field, the results were strikingly different.

Defendant's race | Victim's race | Death | No death | Percent death

There was a significant racial bias if the victim was Caucasian, with African-American defendants being more than twice as likely as Caucasian defendants to receive a death penalty, and there were no death penalties when the defendant was Caucasian and the victim was African-American.
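The reversal can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the counts are the figures commonly reproduced from the Florida study discussed in [1], entered here as an assumption, so they should be checked against the book before being relied on. The names `counts`, `aggregated` and `death_rate` are made up for this example.

```python
# Counts keyed by (defendant's race, victim's race) -> (death, no death).
# ASSUMPTION: figures as commonly reproduced from the study cited in [1];
# verify them against the book before reuse.
counts = {
    ("Caucasian", "Caucasian"): (53, 414),
    ("Caucasian", "African-American"): (0, 16),
    ("African-American", "Caucasian"): (11, 37),
    ("African-American", "African-American"): (4, 139),
}


def death_rate(death, no_death):
    """Percentage of cases that ended in a death sentence."""
    return 100.0 * death / (death + no_death)


# Aggregated view: sum over the victim's race, i.e. ignore it.
aggregated = {}
for (defendant, _victim), (death, no_death) in counts.items():
    d, n = aggregated.get(defendant, (0, 0))
    aggregated[defendant] = (d + death, n + no_death)

print("Ignoring the victim's race:")
for defendant, (death, no_death) in aggregated.items():
    print(f"  {defendant:<17} {death_rate(death, no_death):5.1f}%")

# Stratified view: one rate per (defendant, victim) combination.
print("Split by the victim's race:")
for (defendant, victim), (death, no_death) in counts.items():
    print(f"  {defendant:<17} {victim:<17} {death_rate(death, no_death):5.1f}%")
```

Under these assumed counts, the aggregated rate is higher for Caucasian defendants, while within each victim group the rate is higher for African-American defendants. The aggregation hides the fact that most cases with Caucasian victims, where death sentences were far more common, involved Caucasian defendants.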

Being aware of the risk of this type of analytic flaw is important to ensuring that analysis is done correctly.
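One way to build this awareness into routine work is to compare the pooled result with the result inside each subgroup before reporting it. The following is a minimal sketch of such a check, not a standard library routine: the helper `reversed_strata`, its arguments and the toy numbers are all invented for illustration.

```python
def rate(events, non_events):
    """Proportion of events; NaN when the cell is empty."""
    total = events + non_events
    return events / total if total else float("nan")


def reversed_strata(table, group_a, group_b):
    """Return the strata in which the A-versus-B ordering of rates is the
    opposite of the ordering seen in the pooled data.

    `table` maps (group, stratum) -> (events, non_events).
    """
    # Pool each group's counts across all strata.
    pooled = {}
    for (group, _stratum), (events, non_events) in table.items():
        e, n = pooled.get(group, (0, 0))
        pooled[group] = (e + events, n + non_events)
    pooled_diff = rate(*pooled[group_a]) - rate(*pooled[group_b])

    flipped = []
    for stratum in sorted({s for _group, s in table}):
        diff = rate(*table[(group_a, stratum)]) - rate(*table[(group_b, stratum)])
        # Opposite signs mean the trend reverses inside this stratum.
        if diff * pooled_diff < 0:
            flipped.append(stratum)
    return flipped


# Hypothetical toy counts, chosen only to show a reversal.
toy = {
    ("A", "x"): (8, 2),
    ("B", "x"): (70, 30),
    ("A", "y"): (30, 70),
    ("B", "y"): (2, 8),
}
print(reversed_strata(toy, "A", "B"))
```

A non-empty result does not say which view is the right one to report; it only flags that the pooled figure and the subgroup figures disagree, which is the cue to think about which additional fields matter.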

[1] A. Agresti, Categorical Data Analysis, 2nd ed. New York: Wiley-Interscience, 2002.