The latest issue of the *American Statistician* has a set of thought-provoking point/counterpoint papers on Simpson’s Paradox, with a tie-in to the controversial issue of *causality*. (I will not address the causality issue here.) Since I have long had my own thoughts about Simpson’s, I’ll postpone the topic I had planned to post this week, and address Simpson’s. Much of the material here will be adapted from my open-source textbook on probability and statistics. I’ll use R code to perform the analysis.

As readers undoubtedly already know, Simpson’s Paradox describes a situation in which variables X and Y are positively related overall, but suddenly become negatively related when conditioned on a third variable Z. (Or vice versa.) The clever title of a medical journal paper, “Good for Women, Good for Men, Bad for People,” neatly sums it up.

I’ve always contended that Simpson’s is not a paradox in the first place; it is simply the effect of using variables in the wrong order. I mean this in a sense similar to forward stepwise regression. (I’m not a fan of stepwise as a variable selection procedure, but the analogy is useful.) The so-called “paradox” is simply the problem that arises from adding a minor factor to a model before including a major one, in contrast to forward stepwise regression, which aims to add the most important factors first.

To illustrate this, let’s consider what is probably the most famous example of Simpson’s, the UC Berkeley graduate admissions data. This concerns a lawsuit that asserted that UCB discriminated against women in admissions to graduate school. In short: Though the data provided in the suit showed that male applicants had a 44% acceptance rate, compared to only 35% for women, some UCB statistics faculty showed that when a third variable is added to the analysis, things change entirely; in fact, this more probing analysis indicates that actually women were faring slightly better than men in the admissions process.

The problem turned out to be that women tended to apply to the more selective programs at UCB. Because they were applying to departments with lower admission rates, the women misleadingly appeared to be the victims of bias. Once the department variable is taken into account, the problem clears up.
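To see this in the data itself, here is a quick sketch using R’s built-in UCBAdmissions table (the same data set loaded below), computing acceptance rates by gender within each department:

```r
# acceptance rate by gender within each department,
# from R's built-in UCBAdmissions table (dims: Admit x Gender x Dept)
ucb <- UCBAdmissions
deptTotals <- margin.table(ucb, c(2, 3))     # Gender x Dept applicant counts
admRate <- ucb["Admitted", , ] / deptTotals  # Gender x Dept acceptance rates
round(admRate, 3)
```

In four of the six departments the women’s acceptance rate is actually the higher one, consistent with the resolution described above.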

But how might the “paradox” have been avoided in the first place? It will be shown here that, provided the relevant variables are in the data set, viewing them in the proper order can be very helpful.

So let’s start up R and take a look. Actually, the UCB data is so famous that it is included in R! Let’s load the data:

> ucb <- UCBAdmissions

Our X, Y and Z here will be Admit (“Admitted” or “Rejected”), Gender (“Male” or “Female”) and Dept. In that last case, six departments are included, referred to as A, B, C, D, E and F.

The data is provided as an R **table**, which is great, as the R function **loglin()** expects its input in this form. So we can use R’s **table** capabilities, for instance determining the total number of observations in the data set:

> sum(ucb)
[1] 4526
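While we are at it, the overall acceptance rates by gender can be recovered from the table as well. A sketch (note that this six-department subset gives a female rate somewhat below the campus-wide 35% figure cited above, but in the same direction):

```r
# overall admit/reject proportions by gender, collapsing over Dept
ucb <- UCBAdmissions
admitGender <- margin.table(ucb, c(1, 2))      # Admit x Gender counts
round(prop.table(admitGender, margin = 2), 3)  # proportions within each gender
```

Here the male acceptance rate comes out near 45% and the female rate near 30%.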

We’ll use the log-linear model methodology. Again see my open-source textbook if you are not familiar with this approach, but here is a quick summary, in the present context:

Let p_{ijk} denote the population probability of cell (i,j,k), so that for example p_{112} is the proportion of applicants who are admitted, male and apply to Department B. The model writes the log of p_{ijk} (or of the expected cell count) as an ANOVA-style expression of sums of main effects and interaction terms, the latter representing various kinds of dependencies among the variables.
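In symbols (my notation here, a sketch, with A, G and D standing for Admit, Gender and Dept), the saturated version of the model writes

```latex
\log p_{ijk} = \lambda
  + \lambda^{A}_{i} + \lambda^{G}_{j} + \lambda^{D}_{k}
  + \lambda^{AG}_{ij} + \lambda^{AD}_{ik} + \lambda^{GD}_{jk}
  + \lambda^{AGD}_{ijk}
```

where each set of \lambda terms is constrained to sum to zero over each of its indices. The second-order model fit next simply drops the three-way term \lambda^{AGD}_{ijk}.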

Now we’ll perform a log-linear model analysis, specifying a model that consists of all terms through second-order:

> llout <- loglin(ucb,list(1:2,c(1,3),2:3),param=TRUE)

Note carefully the argument **param = TRUE**. To me, one of the most unfortunate aspects of log-linear analysis as it is commonly practiced is that it is significance testing-centric, rather than based on point or interval estimation. The fact that **loglin()** does not return parameter estimates by default reflects that, but at least the option is available, hence it is requested here. (If one uses **glm()** and a “Poisson trick,” one can also get standard errors, though with stepwise model fitting they shouldn’t be taken literally.)

Here we have specified the model to include all two-way interaction terms, and by implication all the main effects as well. Let’s look at some of the parameter estimates that result.
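The estimates quoted below can be pulled from the returned object. A sketch (the component names assume **loglin()**’s usual convention of naming interaction components by joining the variable names with a period):

```r
ucb <- UCBAdmissions
# all two-way interactions: Admit-Gender, Admit-Dept, Gender-Dept margins
llout <- loglin(ucb, list(1:2, c(1, 3), 2:3), param = TRUE, print = FALSE)
names(llout$param)       # one component per term in the model
llout$param$Admit        # Admit main effects
llout$param$Admit.Dept   # Admit-Dept interaction terms
```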

The constant term is 4.802457, which gives us something to compare the other parameter estimates to. Here are the main effects:

**Admit:**

| Admitted | Rejected |
| --- | --- |
| -0.3212111 | 0.3212111 |

**Gender:**

| Male | Female |
| --- | --- |
| 0.3287569 | -0.3287569 |

**Dept:**

| A | B | C | D | E | F |
| --- | --- | --- | --- | --- | --- |
| 0.15376258 | -0.76516841 | 0.53972054 | 0.43021534 | -0.02881353 | -0.32971651 |

These don’t really tell us much, other than again giving us an idea of what is considered large and small among the terms. The Dept main effects, by the way, are telling us which departments tend to have more or fewer applicants, interesting but not very relevant here.

Let’s look at some interaction terms. Since we are interested in admissions, let’s consider the interactions Admit-Gender and Admit-Department.

**Admit-Gender:**

| | Male | Female |
| --- | --- | --- |
| Admitted | -0.02493703 | 0.02493703 |
| Rejected | 0.02493703 | -0.02493703 |

**Admit-Dept:**

| | A | B | C | D | E | F |
| --- | --- | --- | --- | --- | --- | --- |
| Admitted | 0.6371804 | 0.6154772 | 0.005914624 | -0.01010004 | -0.232437 | -1.016035 |
| Rejected | -0.6371804 | -0.6154772 | -0.005914624 | 0.01010004 | 0.232437 | 1.016035 |

What a difference! In terms of association with Admit, the relation with Dept is much stronger than that with Gender: the estimated interaction parameters are much larger in magnitude in the former case. In other words, Dept, not Gender, is the more important variable.

And, the results above also show that there is a positive Admit-Female interaction, i.e. that women fare slightly better than men.

Let’s look at this again from a “forward stepwise regression” point of view. Instead of fitting a three-variable model right away, we could first fit the two-variable models Admit-Gender and Admit-Dept. For the latter, for instance, we could run

> loglin(ucb,list(c(1,3)),param=TRUE)

The result is not shown here, but it does reveal that Dept has a much stronger association with Admit than does Gender. We would thus take Admit-Dept as our “first-step” model, and then add Gender to get our second-step model (again, “step” as in the spirit of stepwise regression). That would produce the model we analyzed above, *WITH NO PARADOXICAL RESULT*.
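To make the first-step comparison concrete, here is one way to sketch it, comparing the likelihood-ratio statistics **loglin()** returns for the two candidate one-interaction models (I include the remaining main effect in each, so that the comparison isolates the interactions; **lrt** measures lack of fit relative to the saturated model):

```r
ucb <- UCBAdmissions
# candidate first-step models: Admit-Gender vs. Admit-Dept,
# each with the remaining variable's main effect included
fitAG <- loglin(ucb, list(c(1, 2), 3), print = FALSE)  # Admit-Gender + Dept
fitAD <- loglin(ucb, list(c(1, 3), 2), print = FALSE)  # Admit-Dept + Gender
c(AdmitGender = fitAG$lrt, AdmitDept = fitAD$lrt)      # smaller = better fit
```

The Admit-Dept model fits far better, so Dept enters first.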

By contrast, if we “entered the variables” (continuing the stepwise regression analogy) in the wrong order, i.e. used Gender for our first-step model and then added Dept to produce the second-step model, we get a direction reversal: women appear to be disadvantaged in the first-step model, but turn out to be advantaged once one runs the second-step model.

We could make the regression analogy stronger by actually running logistic regressions, with **glm()**, instead of using the log-linear model (though the parameter estimates and population values would be somewhat different).
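For instance, a sketch of the logistic variant (the coefficient here is on the log-odds scale, so it will not match the log-linear estimates numerically):

```r
ucb <- UCBAdmissions
ucbdf <- as.data.frame(ucb)  # columns Admit, Gender, Dept, Freq
# logistic regression of admission on Gender and Dept,
# with Freq supplying the cell counts
glmout <- glm(Admit == "Admitted" ~ Gender + Dept,
              family = binomial, weights = Freq, data = ucbdf)
coef(glmout)["GenderFemale"]  # positive: women fare slightly better
```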

Of course, all of this assumes that we are insightful enough to include Dept in our data set in the first place. But in terms of Simpson’s Paradox, the analysis here could go a long way toward dispelling the notion that there is a paradox at all.

Such analysis ignores the historical record: women were steered to “women’s professions” (nursing, teaching) for decades; thus the bias by “department” is merely a reflection of historical bias. Further, it’s well known that girls outperform boys in quantitative subjects until high school. Odd, isn’t it, that boys show superiority just when it begins to matter.

My post was not intended to imply that there is or is not bias against women at UCB; I was only discussing Simpson’s (non-) paradox.

As a faculty member in the UC system, though, I can tell you there is huge pressure to increase female representation in engineering and the physical sciences, especially in the faculty. For the record, this is something I support.

And as a mathematician, my experience is that the reasons why the boys overtake the girls in math in high school and college are quite complex.

Hi Norm: That was tremendous. I never thought of it that way. Your technique essentially resolves the paradox by showing that there is no paradox!!! Really clever.

I like Judea Pearl’s coverage of Simpson’s paradox. If you have a causal model that can be agreed upon, there is no paradox, because you know which variables to control for and in what order. However, if you don’t agree on the causal model, there can still be meaningfully different interpretations of the effects, because the different causal models would imply controlling for different variables. Pearl’s point of view is that there is no purely statistical way to answer this question. However, also check out *Causation, Prediction, and Search* (Spirtes et al.) for algorithms which can help at least get to plausible models. These algorithms are generally provably correct if you had an “oracle” to tell you conditional independence, but since we rely on statistical tests, etc. instead, there is noise. Also, some assumptions are often not met in practice, but even in these cases the results can be informative.

How would you interpret the drop1 output from the model below?

Would we drop Admit:Gender?

The large AIC for Gender:Dept validates Robert Young’s comment about gender bias by department.

> margin.table(ucb,c(2,3))
        Dept
Gender     A   B   C   D   E   F
  Male   825 560 325 417 191 373
  Female 108  25 593 375 393 341

> drop1(ucb.loglm2)
Single term deletions

Model:
~Admit + Gender + Dept + Admit * Gender + Admit * Dept + Gender * Dept
             Df     AIC
&lt;none&gt;            58.20
Admit:Gender  1   57.74
Admit:Dept    5  811.61
Gender:Dept   5 1176.90

Also, I thought a major contributor to Simpson’s Paradox was imbalance along the margins. Would there be a problem with Simpson’s Paradox if the design were orthogonal? Is it just the curse of trying to interpret undesigned experiments?

The Gender-Dept interaction shows that women tend to apply to different departments than do men.

Personally, I’m not a big fan of AIC or any other log-based fit measure.

The imbalance issue is merely a symptom, not the disease.

Very well explained. Thanks!