What Can Go Wrong: My Favorite Example

I’m one of many who bemoan the fact that statistics is typically thought of as — alas, even taught as — a set of formula plugging methods. One enters one’s data, turns the key, and the proper answers pop out. This of course is not the case at all, and arguably statistics is as much an art as a science. Or as I like to put it, you can’t be an effective number cruncher unless you know what the crunching means.

One of the worst ways in which statistics can produce bad analysis is the use of significance testing. For sheer colorfulness, I like Professor Paul Meehl’s quote, “Sir Ronald [Fisher] has befuddled us, mesmerized us, and led us down the primrose path.” But nothing beats concrete examples, and I’ll give a couple here.

First, a quick one: I’m active in the Bay Area R Users Group, and a couple of years ago we had an (otherwise-) excellent speaker from one of the major social network firms. He mentioned that he had been startled to find that, with the large data sets he works with, “Everything is significant.” Granted, he came from an engineering background rather than statistics, but even basic courses in the latter should pound into the students the fact that, with large n, even tiny departures from H0 will likely be declared “significant.”

The problem is compounded by the simultaneous inference problem, which points out, in essence, that when we perform a large number of significance tests, with H0 true in all of them, you are still likely to find some of them “significant.” (Of course, this problem also extends to confidence intervals, the typical alternative that I and others recommend.)

My favorite example of this is a Wharton study in which the authors deliberately added fake variables to a real data set. And guess what! In the resulting regression analysis, all of the fake variables were found to be “significant” predictors of the response.

Let’s try our own experiment along these lines, using R. We’ll do model selection first by running lm() and checking which variables were found “significant.” This is a common, if unrefined, method for model selection. We’ll see that it too leads us astray. Another method for variable selection, much more sophisticated, is the LASSO, so we’ll try that one too, with similarly misleading results, actually worse.

For convenience, I’ll use the data from my last post. This is Census data on programmers and engineers in Silicon Valley. The only extra operation I’ve done here (not shown) is to center and scale the data, using scale(), in order to make the fake variables comparable to the real ones in size. My data set, pg2n, includes 5 real predictors and 25 fake ones, generated by

>  pg2n <- cbind(pg2,matrix(rnorm(25*nrow(pg2)),ncol=25)) 

Applying R’s lm() function as usual,

summary(lm(pg2n[,3] ~ pg2n[,-3]))

we find (output too voluminous to show here) that 4 of the 5 real predictors are found significant, but also 2 of the fake ones are significant (and a third has a p-value just slightly above 0.05). Not quite as dramatic as the Wharton data, which had more predictors than observations, but of a similar nature.

Let’s also try the LASSO. This method, favored by some in machine learning circles, aims to reduce sampling variance by constraining the estimated coefficients to a certain limit. The details are beyond our scope here, but the salient aspect is that the LASSO estimation process will typically come up with exact 0s for some of the estimated coefficients. In essence, then, LASSO can be used as a method for variable selection.

Let’s use the lars package from CRAN for this:

> larsout <- lars(pg2n[, -3],pg2n[, 3],trace=T)
> summary(larsout)LARS/LASSO
Call: lars(x = pg2n[, -3], y = pg2n[, 3],trace=T)
   Df   Rss       Cp
0   1 12818 745.4102
1   2 12696 617.9765
2   3 12526 440.5684
3   4 12146  40.7705
4   5 12134  29.1536
5   6 12121  17.7603
6   7 12119  17.4313
7   8 12111  11.5295
8   9 12109  11.3575
9  10 12106  10.6294
10 11 12099   4.9085
11 12 12099   6.2894
12 13 12098   8.1549
13 14 12098   9.0533

Again I’ve omitted some of the voluminous output, but here we see enough. LASSO determines that the best model under Mallows’ Cp criterion would include the same 4 variables identified by lm() — AND 6 of the fake variables!

Undoubtedly some readers will have good suggestions, along the lines of “Why not try such-and-such on this data?” But my point is that all this goes to show, as promised, that effective application of statistics is far from automatic. Indeed, in the 2002 edition of his book, Subset Selection in Regression, Alan Miller laments that “very little progress has been made” in this field since his first edition came out in 1990.

Statistics is indeed as much an art as a science.

I aim for approximately one posting per week to this blog. I may not be able to get one out next week, but will have one within two weeks for sure. The topic: Progress on Rth, my parallel computation package for R, with a collaborator Drew Schmidt of pdbR fame.


Simpson’s Paradox Is Back

The latest issue of the American Statistician has a set of thought-provoking point/counterpoint papers on Simpson’s Paradox, with a tie-in to the controversial issue of causality. (I will not address the causality issue here.) Since I have long had my own thoughts about Simpson’s, I’ll postpone the topic I had planned to post this week, and address Simpson’s. Much of the material here will be adapted from my open-source textbook on probability and statistics. I’ll use R code to perform the analysis.

As readers undoubtedly already know, Simpson’s Paradox describes a situation in which variables X and Y are positively related overall, but suddenly become negatively related when conditioned on a third variable Z. (Or vice versa.) The clever title of a medical journal paper, “Good for Women, Good for Men, Bad for People,” neatly sums it up.

I’ve always contended that Simpson’s is not a paradox in the first place, and instead is simply the effect of using variables in the wrong order. I mean this in a sense similar to forward stepwise regression; I’m not a fan of stepwise as a variable selection procedure, but my point is that I regard Simpson’s so-called “Paradox” as simply being a problem arising from adding a minor factor to a model before including a major factor. This is in contrast to forward stepwise regression, which aims to add the most important factors first.

To illustrate this, let’s consider what is probably the most famous example of Simpson’s, the UC Berkeley graduate admissions data. This concerns a lawsuit that asserted that UCB discriminated against women in admissions to graduate school. In short: Though the data provided in the suit showed that male applicants had a 44% acceptance rate, compared to only 35% for women, some UCB statistics faculty showed that when a third variable is added to the analysis, things change entirely; in fact, this more probing analysis indicates that actually women were faring slightly better than men in the admissions process.

The problem turned out to be that women tended to apply to the more selective programs at UCB. In applying to departments with lower admissions rates, the women misleading looked to be the subject of bias. Once the department variable is taken into account, the problem is cleared up.

But how might the “paradox” have been avoided in the first place? It will be shown here that (if the relevant variables are there), viewing them in the proper order can be very helpful.

So let’s start up R and take a look. Actually, the UCB data is so famous that it is included in R! Let’s load the data:

> ucb <- UCBAdmissions

Our X, Y and Z here will be Admit (“Admitted” or “Rejected”), Gender (“Male” or “Female”) and Dept. In that last case, six departments are included, referred to as A, B, C, D, E and F.

The data is provided as an R table, which is great, as the R function loglin() expects its input in this form. So we can use R’s table capabilities, for instance determining the total number of observations in the data set:

> sum(ucb)
[1] 4526

We’ll use the log-linear model methodology. Again see my open-source textbook if you are not familiar with this approach, but here is a quick summary, in the present context:

Let pijk denote the population probability of cell (i,j,k), so that for example p112 is the proportion of applicants who are admitted, male and apply to Department B. The model writes the log of pijk (or of the expected cell count) as an ANOVA-style expression of sums of main effects and interaction terms, the latter representing various kinds of dependencies among the variables.

Now we’ll perform a log-linear model analysis, specifying a model that consists of all terms through second-order:

> llout <- loglin(ucb,list(1:2,c(1,3),2:3),param=TRUE)

Note carefully the argument param = TRUE. To me, one of the most unfortunate aspects of log-linear analysis as it is commonly practiced is that it is significance testing-centric, rather than based on point or interval estimation. The fact that by default parameter estimates are not returned by loglin() reflects that, but at least the option is available, thus requested here. (If one uses glm() and a “Poisson trick,” one can also get standard errors, though with stepwise model fitting they shouldn’t be taken literally.)

Here we have specified the model to include all two-way interaction terms, and by implication all the main effects as well. Let’s look at some of the parameter estimates that result.

The constant term is 4.802457, which gives us something to compare the other parameter estimates to. Here are the main effects:


Admitted Rejected
-0.3212111 0.3212111


Male Female
0.3287569 -0.3287569


0.15376258 -0.76516841 0.53972054 0.43021534 -0.02881353 -0.32971651

These don’t really tell us much, other than again giving us an idea of what is considered large and small among the terms. The Dept main effects, by the way, are telling us which departments tend to have more or fewer applicants, interesting but not very relevant here.

Let’s look at some interaction terms. Since we are interested in admissions, let’s consider the interactions Admit-Gender and Admit-Department.


. Male Female
Admitted -0.02493703 0.02493703
Rejected 0.02493703 -0.02493703


. A B C D E F
Admitted 0.6371804 0.6154772 0.005914624 -0.01010004 -0.232437 -1.016035
Rejected -0.6371804 -0.6154772 -0.005914624 0.01010004 0.232437 1.016035

What a difference! In terms of association with Admit, the Dept variable relation is much stronger than that of Gender, meaning that most of the estimated parameters are much larger in the former case. In other words, Dept is the more important variable, not Gender.

And, the results above also show that there is a positive Admit-Female interaction, i.e. that women fare slightly better than men.

Let’s look at this again from a “forward stepwise regression” point of view. Instead of fitting a three-variable model right away, we could first fit the two-variable models Admit-Gender and Admit-Dept. For the latter, for instance, we could run

> loglin(ucb,list(c(1,3)),param=TRUE)

The result is not shown here, but it does reveal that Dept has a much stronger association with Admit than does Gender. We would thus take Admit-Dept as our “first-step” model, and then add Gender to get our second-step model (again, “step” as in the spirit of stepwise regression). That would produce the model we analyzed above, WITH NO PARADOXICAL RESULT.

By contrast, if we “entered the variables” (continuing the stepwise regression analogy) in the wrong order, i.e. used Gender for our first-step model and then added Dept to produce the second-step model, we get a direction reversal–women appear to be disadvantaged in the first-step model, but then turn out to be advantaged once run runs the second-step model.

We could make the regression analogy stronger by actually running logistic regressions, with glm(), instead of using the log-linear model (though the parameter estimates and population values would be somewhat different).

Of course, all of this assumes that we are insightful enough to include Dept in our data set in the first place. But in terms of Simpson’s Paradox, the analysis here could go a long way toward dispelling the notion that there is a paradox at all.

Use of freqparcoord for Regression Diagnostics

This is the third in my series of three posts on my package freqparcoord with Yingkang Xie. (My next post after this will show how to use R to explore one of my favorite examples of “what can go wrong” in statistics.)

Here is a very brief review of my previous posts regarding freqparcoord. A parallel coordinates plot draws one vertical axis for each variable, and then draws each data point as a segmented line connecting the values of the variables for that data point. This typically results in heavy screen clutter, but freqparcoord solves the problem by plotting only the data points with the largest estimated multivariate density value.

The package can also be used for outlier hunting (it plots the points with the smallest density values) and cluster hunting (the points corresponding to local maxima of the density are plotted).

In this posting, I will demonstrate yet another use for the package: regression diagnostics. Note that the term regression here means the conditional mean, so classification problems, e.g. logistic regression, are covered too. Specifically, the package includes a function regdiag() that is used to assess fit of a parametric model, using the main function of the package, freqparcoord().

The basic idea is this: The function computes nonparametric estimates of the regression function at all data points, and then computes the difference between the parametric and nonparametric fits at each point; we call those differences the divergences. The latter become the first vertical axis, and then each of the predictor variables has its own vertical axis. The resulting plots then give the user a picture of where in the predictor space the parametric model often overpredicts, and where it often underpredicts. (Note the role of the nonparametric estimates, so that the divergences are NOT the residuals, and note that we are only plotting the “most typical” values of the divergences, due to the use of density by freqparcoord.)

In addition, in the case in which we make use of the output of R’s lm() linear regression function, regdiag() also makes available the adjusted R2 value for the linear model and a corresponding value (squared correlation of actual Y and predicted Y) for the nonparametric fit.   If the nonparametric fit is  substantially better here, it is an indication that our parametric model may be “leaving money on the table,” i.e. there is room for improvement, say by adding second-order terms.

We use k-NN for the nonparametric fit, by the way, and the estimate at a point uses only genuine neighbors, i.e. the point itself is excluded.  Future versions may use a local-linear approximation, in order to deal with aliasing effects.  All data is centered and scaled.

Here an example, using the prgeng data set from the package.  This is 2000 Census data for programmers and engineers in Silicon Valley.   The R code is

pg <- prgeng
pg1 <- pg[pg$wageinc >= 40000 & pg$wkswrkd >= 48,]
pg1$ms <- as.integer(pg1$educ == 14)
pg1$phd <- as.integer(pg1$educ == 16)
pg1$csee <- as.integer(pg1$occ == 140 | pg1$occ == 141)
pg1$msphd <- pg1$ms + pg1$phd
l1 <- lm(wageinc ~ age+msphd+csee+sex,data=pg1)
p <- regdiag(l1,tail=0.40)

What does this code do? It: loads the data; screens out the odd cases; creates dummy variables for MS/PhD degree and CS/EE job; runs lm(); plugs the output of lm() into regdiag(); “prints” the graph on the screen; and prints out the parametric and nonparametric R2 values.

The latter values are 0.055 and 0.062. These are both low, but the point here is that it does appear that there is room for improvement in the parametric model, possibly by adding second-order terms to the model. Well, then, which second-order terms? The graph gives us some insight here:


The regdiag() function plots the typical large negative divergences on top, the typical large positive ones on the bottom, and the middling ones in the middle. We see a sharp difference in divergences between the top and bottom plots for the Age variable, but not for the others. This suggests adding an Age2 term to the parametric model, but not adding any interaction terms. After doing this (not shown here), the adjusted R2 value for the parametric model becomes comparable to the nonparametric one.

Let’s do a simulation, so we can see how this all works in a known setting:

n <- 10000
x <- matrix(rnorm(3*n),ncol=3)
x <- cbind(x,sample(0:1,n,replace=T))
x <- cbind(x,sample(0:1,n,replace=T))
proddumms <- x[,4] * x[,5]
x <- cbind(x,proddumms)
mx <- x %*% rep(1,6)
y <- mx + rnorm(n,sd=1)
l1 <- lm(y ~ x[,1:5])
p <- regdiag(l1,tail=0.40)

The last two predictors in the original data are dummies, and we add a product term for interaction. We then run lm() without the product term, pretending we don’t know about it, to see if regdiag() will notice. Indeed it does:


The dummy variables are V5 and V6, and the plot indeed shows a lot of activity there! Thus the plot indicates a need for adding an interaction term, as it should.

More on freqparcoord

In a previous post, I introduced my new package with Yingkang Xie, freqparcoord. Here I’ll illustrate some of the other uses to which the package can be applied.

The freqparcoord package visualizes multivariate data by plotting the most frequent cases in the data, as defined by multivariate density estimation. The example in the previous post illustrated typical height, weight and age differences by position among baseball players.

But the package allows plotting the least frequent cases–perfect for outlier hunting. Let’s apply this to the baseball data, say finding 3 outliers for each position:

> library(freqparcoord)
> data(mlb)
> freqparcoord(mlb,-3,4:6,7)

Here columns 4:6 are height, weight and age, while column 7 is position; under default settings, the data will be faceted vertically by position. Here is what is displayed:

All variables are centered and scaled. To get a better idea as to what the extreme values are, we can set the keepidxs argument, and then show the original data corresponding to the displayed lines:

> p <- freqparcoord(mlb,-3,4:6,7,keepidxs=4)
> p
> p$xdisp[,c(1,4:7)]
               Name Height Weight   Age PosCategory
237  Ivan_Rodriguez     69    218 35.25     Catcher
994      So_Taguchi     70    163 37.66  Outfielder
674    Julio_Franco     73    188 48.52   Infielder
964     Barry_Bonds     74    228 42.60  Outfielder
36        Toby_Hall     75    240 31.36     Catcher
35  A.J._Pierzynski     75    245 30.17     Catcher
891  Mike_Restovich     76    257 28.16  Outfielder
547      Tony_Clark     79    245 34.71   Infielder
155   C.C._Sabathia     79    290 26.61     Pitcher
275   Richie_Sexson     80    237 32.17   Infielder
559   Randy_Johnson     82    231 43.47     Pitcher
910       Jon_Rauch     83    260 28.42     Pitcher

Sexson was flagged due to his height and weight, while Franco emerged because he is 48 years old!

Our package can also be used for cluster hunting. Here we again look for the most frequent cases, but now locally most frequent, meaning that their estimated density values are local maxima.

Let’s try it on simulated data, with known clusters. We’ll generate from a mixture of 3 bivariate normals, with means at (0,0), (1,2) and (3,3). (The package includes a function rmixmvnorm() for such experiments.) Here are the results, both graphical and text:

> cv <- 0.5*diag(2)
> x<- rmixmvnorm(10000,2,3,list(c(0,0),c(1,2),c(3,3)),
+ list(cv,cv,cv))
> p <- freqparcoord(x,m=1,method="locmax",
+ keepidxs=1,k=50,klm=600)
[,1] [,2]
[1,] -0.3556997 0.2423760
[2,] -0.0993228 -0.2510209
[3,] 0.2507786 0.1883437
[4,] 1.1386850 2.2073467
[5,] 2.7581227 2.8935957


Of course, the results depend on the arguments, with default values being used here. The reader may wish to experiment with other values of klm. But the data clearly fall into 3 clusters, the correct number, centered near (0,0), (1,2) and (3,3), the correct locations. Note that we did NOT specify the number of clusters.

And there’s more! Watch this space for future posts.

The package is on CRAN but the latest version is here.

Adding Annotation to R Objects

When you take a photograph, you can include the date in the image, so you remember when you created it.  (In fact, under EXIF format, it’s stored in the image file anyway, even if it doesn’t appear in the picture.)  Wouldn’t it be nice to make annotations in the objects you create under R?

For example, here is a random forests analysis I did on some Census data:

pg <- prgeng
pggrd <- pg[pg$educ >= 14,]
rf1 <- randomForest(wageinc ~ age,data=pggrd)

The R object rf1 here has various components, which you can check via the call names(rf1).

But since rf1 is an S3 object, it is thus just a list, and one can add components to a list object at any time. So, one can type, say,

rf1$daterun <- date()

The point is that the date is now part of the object, and when you save the object to a file, e-mail it to someone else and so on, the date can always be checked.

I can also save my set up code there too. I could for example make a function from my setup code, say using edit(), and then assign it to rf1:

> f
function() {
   pg <- prgeng
   pggrd <- pg[pg$educ >= 14,]
   rf1 <<- randomForest(wageinc ~ age,data=pggrd)
> rf1$setupcode <-f
# check it
> rf1$setupcode
function() {
   pg <- prgeng
   pggrd <- pg[pg$educ >= 14,]
   rf1 <<- randomForest(wageinc ~ age,data=pggrd)

You can do this with other S3 objects returned from R functions, e.g. plot objects of type “gg” returned from ggplot2 operations.

Some “purists” may object to tinkering with objects like this. If you have such an objection, you can create a formal subclass of the given class, e.g. one named “mygg” in the plot example. (If the object is of S4 type, you’ll need to do this anyway.)