I’m one of many who bemoan the fact that statistics is typically thought of as — alas, even taught as — a set of formula-plugging methods. One enters one’s data, turns the key, and the proper answers pop out. This of course is not the case at all; arguably, statistics is as much an art as a science. Or as I like to put it, you can’t be an effective number cruncher unless you know what the crunching means.
One of the worst ways in which statistics can produce bad analysis is the use of significance testing. For sheer colorfulness, I like Professor Paul Meehl’s quote, “Sir Ronald [Fisher] has befuddled us, mesmerized us, and led us down the primrose path.” But nothing beats concrete examples, and I’ll give a couple here.
First, a quick one: I’m active in the Bay Area R Users Group, and a couple of years ago we had an (otherwise-) excellent speaker from one of the major social network firms. He mentioned that he had been startled to find that, with the large data sets he works with, “Everything is significant.” Granted, he came from an engineering background rather than statistics, but even basic courses in the latter should pound into the students the fact that, with large n, even tiny departures from H0 will likely be declared “significant.”
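Here is a toy illustration of the large-n phenomenon, using simulated data I made up for this post (the true mean of 0.02 is an arbitrary stand-in for a "tiny departure from H0"):

# a tiny departure from H0 (true mean 0.02 rather than 0), but a huge sample;
# the test will almost surely declare the difference "significant"
set.seed(1)
x <- rnorm(1000000, mean=0.02, sd=1)
t.test(x, mu=0)$p.value   # essentially 0, far below 0.05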
The problem is compounded by the simultaneous inference problem, which points out, in essence, that when we perform a large number of significance tests, with H0 true in all of them, we are still likely to find some of them “significant.” (Of course, this problem also extends to confidence intervals, the typical alternative that I and others recommend.)
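A quick way to see the simultaneous inference problem in action, again with made-up data:

# 100 tests, with H0 exactly true in every one of them;
# at the 0.05 level, around 5 will come out "significant" anyway
set.seed(2)
pvals <- replicate(100, t.test(rnorm(50), mu=0)$p.value)
sum(pvals < 0.05)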
My favorite example of this is a Wharton study in which the authors deliberately added fake variables to a real data set. And guess what! In the resulting regression analysis, all of the fake variables were found to be “significant” predictors of the response.
Let’s try our own experiment along these lines, using R. We’ll do model selection first by running lm() and checking which variables were found “significant.” This is a common, if unrefined, method for model selection. We’ll see that it too leads us astray. Another, much more sophisticated method for variable selection is the LASSO, so we’ll try that one too, with similarly misleading results; in fact, worse ones.
For convenience, I’ll use the data from my last post. This is Census data on programmers and engineers in Silicon Valley. The only extra operation I’ve done here (not shown) is to center and scale the data, using scale(), in order to make the fake variables comparable to the real ones in size. My data set, pg2n, includes 5 real predictors and 25 fake ones, generated by
> pg2n <- cbind(pg2,matrix(rnorm(25*nrow(pg2)),ncol=25))  # append 25 columns of pure N(0,1) noise
Applying R’s lm() function as usual,
> summary(lm(pg2n[,3] ~ pg2n[,-3]))  # regress the response, column 3, on all other columns
we find (output too voluminous to show here) that 4 of the 5 real predictors are found significant, but also 2 of the fake ones are significant (and a third has a p-value just slightly above 0.05). Not quite as dramatic as the Wharton data, which had more predictors than observations, but of a similar nature.
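If you would like to check the tallies without wading through that output, one way is the following; it relies on the fact that the 25 fake columns were cbind()-ed onto the end of pg2, so their coefficients are the last 25 rows of the coefficient table.

# p-values are column 4 of the coefficient table; the fake predictors are the last 25 entries
pvals <- summary(lm(pg2n[,3] ~ pg2n[,-3]))$coefficients[,4]
sum(tail(pvals,25) < 0.05)   # number of fake predictors declared "significant"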
Let’s also try the LASSO. This method, favored by some in machine learning circles, aims to reduce sampling variance by constraining the estimated coefficients to a certain limit. The details are beyond our scope here, but the salient aspect is that the LASSO estimation process will typically come up with exact 0s for some of the estimated coefficients. In essence, then, LASSO can be used as a method for variable selection.
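For readers who would like a bit more specificity, the standard formulation minimizes the usual sum of squares subject to a bound t on the sum of the absolute values of the coefficients:

$$ \min_\beta \; \sum_i (y_i - x_i'\beta)^2 \quad \textrm{subject to} \quad \sum_j |\beta_j| \le t $$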
Let’s use the lars package from CRAN for this:
> library(lars)
> larsout <- lars(pg2n[,-3], pg2n[,3], trace=T)
> summary(larsout)
LARS/LASSO
Call: lars(x = pg2n[, -3], y = pg2n[, 3], trace = T)
   Df   Rss       Cp
0   1 12818 745.4102
1   2 12696 617.9765
2   3 12526 440.5684
3   4 12146  40.7705
4   5 12134  29.1536
5   6 12121  17.7603
6   7 12119  17.4313
7   8 12111  11.5295
8   9 12109  11.3575
9  10 12106  10.6294
10 11 12099   4.9085
11 12 12099   6.2894
12 13 12098   8.1549
13 14 12098   9.0533
...
Again I’ve omitted some of the voluminous output, but here we see enough. LASSO determines that the best model under Mallows’ Cp criterion would include the same 4 variables identified by lm() — AND 6 of the fake variables! (The minimum Cp value, 4.9085, occurs at step 10, i.e. a model with 10 predictors: those 4 real ones plus 6 fake ones.)
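If you prefer not to eyeball the table, the Cp values are stored in the lars fit object, so something along these lines will pull out the Cp-minimizing step and show which coefficients ended up nonzero there:

# find the step with the smallest Cp, then inspect that step's coefficients;
# the nonzero ones are the variables LASSO has selected
beststep <- which.min(larsout$Cp)
betas <- coef(larsout)[beststep,]
which(betas != 0)   # positions of the selected variables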
Undoubtedly some readers will have good suggestions, along the lines of “Why not try such-and-such on this data?” But my point is that all this goes to show, as promised, that effective application of statistics is far from automatic. Indeed, in the 2002 edition of his book, Subset Selection in Regression, Alan Miller laments that “very little progress has been made” in this field since his first edition came out in 1990.
Statistics is indeed as much an art as a science.
I aim for approximately one posting per week to this blog. I may not be able to get one out next week, but will have one within two weeks for sure. The topic: progress on Rth, my parallel computation package for R, written with my collaborator Drew Schmidt of pbdR fame.
I think the example more accurately shows that not respecting the assumptions of a statistical test leads to errors. The data here are observational, not experimental, so some of the assumptions of standard significance tests might not be met. This is usually a problem with observational studies, leading to the canard “correlation is not causation.”
This is evidence that people should be much more sensitive to the assumptions, or should be worried about mis-specification, I think.
You also throw in a side comment about significance testing being especially bad: ‘One of the worst ways in which statistics can produce bad analysis is the use of significance testing. For sheer colorfulness, I like Professor Paul Meehl’s quote, “Sir Ronald [Fisher] has befuddled us, mesmerized us, and led us down the primrose path.”’ But surely if you improperly use ANY statistical technique, you will get junk results (arguably that is even something we want!). There is nothing especially bad about significance testing in this respect, contrary to what is normally claimed. I don’t think I would trust a test/technique that produces the “right” results when improperly used or when the assumptions are violated – I don’t even understand what that would mean, to be honest.
Actually, my example was not really observational data. This is the Census Bureau’s 5% sample, so there is genuine i.i.d. sampling at work, drawn from a (near) population. (“Near” because the Bureau doesn’t manage to get surveys from every single person in the population.)
The issue of observational studies in general is not quite as clear-cut as it is often claimed to be. In fact, bias is just as much an issue as i.i.d.-ness. Even in a fully designed experiment, model bias can compromise the results. In fitting a linear regression model, for instance, inaccuracies in the model will inflate s², thus distorting the standard errors.
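Here is a toy illustration of that last point, with simulated data having nothing to do with the Census example: fit a straight line to data whose true relation is quadratic, and the estimated residual standard error comes out far larger than the true noise level.

# true relation is quadratic, fitted model is linear;
# the resulting model bias inflates the residual variance estimate
set.seed(3)
x <- runif(200, -1, 1)
y <- x^2 + rnorm(200, sd=0.1)    # true noise standard deviation is 0.1
summary(lm(y ~ x))$sigma         # comes out well above 0.1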
One may be able to justify i.i.d.-based standard errors even in an observational study. In a clinical trial of a drug for hypertension with volunteer subjects, one might be able to justify the notion that these volunteers form a random sample from the population of all sufferers of hypertension, at least those in that region or the like. On the other hand, relying on volunteers may produce sampling bias. It’s a thorny issue, absolutely, but not an entirely hopeless one.
My bottom line has always been that reported standard errors should not be taken literally, but they are useful as broad indications. My point about significance testing, though, is that it worsens an already-murky situation, and needlessly so.
If you wanted to show that automated feature selection is not error-free, you did that. Of course we already knew that. But you haven’t shown that a non-automatic approach is any better than an automatic one.
You are right, of course. But when I described this as my favorite example, I meant as a tool to quickly illustrate the problem for the majority of users of statistics, who do NOT “already know that.” See my point about our speaker who was startled to find that “Everything is significant” in his big data sets.
Great examples, and a great topic!
Regarding the art (and science, as well) of statistics and analysis, if you just throw variables into a formula to see “what is significant,” I believe you will always run into these problems. However, if you have some sort of hypothesis, or a theory about a mechanism of action, that leads you to include (or not include) certain variables, you will be on much firmer ground. In my work, what can be measured regarding human behavior is just a small fraction of all the variables salient to that behavior, so I’m cautious, aware that my findings are a reflection of the particular variables I am using; with other variables, my findings might be very different.
— even basic courses in the latter should pound into the students the fact that, with large n, even tiny departures from H0 will likely be declared “significant.”
There are a number of issues dealt with here, but this is the most significant(?!). Too often, maybe even mostly or always, these Big Data folks are processing population data, so they should revert to the first chapter of their Baby Stat book and realize that they’re just doing descriptive statistics. The signal example is the Target story of the pregnant girl and her father. Target found some level of correlation among some number of variables in its customer population, high enough to induce it to send ads. There is no inference here: their analysts calculated some distribution and its parameters with certainty. There was then a policy decision to fire off the ads.