I’m deeply grateful to Hui Lin and the inimitable Yihui Xie for arranging for me to give a “virtual seminar talk” to the Central Iowa R Users Group. You can view my talk, including an interesting Q&A session, online. (The actual start is at 0:34.) There are two separate topics, my regtools package (related to my forthcoming book, From Linear Algebra to Machine Learning: Regression and Classification, with Examples in R), and the recent ASA report on p-values.
I will be an invited speaker at GTC 2016, a large conference on GPU computation. The main topic will be the use of GPUs in conjunction with R, and I will also speak on my Software Alchemy method, especially in relation to GPU computing.
GTC asked me to notify my “network” about the event, and this blog is the closest thing I have. My talk is on April 7 at 3 pm, Session S6708. I hope to see some of you there.
The March 28 issue of Bloomberg Businessweek has a rather good summary of the problems of p-values, even recommending the use of confidence intervals and — wonder of wonders — “[looking] at the evidence as a whole.” What, statistics can’t make our decisions for us?
It does make some vague and sometimes puzzling statements, but for the p-values issue to actually find its way into such a nontechnical, mainstream publication as this one is pretty darn remarkable. Thank you, ASA!
The article, “Lies, Damned Lies and More Statistics,” is on page 12. Unfortunately, I can’t find it online.
In my previous posts on the p-value issue, I took issue with the significance test orientation of the R language. I hope articles like this will push the R Core Team in the right direction.
I highly recommend the blog post by Yoav Benjamini and Tal Galili in defense of (carefully used) p-values. I disagree with much of it, but the exposition is very clear, and there is a nice guide to relevant R tools, including for simultaneous inference, a field in which Yoav is one of the most prominent, indeed pre-eminent, researchers. I do have a few points to make.
First, regarding exactly what the ASA said, I would refer readers to my second post on the matter, which argues that the ASA statement was considerably stronger than Yoav and Tal took it to be.
Second, Yoav and Tal make the point that one can’t beat p-values for simplicity of assumptions. I’d add to that point the example of permutation tests. Of course, my objections remain, but putting that aside, I would note that I too tend to be a minimalist in assumptions — I’ve never liked the likelihood idea, for instance — and I would cite my example in my second post of much-generalized Scheffe’ intervals as an example. Those who read my 50% draft book on regression and classification will see this as a recurring theme.
I of course agree strongly with Yoav and Tal’s point about problems with just checking whether a confidence interval contains 0, a point I had made too.
What I would like to see from them, though, is what I mentioned several times in the last couple of days — a good, convincing example in which p-values are useful. That really has to be the bottom line.
On Tuesday I commented here on the ASA (in their words) “Position on p-values: context, process, and purpose.” A number of readers replied, some of them positive, some mistakenly thinking I don’t think statistical inferences are needed, and some claiming I overinterpreted the ASA’s statement. I’ll respond in the current post, and will devote most of it to what I believe are the proper alternatives.
First, though, in order to address the question, “What did the ASA really mean?”, I think it may be helpful to discuss why the ASA suddenly came out with a statement. What we know is that the ASA statement itself opens with George Cobb’s wonderfully succinct complaint about the vicious cycle we are caught in: “We teach [significance testing] because it’s what we do; we do it because it’s what we teach.” The ASA then cites deep concerns in the literature, with quotes such as “[Significance testing] is science’s dirtiest secret” with “numerous deep flaws.”
As the ASA also points out, none of this is new. However, there is renewed attention, in a more urgent tone than in the past. The ASA notes that part of this is due to the “replicability crisis,” sparked by the Ioannidis paper, which among other things led the ASA to set up a last-minute (one might even say “emergency”) session at JSM 2014. Another impetus was a ban on p-values by a psychology journal (though I’d add that the journal’s statement led some to wonder whether their editorial staff was entirely clear on the issue).
But I also speculate that the ASA’s sudden action came in part because of a deep concern that the world is passing Statistics by, with our field increasingly being seen as irrelevant. I’ve commented on this as being sadly true, and of course many of you will recall then-ASA president Marie Davidian’s plaintive column title, “Aren’t WE Data Science?” I suspect that one of the major motivations for the ASA’s taking a position on p-values was to dispel the notion that statistics is out of date and irrelevant.
In that light, I stand by the title of my blog post on the matter. Granted, I might have used language like “ASA Says [Mostly] No to P-values,” but I believe it was basically a No. Like any statement that comes out of a committee, the ASA phrasing adopts the least common (read most status quo) denominator and takes pains not to sound too extreme, but to me the main message is clear. For example, I believe my MovieLens data example captures the essence of what the ASA said.
What is especially telling is that the ASA gave no examples in which p-values might be profitably used. Their Principle 1 is no more than a mathematical definition, not a recommendation. This is tied to the issue I brought up, that the null hypothesis is nearly always false a priori, which I will return to later in this blog.
Well, then, what should be done instead? Contrary to the impression I seem to have given some readers, I certainly do not advocate using visualization techniques and the like instead of statistical inference. I agree with the classical (though sadly underemphasized) alternative to testing, which is to form confidence intervals.
Let’s go back to my MovieLens example:
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.4725821  0.0482655  71.947  < 2e-16 ***
age         0.0033891  0.0011860   2.858  0.00436 **
gender      0.0002862  0.0318670   0.009  0.99284
...
The age factor is found to have a “highly significant” impact on movie rating, but in fact the point estimate, 0.0033891, shows that the impact is negligible; a 10-year difference in age corresponds to about 0.03 point in mean rating, minuscule in view of the fact that ratings range from 1 to 5.
A confidence interval for the true beta coefficient, in this case, (0.0011,0.0057), shows that. Tragically, a big mistake made by many who teach statistics is to check whether such an interval contains 0 — which entirely defeats the purpose of the CI. The proper use of this interval is to note that the interval is near 0, in fact even at its upper bound.
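The interval above can be recovered from the regression output alone. Here is a minimal sketch, assuming the usual large-sample normal approximation (estimate plus or minus about 1.96 standard errors); the variable names are just for illustration.

```r
# Point estimate and standard error for age, copied from the summary() output
est <- 0.0033891
se  <- 0.0011860

# Approximate 95% confidence interval via the normal approximation
ci <- est + c(-1, 1) * qnorm(0.975) * se
round(ci, 4)       # agrees with the interval (0.0011, 0.0057) quoted above

# The same interval, expressed as the effect of a 10-year age difference
round(10 * ci, 3)
```

Even at the upper end of the interval, a 10-year age difference moves the mean rating by less than 0.06 on a 1-to-5 scale.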
So slavish use of p-values would have led to an inappropriate conclusion here (the ASA’s point), and moreover, once we have that CI, the p-value is useless (my point).
My point was also that we knew a priori that H0: β1 = 0 is false. The true coefficient is not 0.0000000000… ad infinitum. In other words, the fundamental problem — “statistics’ dirtiest secret” — is that the hypothesis test is asking the wrong question. (I know that some of you have your favorite settings in which you think the null hypothesis really can be true, but I would claim that closer inspection would reveal that that is not the case.) So, not only is the test providing no additional value, once we have a point estimate and standard error, it is often worse than useless, i.e. harmful.
Problems like the above can occur with large samples (though there were only 949 users in the version of the movie data that I used). The opposite can occur with small samples. Let’s use the current U.S. election season. Say a staffer for Candidate X commissions a small poll, and the result is that a CI for p, the population proportion planning to vote for X, is (0.48,0.65). Yes, this contains 0.50, but I submit that the staffer would be remiss in telling X, “There is no significant difference in support between you and your opponent.” The interval is far more informative, in this case more optimistic, than that.
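To make the poll example concrete, here is a small sketch with made-up numbers. The poll size (131) and the count for X (74) are hypothetical, chosen only because they give roughly the interval quoted above; the interval is the simple normal-approximation one.

```r
# Hypothetical poll: 74 of 131 respondents plan to vote for X
n <- 131
x <- 74
phat <- x / n

# Approximate 95% CI for the population proportion p
se <- sqrt(phat * (1 - phat) / n)
ci <- phat + c(-1, 1) * qnorm(0.975) * se
round(ci, 2)

# The corresponding two-sided test of H0: p = 0.5
z <- (phat - 0.5) / sqrt(0.25 / n)
pval <- 2 * pnorm(-abs(z))
pval
```

The test says only "not significant at the 5% level," while the interval shows that most of the plausible values for p are above 0.50.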
One of the most dramatic examples of how harmful testing can be is a Wharton study in which the authors took real data and added noise variables. In the ensuing regression analysis, lo and behold, the fake predictors were found to be “significant.”
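The general phenomenon is easy to see in a simulation. The sketch below is not the Wharton authors' procedure or data; it uses simulated data, one genuine predictor, and 100 pure-noise predictors, all of which are assumptions for illustration.

```r
set.seed(2016)
n <- 500          # number of observations (arbitrary choice)
p_noise <- 100    # number of pure-noise predictors (arbitrary choice)

x_real <- rnorm(n)
y <- 1 + 2 * x_real + rnorm(n)                    # y depends only on x_real
noise <- matrix(rnorm(n * p_noise), n, p_noise)   # unrelated to y

fit <- lm(y ~ x_real + noise)

# p-values for the noise predictors only (drop intercept and x_real)
pvals <- summary(fit)$coefficients[-(1:2), "Pr(>|t|)"]
sum(pvals < 0.05)   # number of fake predictors declared "significant"
```

Since each noise coefficient has a true value of 0, testing at the 5% level should flag about 5 of the 100 by chance alone.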
Those of us who teach statistics (which I do in a computer science context, and did long ago as a member of a statistics department) have a responsibility to NOT let our students and our consulting clients follow their natural desire for simple, easy, pat answers, in this case p-values. Interpreting a confidence interval takes some thought, unlike p-values, which automatically make our decisions for us.
All this is fine for straightforward situations such as estimation of a single mean or a single regression coefficient. Unfortunately, though, developing alternatives to testing in more advanced settings can be challenging, even in R. I always praise R as being “Statistically Correct, written by statisticians for statisticians,” but regrettably, those statisticians are the ones George Cobb complained about, and R is far too testing-oriented.
Consider for instance assessing univariate goodness of fit for some parametric model. Note that I use the word assessing rather than testing, especially important in that the basic R function for the Kolmogorov-Smirnov procedure, ks.test(), is just what its name implies, a test. The R Core Team could easily remedy that, by including an option to return a plottable confidence band for the true CDF. And once again, the proper use of such a band would NOT be to check whether the fitted parametric CDF falls entirely within the band; the model might be quite adequate even if the fitted CDF strays outside the band somewhat in some regions.
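Such a band is not hard to compute by hand. The sketch below is one possible approach, not something ks.test() provides: it uses the Dvoretzky-Kiefer-Wolfowitz inequality to get a conservative 95% band around the empirical CDF. The data are simulated and the exponential model is an assumption chosen for illustration.

```r
set.seed(1)
x <- rexp(200, rate = 2)     # simulated data
n <- length(x)
alpha <- 0.05

# Half-width of the DKW band: P(sup |Fn - F| > eps) <= alpha
eps <- sqrt(log(2 / alpha) / (2 * n))

Fn <- ecdf(x)
grid <- sort(x)
lower <- pmax(Fn(grid) - eps, 0)
upper <- pmin(Fn(grid) + eps, 1)

# Fitted parametric CDF (exponential, rate estimated from the data)
fitted <- pexp(grid, rate = 1 / mean(x))

# Assess, rather than test: how far does the fitted CDF stray, and where?
max(abs(Fn(grid) - fitted))
mean(fitted >= lower & fitted <= upper)

# To view the band:
# plot(Fn); lines(grid, lower, lty = 2); lines(grid, upper, lty = 2)
# lines(grid, fitted, col = "red")
```

The point of plotting is to see how large the discrepancies are and in which regions they occur, not to produce a yes/no verdict.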
Another example is that of log-linear models. This methodology is so fundamentally test-oriented that it may not be clear to some what might be done instead. But really, it’s the same principle: Estimate the parameters and obtain their standard errors. If for example you think a model with two-way interactions might suffice, estimate the three-way interactions; if they are small relative to the lower-order ones, you might stop at two. (Putting aside here the issue of after-the-fact inference, a problem in any case.)
But even that is not quite straightforward in R (I’ve never used SAS, etc. but they are likely the same). The loglin() function, for instance, doesn’t even report the point estimates unless one pro-actively requests them — and even if requested, no standard errors are available. If one wants the latter, one must use glm() with the “Poisson trick.”
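Here is a sketch of both routes, using R's built-in UCBAdmissions table as a stand-in example (the choice of data set is mine, purely for illustration). Note that loglin() and glm() use different parameterizations by default, so the estimates are not directly comparable number for number.

```r
# Built-in 2 x 2 x 6 table: Admit x Gender x Dept
tbl <- UCBAdmissions

# loglin(): point estimates appear only if param = TRUE; no standard errors
ll <- loglin(tbl, margin = list(c(1, 2), c(1, 3), c(2, 3)),
             param = TRUE, print = FALSE)
names(ll$param)

# "Poisson trick": fit the cell counts with glm(), which reports standard errors
cells <- as.data.frame(tbl)   # columns Admit, Gender, Dept, Freq
sat <- glm(Freq ~ Admit * Gender * Dept, family = poisson, data = cells)
coefs <- summary(sat)$coefficients

# Estimates and standard errors for the three-way interaction terms
three_way <- grep(":.*:", rownames(coefs))
coefs[three_way, c("Estimate", "Std. Error")]
```

One can then compare the size of the three-way terms, along with their standard errors, to the lower-order terms in deciding whether a two-way model might suffice.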
In the log-linear situation, one might just informally look at standard errors, but if one wants formal CIs on many parameters, one must use simultaneous inference techniques, which brings me to my next topic.
The vast majority of those techniques are, once again, test-oriented. One of the few, the classic Scheffe’ method, is presented in unnecessarily restrictive form in textbooks (linear model, normality of Y, homoscedasticity, F-statistic and so on). But with suitable centering and scaling, quadratic forms of asymptotically normally distributed vectors have an asymptotic chi-squared distribution, which can be used to get approximate simultaneous confidence intervals. R should add a function to do this on vcov() output.
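A rough sketch of what such a function might look like follows. The function name scheffe_ci() and its arguments are hypothetical, not part of R, and the intervals are only approximate, resting on the asymptotic normality of the estimator. The idea is that, simultaneously for all linear combinations c'β, the intervals c'b ± sqrt(q) × s.e.(c'b) hold with approximately the stated probability, where q is the chi-squared quantile with degrees of freedom equal to the number of parameters.

```r
# Approximate simultaneous (Scheffe-type) CIs from coef() and vcov() output.
# Each row of cmat defines one linear combination of the coefficients.
scheffe_ci <- function(fit, cmat = diag(length(coef(fit))), level = 0.95) {
  b <- coef(fit)
  V <- vcov(fit)
  est <- drop(cmat %*% b)
  se <- sqrt(diag(cmat %*% V %*% t(cmat)))
  mult <- sqrt(qchisq(level, df = length(b)))   # chi-squared multiplier
  cbind(estimate = est, lower = est - mult * se, upper = est + mult * se)
}

# Usage on a built-in data set (mtcars chosen only for illustration)
fit <- lm(mpg ~ wt + hp, data = mtcars)
ci <- scheffe_ci(fit)
rownames(ci) <- names(coef(fit))
ci
```

Because it needs only coef() and vcov(), the same function would work for glm() fits and other models supplying those two methods. The price of simultaneity is wider intervals than the usual one-at-a-time ones.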
In short, there is a lot that could be done, in our teaching, practice and software. Maybe the ASA statement will inspire some in that direction.
(Note: Please see followup post.)
Sadly, the concept of p-values and significance testing forms the very core of statistics. A number of us have been pointing out for decades that p-values are at best underinformative and often misleading. Almost all statisticians agree on this, yet they all continue to use them and, worse, teach them. I recall a few years ago, when Frank Harrell and I suggested that R place less emphasis on p-values in its output, there was solid pushback. One can’t blame the pusherbackers, though, as the use of p-values is so completely entrenched that R would not be serving its users well with such a radical move.
And yet, wonder of wonders, the American Statistical Association has finally taken a position against p-values. I never thought this would happen in my lifetime, or in anyone else’s, for that matter, but I say, Hooray for the ASA!
To illustrate the problem, consider one of the MovieLens data sets, consisting of user ratings of movies. There are 949 users. Here is an analysis in which I regress average rating per user against user age and gender:
> head(uu)
  userid age gender      occup   zip  avg_rat
1      1  24      0 technician 85711 3.610294
2      2  53      0      other 94043 3.709677
3      3  23      0     writer 32067 2.796296
4      4  24      0 technician 43537 4.333333
5      5  33      0      other 15213 2.874286
6      6  42      0  executive 98101 3.635071
> q <- lm(avg_rat ~ age + gender, data = uu)
> summary(q)
...
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.4725821  0.0482655  71.947  < 2e-16 ***
age         0.0033891  0.0011860   2.858  0.00436 **
gender      0.0002862  0.0318670   0.009  0.99284
...
Multiple R-squared:  0.008615, Adjusted R-squared:  0.006505
Woohoo! Double-star significance on age! P-value of only 0.004! Age is a highly-significant predictor of movie ratings! Older people give higher ratings!
Well, no. A 10-year age difference corresponds to only a 0.03 difference in ratings — quite minuscule in light of the fact that ratings take values between 1 and 5.
The problem is that with large samples, significance tests pounce on tiny, unimportant departures from the null hypothesis, in this case H0: βage = 0, and ironically declare this unimportant result “significant.” We have the opposite problem with small samples: The power of the test is low, and we will announce that there is “no significant effect” when in fact we may have too little data to know whether the effect is important.
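Both problems show up in a quick simulation. This is a sketch on simulated data; the true slope of 0.02 and the two sample sizes are arbitrary choices for illustration.

```r
set.seed(1)
slope <- 0.02   # tiny true effect, hypothetical

sim <- function(n) {
  x <- rnorm(n)
  y <- slope * x + rnorm(n)
  fit <- lm(y ~ x)
  c(n = n,
    summary(fit)$coefficients["x", c("Estimate", "Pr(>|t|)")],
    confint(fit)["x", ])
}

# Row 1: huge sample; row 2: small sample
res <- rbind(sim(1e6), sim(30))
res
```

With a million observations, the test declares the negligible slope "significant," while the confidence interval shows it to be tiny. With 30 observations, the interval is so wide that the data simply cannot say whether the effect is important.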
In addition, there is the hypocrisy aspect. Almost no null hypotheses are true in the real world, so performing a significance test on them is absurd and bizarre.
Speaking of hypocrisy: As noted above, instructors of statistics courses all know of the above problems, and yet teach testing anyway, with little or (likely) no warning about this dangerous method. Those instructors also do testing in their own work.
My hat is off to ASA for finally taking some action.
Nonnegative matrix factorization (NMF) is a popular tool in many applications, such as image and text recognition. If you’ve ever wanted to learn a little bit about NMF, you can do so right here, in this blog post, which will summarize the (slightly) longer presentation here. The R package NMF will be used as illustration.
Given a u x v matrix A with nonnegative elements, we wish to find nonnegative, rank-k matrices W (u x k) and H (k x v) such that
A ≈ WH
In other words, NMF is a form of dimension reduction.
Note that this means that column j of A is approximately a linear combination of the columns of W, with the coefficients being column j of H. W thus forms what I call a pseudobasis for the columns of A.
The larger the rank k, the better our approximation. But we typically hope that a good approximation can be achieved with k << v.
The matrices W and H are calculated iteratively, with one of the major methods being linear regression. Here is how:
We make initial guesses for W and H, say with random numbers. Now consider an odd-numbered iteration. Suppose just for a moment that we know the exact value of W, with H unknown. Then for each j we could “predict” column j of A from the columns of W. The coefficient vector returned by lm() will become column j of H. We do this for j = 1,2,…,v.
In even-numbered iterations, suppose we know H but not W. We could take transposes,
A’ ≈ H’ W’
and predict row i of A from the rows of H. Then we alternate until we reach convergence.
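The alternating scheme just described can be sketched in a few lines of base R. This is a bare-bones illustration on made-up data, not the algorithm used inside the NMF package. Since lm() can return negative coefficients, the sketch simply truncates them at 0, which is an assumption of mine (a common shortcut), and it runs a fixed number of passes rather than checking convergence.

```r
set.seed(1)
u <- 20; v <- 15; k <- 3

# Toy nonnegative matrix of exact rank k (made-up data)
A <- matrix(runif(u * k), u, k) %*% matrix(runif(k * v), k, v)

# Regression through the origin, with coefficients truncated at 0
nonneg_coef <- function(y, X) {
  b <- coef(lm(y ~ X - 1))
  b[is.na(b)] <- 0          # guard against a degenerate (all-zero) column
  pmax(b, 0)
}

# Initial guesses: random numbers
W <- matrix(runif(u * k), u, k)
H <- matrix(runif(k * v), k, v)

for (pass in 1:25) {
  # "odd" step: treat W as known, predict column j of A from the columns of W
  for (j in 1:v) H[, j] <- nonneg_coef(A[, j], W)
  # "even" step: treat H as known, predict row i of A from the rows of H
  tH <- t(H)
  for (i in 1:u) W[i, ] <- nonneg_coef(A[i, ], tH)
}

# Relative approximation error of A by WH
rel_err <- norm(A - W %*% H, "F") / norm(A, "F")
rel_err
```

Each inner loop is just v (or u) separate calls to lm(), which is what makes the regression view of NMF so easy to grasp.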
CRAN’s NMF package for NMF computation is quite versatile, with many, many options. In its simplest form, though, it is quite easy to use. For a matrix a and desired rank k, we simply run
> nout <- nmf(a,k)
The factors are then in nout@fit@W and nout@fit@H.
Though NMF is often used for image classification, with input data consisting of many images, here we will have only one image,
> library(pixmap)
> library(NMF)
> mtr <- read.pnm('MtRush.pgm')
> a <- mtr@grey
> aout <- nmf(a,50)
> w <- aout@fit@W
> h <- aout@fit@H
> approxa <- w %*% h
# brightness values must be in [0,1]
> approxa <- pmin(approxa,1)
> mtrnew <- mtr
> mtrnew@grey <- approxa
> plot(mtrnew)
The result is
This is understandably blurry. The original matrix has dimension 194 x 259, and thus presumably has rank 194. (This is confirmed for instance by running the function rankMatrix() in the Matrix package.) We’ve approximated the matrix by one of rank only 50, with a storage savings of about 55%: W and H together have 50 x (194 + 259) = 22,650 entries, versus 194 x 259 = 50,246 for the original. This is not important for one small picture, but possibly worthwhile if we have many large ones. The approximation is not bad in that light, and may be good enough for image recognition or other applications.
Indeed, in many if not most applications of NMF, we need to worry about overfitting, which in this context amounts to using too high a value for our rank, something to be avoided.