Further Comments on the ASA Manifesto

On Tuesday I commented here on the ASA’s (in their words) “Position on p-values: context, process, and purpose.” A number of readers replied, some of them positive, some mistakenly thinking I believe statistical inference is unnecessary, and some claiming I overinterpreted the ASA’s statement. I’ll respond in the current post, devoting most of it to what I believe are the proper alternatives.

First, though, in order to address the question, “What did the ASA really mean?”, I think it may be helpful to discuss why the ASA suddenly came out with a statement. What we know is that the ASA statement itself opens with George Cobb’s wonderfully succinct complaint about the vicious cycle we are caught in: “We teach [significance testing] because it’s what we do; we do it because it’s what we teach.” The ASA then cites deep concerns in the literature, with quotes such as “[Significance testing] is science’s dirtiest secret” with “numerous deep flaws.”

As the ASA also points out, none of this is new. However, there is renewed attention, in a more urgent tone than in the past. The ASA notes that part of this is due to the “replicability crisis,” sparked by the Ioannidis paper, which among other things led the ASA to set up a last-minute (one might even say “emergency”) session at JSM 2014. Another impetus was a ban on p-values by a psychology journal (though I’d add that the journal’s statement led some to wonder whether their editorial staff was entirely clear on the issue).

But I also speculate that the ASA’s sudden action came in part because of a deep concern that the world is passing Statistics by, with our field increasingly being seen as irrelevant. I’ve commented on this as being sadly true, and of course many of you will recall then-ASA president Marie Davidian’s plaintive column title, “Aren’t WE Data Science?” I suspect that one of the major motivations for the ASA’s taking a position on p-values was to dispel the notion that statistics is out of date and irrelevant.

In that light, I stand by the title of my blog post on the matter. Granted, I might have used language like “ASA Says [Mostly] No to P-values,” but I believe it was basically a No. Like any statement that comes out of a committee, the ASA phrasing adopts the least common (read most status quo) denominator and takes pains not to sound too extreme, but to me the main message is clear. For example, I believe my MovieLens data example captures the essence of what the ASA said.

What is especially telling is that the ASA gave no examples in which p-values might be profitably used. Their Principle 1 is no more than a mathematical definition, not a recommendation. This is tied to the issue I brought up, that the null hypothesis is nearly always false a priori, which I will return to later in this blog.

Well, then, what should be done instead? Contrary to the impression I seem to have given some readers, I certainly do not advocate using visualization techniques and the like instead of statistical inference. I agree with the classical (though sadly underemphasized) alternative to testing, which is to form confidence intervals.

Let’s go back to my MovieLens example:


Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.4725821  0.0482655  71.947  < 2e-16 ***
age         0.0033891  0.0011860   2.858  0.00436 **
gender      0.0002862  0.0318670   0.009  0.99284
...

The age factor is found to have a “highly significant” impact on movie rating, but in fact the point estimate, 0.0033891, shows that the impact is negligible; a 10-year difference in age corresponds to only about a 0.03-point difference in mean rating, minuscule in view of the fact that ratings range from 1 to 5.

A confidence interval for the true beta coefficient, in this case (0.0011, 0.0057), shows that. Tragically, a big mistake made by many who teach statistics is to check whether such an interval contains 0, which entirely defeats the purpose of the CI. The proper use of this interval is to note that the entire interval is near 0, even at its upper bound.

So slavish use of p-values would have led to an inappropriate conclusion here (the ASA’s point), and moreover, once we have that CI, the p-value is useless (my point).
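
For concreteness, here is a minimal sketch of how such an interval can be extracted in R; the object name lmout is hypothetical, standing in for whatever lm() fit produced the output above.

# lmout is a hypothetical name for the lm() fit shown above
confint(lmout, "age", level = 0.95)
# or, roughly, an approximate normal-based interval from the summary output:
co <- summary(lmout)$coefficients
co["age", "Estimate"] + c(-1, 1) * 1.96 * co["age", "Std. Error"]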

My point was also that we knew a priori that H0: β1 = 0 is false. The true coefficient is not 0.0000000000… ad infinitum. In other words, the fundamental problem, “statistics’ dirtiest secret,” is that the hypothesis test is asking the wrong question. (I know that some of you have your favorite settings in which you think the null hypothesis really can be true, but I would claim that closer inspection would reveal that that is not the case.) So, not only does the test provide no additional value once we have a point estimate and standard error, it is often worse than useless, i.e. harmful.

Problems like the above can occur with large samples (though there were only 949 users in the version of the movie data that I used). The opposite can occur with small samples. Let’s use the current U.S. election season. Say a staffer for Candidate X commissions a small poll, and the result is that a CI for p, the population proportion planning to vote for X, is (0.48,0.65). Yes, this contains 0.50, but I submit that the staffer would be remiss in telling X, “There is no significant difference in support between you and your opponent.” The interval is far more informative, in this case more optimistic, than that.
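
As a small illustration, here is how one might compute such an interval in R; the counts below are made up, chosen only so that the resulting interval roughly matches the one quoted above.

# hypothetical poll result: 63 of 112 respondents favor Candidate X
prop.test(63, 112)$conf.int   # Wilson-type interval with continuity correction
# or the simple normal-approximation interval, roughly (0.47, 0.65):
phat <- 63/112
phat + c(-1, 1) * 1.96 * sqrt(phat * (1 - phat) / 112)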

One of the most dramatic examples of how harmful testing can be is a Wharton study in which the authors took real data and added noise variables. In the ensuing regression analysis, lo and behold, the fake predictors were found to be “significant.”
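
One can see a toy version of that phenomenon (not the Wharton study itself, just simulated pure noise) in a few lines of R:

# regress pure noise on 50 pure-noise predictors; by chance alone,
# a few of the predictors will typically be flagged as "significant"
set.seed(1)          # arbitrary seed, for reproducibility
n <- 100
y <- rnorm(n)
x <- matrix(rnorm(n * 50), n, 50)
summary(lm(y ~ x))   # look at the Pr(>|t|) column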

Those of us who teach statistics (which I do in a computer science context, and did long ago as a member of a stat department) have a responsibility to NOT let our students and our consulting clients follow their natural desire for simple, easy, pat answers, in this case p-values. Interpreting a confidence interval takes some thought, unlike p-values, which automatically make our decisions for us.

All this is fine for straightforward situations such as estimation of a single mean or a single regression coefficient. Unfortunately, though, developing alternatives to testing in more advanced settings can be challenging, even in R. I always praise R as being “Statistically Correct, written by statisticians for statisticians,” but regrettably, those statisticians are the ones George Cobb complained about, and R is far too testing-oriented.

Consider for instance assessing univariate goodness of fit for some parametric model. Note that I use the word assessing rather than testing; the distinction matters here, because the basic R function for the Kolmogorov-Smirnov procedure, ks.test(), is just what its name implies, a test. The R Core Team could easily remedy that, by including an option to return a plottable confidence band for the true CDF. And once again, the proper use of such a band would NOT be to check whether the fitted parametric CDF falls entirely within the band; the model might be quite adequate even if the fitted CDF strays outside the band somewhat in some regions.
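
To illustrate the kind of band I have in mind, here is a rough sketch based on the Dvoretzky-Kiefer-Wolfowitz inequality; the function name cdfband and the overlaid normal fit are hypothetical, not existing R features.

# plottable simultaneous confidence band for the true CDF, via the DKW inequality
cdfband <- function(x, alpha = 0.05) {
   n <- length(x)
   eps <- sqrt(log(2 / alpha) / (2 * n))   # DKW half-width
   Fn <- ecdf(x)
   xs <- sort(x)
   plot(Fn, main = "ECDF with simultaneous confidence band")
   lines(xs, pmin(Fn(xs) + eps, 1), lty = 2)
   lines(xs, pmax(Fn(xs) - eps, 0), lty = 2)
   invisible(eps)
}
# e.g. assess a normal model for a data vector z (hypothetical) visually:
# cdfband(z); curve(pnorm(x, mean(z), sd(z)), add = TRUE, col = "red")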

Another example is that of log-linear models. This methodology is so fundamentally test-oriented that it may not be clear to some what might be done instead. But really, it’s the same principle: Estimate the parameters and obtain their standard errors. If for example you think a model with two-way interactions might suffice, estimate the three-way interactions; if they are small relative to the lower-order ones, you might stop at two. (Putting aside here the issue of after-the-fact inference, a problem in any case.)

But even that is not quite straightforward in R (I’ve never used SAS, etc., but they are likely the same). The loglin() function, for instance, doesn’t even report the point estimates unless one proactively requests them, and even if requested, no standard errors are available. If one wants the latter, one must use glm() with the “Poisson trick.”
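
Here is a minimal sketch of that approach, using the built-in UCBAdmissions table purely for illustration:

# the "Poisson trick": fit the log-linear model as a Poisson regression,
# so that glm() reports both point estimates and standard errors
df <- as.data.frame(UCBAdmissions)   # built-in 3-way contingency table
fit <- glm(Freq ~ Admit * Gender * Dept, family = poisson, data = df)
coefs <- summary(fit)$coefficients
# informally compare the three-way interaction terms to the lower-order ones
coefs[grep(":.*:", rownames(coefs)), c("Estimate", "Std. Error")]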

In the log-linear situation, one might just informally look at standard errors, but if one wants formal CIs on many parameters, one must use simultaneous inference techniques, which brings me to my next topic.

The vast majority of those techniques are, once again, test-oriented. One of the few exceptions, the classic Scheffé method, is presented in unnecessarily restrictive form in textbooks (linear model, normality of Y, homoscedasticity, F-statistic and so on). But with suitable centering and scaling, quadratic forms of asymptotically normally distributed vectors have an asymptotic chi-squared distribution, which can be used to get approximate simultaneous confidence intervals. R should add a function to do this on vcov() output.
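
A rough sketch of what such a function might look like (the name simulCI is hypothetical; this simply applies the chi-squared radius to each coefficient, yielding conservative simultaneous intervals):

# approximate Scheffe-style simultaneous CIs from coef() and vcov(),
# based on the asymptotic chi-squared distribution of the quadratic form
simulCI <- function(fit, alpha = 0.05) {
   b <- coef(fit)
   v <- vcov(fit)
   radius <- sqrt(qchisq(1 - alpha, df = length(b)) * diag(v))
   cbind(lower = b - radius, upper = b + radius)
}
# e.g. simulCI(fit) for the glm() fit in the log-linear example above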

In short, there is a lot that could be done, in our teaching, practice and software. Maybe the ASA statement will inspire some in that direction.


37 thoughts on “Further Comments on the ASA Manifesto”

  1. Significance testing, which is distinct from hypothesis testing, does not rely on the null hypothesis (beta=0) being true or possibly true; the zero value is only used as a reference point. So, most of your discussion about nulls never being true is irrelevant to the motivation of significance tests.

    I’d also like to echo earlier comments, that you are considerably over-stating what ASA actually said. Such statements are all good fun but ultimately misinformation.

      1. In that case you have a bit of learning to do! It is a major problem for the safe use of P-values that so many commenters and teachers of statistics do not understand or care about the distinction between significance tests and hypothesis tests.

        The distinction between significance testing and hypothesis testing is essentially one about evidence versus decision. Significance tests yield a P-value that stands as an index of the evidence in the data regarding the null hypothesized value of the parameter of interest in the statistical model. That P-value serves as a useful part of the answer to “what do the data say?”. A hypothesis test is a decision theory procedure that leads to a decision or action. A hypothesis test does not answer the question of what the data say, but instead can serve to answer the question of what to do or decide.

        In its original form, there was no P-value in a hypothesis test, as the observed value of the test statistic was compared to critical regions to determine whether the null hypothesis was to be accepted or rejected. Nowadays it is commonplace (unfortunately) to compare observed P-values with thresholds as a surrogate for seeing if the test statistic is in the critical region. That works, but it leads to confusion regarding how one should use or respond to a P-value.

        You really should get this stuff straight, because significance tests and hypothesis tests serve different purposes, and neither is an “NHST”. I’ve written a paper on the topic that might help: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3419900/

      2. Significance tests, developed by Fisher, reject the null or make no conclusion. Given these choices, one need not believe that beta=0 is plausible, just that it is a relevant baseline against which one can make comparisons. The latter condition is weaker than the former.

        Hypothesis tests, a modification of Fisher’s system by Neyman and Pearson, involve a different choice, where one rejects or accepts the null, and accepts/rejects the alternative. If one is going to accept beta=0 it had better be a plausible value.

        As the names above suggest, this distinction is not new; it’s in Cox’s 1977 paper on testing, for example. Google will provide several sets of lecture notes that make the distinction, usually alongside “pure significance tests” where p-values are viewed as just numeric summaries.

        1. Though there indeed was a controversy between the two at the time, it isn’t relevant to the concerns discussed in the ASA statement or the ones I’ve brought up. Your assertion of a “relevant baseline” is particularly contrary to those concerns, I believe.

      3. Norm: you write that distinguishing significance versus hypothesis tests “isn’t relevant to the concerns … I’ve brought up”.

        You claim that knowing beta=0 to be false invalidates the hypothesis testing approach. But you provide no argument that knowing beta=0 to be false invalidates significance testing. So it is relevant; your complaint about testing only addresses one form of testing.

        What the ASA statement says is a red herring, as quite deliberately it did not get into issues of alternative hypotheses or their primacy (or not) over test statistics.

        Section 2.8 of your draft book contains the same missing arguments and mis-statements. I hope you will address them as you further develop it.

        1. Thanks for raising some interesting questions.

          You wrote earlier,

          Significance testing, which is distinct from hypothesis testing, does not rely on the null hypothesis (beta=0) being true or possibly true; the zero value is only used as a reference point.

          I replied that, to me, the distinction doesn’t matter; I still have the same objections. There is no value that I can see in using a “reference point” that we know to be invalid; it’s even worse than taking that value to be a formal hypothesis.

      4. That sounds like an argument against all forms of testing. This is too bad, as there are plenty of practical situations when the only reasonable summaries are some form of yes/no statements, or when the data is so crappy that yes/no is about all they can usefully tell you. Testing is certainly not everything, not even the primary tool to use, but it has a role.

        Also; it’s Eicker, not Eickert; https://en.wikipedia.org/wiki/Friedhelm_Eicker

        1. I addressed this in replying to another poster. Although many, maybe most, situations require a yes/no decision in the end, that does NOT mean that you should use testing to make that decision.

  2. I’ve just been reading through the comments on the ASA statement, from a wide range of statistical and disciplinary authorities who were involved in the discussions that led to the statement – there are 21 in all, available at http://www.tandfonline.com/doi/suppl/10.1080/00031305.2016.1154108. For anyone seriously interested in the issues, I think they’re well worth reading, as they further develop the concerns reflected in the statement, and in addition they show a divergence in views that’s largely comparable to the range of views that have been expressed in the comments on this blog.
    FWIW John Ioannidis’ comments (no. 10 in the supplement, I think) seemed to me to be very similar to those you expressed in your first post here.

  3. Agreed that base R programming is a bit “testy” (consider the silliness of the significance stars that are part of lm() and many other functions). Also, good point to note that values close to the lower and upper CI limits are much less likely values of the parameter than middle values.
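
    For what it’s worth, the stars can at least be suppressed globally; a one-line sketch:

    options(show.signif.stars = FALSE)   # turns off the asterisks in summary() output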

  4. It seems the issue can be put much more simply: frequentist stat analysis doesn’t give comfortable answers to the question field researchers want answered: is my analysis successful? Successful meaning: does the treatment (experimental or observed) do what I expect?

    Enter the Bayesians who overlay “subjectivism” in order to generate arithmetic which purports to answer the question directly.

    The problem (for both The Frequentist and The Bayesian) is that the sample data is the sole objective reality with which a decision can be made. Given that restriction, NHST is the best arithmetic on offer. Complaining isn’t going to change that. Moreover, CI and p-value are calculated from the same data, and are just two different ways of expressing the likelihood(?) that the treatment worked. If one has done drug development, or kept up with clinical trial results in that field, it certainly isn’t the case that we generally know a priori that the null hypothesis is wrong. Quite the opposite. I suspect there are numerous other applied fields where this is so.

    The Bayesian is dissatisfied with the proof by contradiction (or, reductio ad absurdum) mechanism of NHST, so injects “prior knowledge” into the real data and thus creates arithmetic which purports to make direct statements about the sample data. Polluting the data to get a warm and fuzzy answer. Bah.

    Until a new theory of measurement, and thus statistics, is devised, comparing estimates of parameters to parameters or other estimates is what statistics *is*. And that requires using measures of difference. To date, what other alternative is there?

    If we’re going to worry about .05, at all, consider this. Most say that the number means there’s a 5% chance that specific test is wrong. But that’s not what my profs told me. What I was taught is that 5 out of 100 such tests will be wrong (for some definition of wrong). Of course, seldom do we conduct 100 identical tests. If one accepts the latter definition, then does .05 have any meaning in the context of a single, unrepeated, test? I’ve never found a satisfactory answer.

    1. I have the same request to you that I’ve made to a couple of others here: If you think there are real-life cases in which we don’t know a priori that the null hypothesis is false, please post one here.

        1. I believe you were the one who said yesterday that a composite null hypothesis, say, μ1 – μ2 ≤ 0 could be true. In response, I said things go wrong if one looks more closely. Here is what I had in mind:

          There is always measurement bias, which eventually dominates as the sample size goes to infinity. Labs change over time, a clinical trials procedure does not yield true iid structure, and so on. At that point, the null hypothesis becomes neither true nor false; it is simply meaningless; we are no longer measuring things to the precision implied by the crisply-stated H0.

      1. I second your question. I surmise that Mr. Young is suggesting that “new drugs may have no effect,” while you are suggesting that in general, “everything has some effect.”
        I don’t agree with either of these (recognizing that I’m putting words in both your mouths). Possible zero effect example: the phase of the moon may have zero effect on yields in a semiconductor fab. But that could still be worth testing if we had the right data. For example, 50 years ago science textbooks said that the correlation between sunspots and rabbits in Australia was pure coincidence. I gather that now we believe there is a genuine relationship. In other words, there could be some totally unknown link between them. But if there isn’t, then the phase of the moon is like astrology: the true effect is genuinely zero.
        Why will pharmaceuticals always have non-zero effect? By the time a new drug makes it to phase 3 trials, it is sure to be biologically active. Every drug affects a number of metabolic pathways in humans, some in ways that we like, and others in “bad” ways that we call side effects. There is likelihood zero that these effects are perfectly balanced. That is especially true when you consider the wide range of humans who will be taking the drug.

    2. The “sample data is the sole objective reality with which a decision can be made” is incorrect. There are previous inferences, which are posterior (predictive) priors, and this leads to a principled way of doing inferential fusion. Frequentist techniques blithely ignore burdens of multiple testing adjustment, and then roll in mechanisms like False Discovery Rate to compensate.

      The most damning aspect is that of sample size dependence, where no matter how good your model is, if a sufficiently large sample is collected, it can be rejected.

      The appeal of the t-test/hypothesis test/significance test/p-value mindset is one of familiar process. Simply consider what happens to these if incoming data are inherently multimodal (as many are). Is the response to be that the experimenter did not do their job correctly if that’s the distribution? Once multimodality is embraced, many notions shatter, from point estimates as useful summaries to confidence intervals; one needs Highest Probability Density intervals instead.

      And the problem of effective computability is gone, with the MCMC revolution.

      Sure, I’m a Bayesian, but Burnham and Anderson have strongly criticized hypothesis testing as inconsistent, at least for model selection, in their information-theoretic approach as well. See their MODEL SELECTION AND MULTIMODEL INFERENCE, 2nd edition, 2002.

      1. That word, principled (often paired with the word consistent), has always mystified me. I’ve asked many Bayesians to explain what they mean by it, and none has even tried to give an answer. What is your answer?

  5. The “Wharton Study” was published:
    Dean P. Foster, Robert A. Stine (2006), Honest Confidence Intervals for the Error Variance in Stepwise Regression, Journal of Economic and Social Measurement, 31, 89 – 102.

    1. Thanks, I hadn’t realized that. But the journal’s price is 27.50 Euros, so it’s probably best that I keep the link to the preprint.

  6. I’d like to address that issue of measurement bias. Not only does it dominate as the sample size becomes large–it also dominates, in a way that cannot be overcome, in analyses that depend on iterative processes. The sensitive dependence on initial conditions (error in measurements and error in representation) in a nonlinear optimization such as any likelihood-based analysis will eventually become apparent. Consequently, the “converged value”, whether it is for fixed or random effects, becomes artifactual in the sense that the representation error will eventually dominate, especially in those analyses where the likelihood function is relatively “flat”. And, so far as I can tell, a Bayesian approach does little to avoid this (I am willing to be proven wrong here). Interval estimates from permutation tests may be more robust. I wish I had the time to pull together work to investigate this, but I have to go hang some asterisks on tables for toxicologists…

  7. Norm, I just stumbled upon your blog via a LinkedIn post and am enjoying the spirited discussion on P-values your writing has generated.

    Personally, I did find the ASA statement on P-values to be rather vague – starting with the very informal definition of what a P-value is supposed to be. In fact, I was a tad disappointed that the ASA didn’t consult the statistical community at large while they were crafting the statement, so they could determine what we are struggling with as practitioners who use (or see others use) P-values in our daily jobs.

    In reading the ASA statement, I wasn’t even sure who the target audience was supposed to be – the statistical community? the research community at large?

    In view of this vagueness, I am not surprised that you concluded that “ASA says no to P-values”. Just like me, you were likely looking for clear guidance on what to do about P-values and realized you had to come up with your own conclusions.

    Of course, ASA did not say NO to p-values – rather it reminded everyone that the p-values are just the tip of the iceberg.

    For a p-value to be correct (in terms of its actual computation), so many other things have to happen: study must be correctly designed, research questions must be correctly defined, data must be correctly collected and analyzed (as well as clean and valid!), etc.

    For a p-value to be correctly interpreted, the principles outlined by the ASA in their statement must all be kept in mind, so one is clear about what the p-value means and what it doesn’t mean.

    One of these principles, which you illustrate in your post, is that “statistical significance is different from practical significance”. As you point out, with a large enough sample size, we can find relationships in the underlying population(s) which we suspect are “real” – but their magnitude may be so “tiny” as to not have any practical relevance.

    What I did like about the ASA statement was the renewed call to not obsess with simply comparing the p-value with a threshold (e.g., 0.05) but realize that there is much more to conducting good research.

  8. Hi Norm! Peter Westfall here. Glad to see you are still pontificating! You were very formative in my training at UCD; your classes had a big impact on me. Good job!

    A couple of comments about all this.

    (1) I worry that people teaching statistics in other areas, who are “less than trained,” let’s say (charitably), will take the ASA statement to mean they can abandon probability altogether. They already do that to a large degree, simply because they don’t understand it. Now they may think they can abandon probability completely when they teach statistics.

    (2) A possible true null case: Relationship between height and last digit of social security number. Is there information content in that last digit? I have a hard time arguing for anything other than the 0.0000000000… . Of course, if you adopt the finite population model it is false, but maybe that’s another rant.

    Cheers, Peter

    1. So, instead of having the usual case of H0 being known a priori to be false, thus making the hypothesis test meaningless, here we have a situation in which H0 is known to be true — again making a test pointless.

  9. I think that all of your arguments are quite persuasive. Perhaps your title (the original one) is a little excessive. I guess that ASA was afraid of going so far as to recommend to absolutely avoid p values; perhaps there were too many forces against such a tough position, as most of the replies to your post show. I think it is hard to admit the worthlessness of what a person has been doing for decades (in his/her papers, his/her teaching, etc.).
    Anyway, it is time that we stopped harping on about the imaginary usefulness of p values. The idea of such a utility seems to be a mantra: if it is repeated often enough it will perhaps become reality.

    The choir singing the praises of p values is rather funny: “p values are not the villain”, “the problem is to use them properly”, “they are problematic, but you have to apply them carefully”. They are funny because nobody can offer a couple of “good, convincing examples in which p-values are useful”, as you say. At the same time, anybody can mention thousands of examples (only googling the last month’s publications) in which p-values are impractical, futile and, simply, a waste of time.

    Certainly, in any practical situation, we know beforehand that the null is false. I have some exceptional examples, however, in which the null can be true. Just consider a clinical trial to assess the difference between a homeopathic remedy and water. If there are no biases at all, I guess that you will not be able to reject H0, no matter how big your sample is. Note: Of course, I do not need p-values to prove that homeopathy is indistinguishable from water (no matter how agitated it could be).
    As a final note, and by the way, I want to mention a problem which is rarely considered (if at all): almost everybody (even those stuck to the dogma saying that p values are very useful) agrees that it is necessary to provide measures of error to accompany published estimates. So, you can find in medicine and epidemiology papers a lot of odds ratio estimates, for example, with their respective confidence intervals (CI). However, most of the time (I would say almost always) these CIs are simply ornamental attachments, since their limits are not even mentioned in the discussion section.

    1. Thanks for your thoughtful remarks.

      Your homeopathic remedy example is subject to the issue of biases in measurement, as you said. I believe there will always be such things at some level. But as you said, it certainly is not an example of a need for p-values.

    2. Luis, you say “I guess that ASA was afraid of going so far as to recommend to absolutely avoid p values; perhaps there were too many forces against such a tough position”, and I have read many similar thoughts in the press and on the internet. As one of the participants of the ASA meeting that drafted the statement I can say that you are wrong. There was little support for discarding P-values, and what support there was came from opinions that other approaches are superior for many purposes. However, the things that make P-values difficult to understand and to use safely are shared by the alternative approaches. Confidence intervals and Bayes factors are harder to explain than P-values; give it a try some time. Not only that, but they exhibit exactly the same statistical model-dependence as P-values, a dependence that is core to their nature but is rarely noted.

      Many of the people carping against P-values are actually affected by a deep-seated misunderstanding. I have previously suggested to Matloff that he should be more careful to note the difference between a hypothesis test that yields a dichotomous decision and a significance test that yields an index of evidence, the P-value. The argument that “we know beforehand that the null is false” is relevant only to the hypothesis test, and has no relevance to the evidential use of a P-value, because the evidence can be applied to any and all values of the parameter of interest. Perhaps you might like to read my primer on the topic: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3419900/

      1. I’ve explained confidence intervals and p-values for many years, to both students and consulting clients, and people ALWAYS find confidence intervals much easier to understand.
        And note that all the election polls are reported as confidence intervals. Gee, I wonder why!
