(Note: Please see followup post.)

Sadly, the concept of p-values and significance testing forms the very core of statistics. A number of us have been pointing out for decades that p-values are at best underinformative and often misleading. Almost all statisticians agree on this, yet they all continue to use it and, worse, teach it. I recall a few years ago, when Frank Harrell and I suggested that R place less emphasis on p-values in its output, there was solid pushback. One can’t blame the pusherbackers, though, as the use of p-values is so completely entrenched that R would not be serving its users well with such a radical move.

And yet, wonder of wonders, the American Statistical Association has finally taken a position against p-values. I never thought this would happen in my lifetime, or in anyone else’s, for that matter, but I say, Hooray for the ASA!

To illustrate the problem, consider one of the MovieLens data sets, consisting of user ratings of movies. There are 949 users. Here is an analysis in which I regress average rating per user against user age and gender:

```
> head(uu)
userid age gender occup zip avg_rat
1 1 24 0 technician 85711 3.610294
2 2 53 0 other 94043 3.709677
3 3 23 0 writer 32067 2.796296
4 4 24 0 technician 43537 4.333333
5 5 33 0 other 15213 2.874286
6 6 42 0 executive 98101 3.635071
> q <- lm(avg_rat ~ age + gender, data=uu)
> summary(q)
...
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.4725821 0.0482655 71.947 < 2e-16 ***
age 0.0033891 0.0011860 2.858 0.00436 **
gender 0.0002862 0.0318670 0.009 0.99284
...
Multiple R-squared: 0.008615, Adjusted R-squared: 0.006505
```

Woohoo! Double-star significance on age! P-value of only 0.004! Age is a highly-significant predictor of movie ratings! Older people give higher ratings!

Well, no. A 10-year age difference corresponds to only a 0.03 difference in ratings — quite minuscule in light of the fact that ratings take values between 1 and 5.

The problem is that with large samples, significance tests pounce on tiny, unimportant departures from the null hypothesis, in this case H_{0}: β_{age} = 0, and ironically declare this unimportant result “significant.” We have the opposite problem with small samples: The power of the test is low, and we will announce that there is “no significant effect” when in fact we may have too little data to know whether the effect is important.
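Both sides of this can be seen in a minimal simulation. The sketch below is Python rather than the post's R, and the tiny-but-nonzero true effect of 0.01 is invented for illustration: a plain two-sided z test finds nothing at n = 50 yet declares overwhelming significance at n = 1,000,000, even though the departure from the null is equally unimportant in both cases.

```python
import math
import random

def p_value_two_sided(z):
    # Two-sided p-value for a z statistic under a standard normal null.
    return math.erfc(abs(z) / math.sqrt(2))

def z_test_for_mean(xs, mu0=0.0):
    # One-sample z test of H0: mu = mu0, using the sample SD as a plug-in.
    n = len(xs)
    mean = sum(xs) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))
    return p_value_two_sided((mean - mu0) / (sd / math.sqrt(n)))

random.seed(1)
true_effect = 0.01  # a real but practically negligible departure from H0: mu = 0

results = {}
for n in (50, 1000, 1_000_000):
    xs = [random.gauss(true_effect, 1.0) for _ in range(n)]
    results[n] = z_test_for_mean(xs)
    print(n, results[n])
```

The n = 50 run also illustrates the flip side: with so little data, even an effect that mattered would usually be reported as "no significant effect."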

In addition, there is the hypocrisy aspect. Almost no null hypotheses are true in the real world, so performing a significance test on them is absurd and bizarre.

Speaking of hypocrisy: As noted above, instructors of statistics courses all know of the above problems, and yet teach testing anyway, with little or (likely) no warning about this dangerous method. Those instructors also do testing in their own work.

My hat is off to ASA for finally taking some action.

I’d love to see a good follow-up post on alternatives!

Good point. I’ll post something this evening, or tomorrow at the latest.

Thanks for your specific example on the use of p-values. Much of the discussion that I have seen focuses on experiments (e.g., in social psychology). So it is good to have a clearer idea of what target the anti-p-value movement is aiming at in observational social science.

However, I don’t see this example as a case for not looking at / learning from / making inferences from p-values. If our research question was — do older people give higher ratings? — then our answer would be “yes, on average by a tiny amount”. On the other hand, if we were interested in differences between men and women, then the large p-value would allow us to say “no, we don’t see any evidence for sex differences”.

There is lots of other information in the regression. For example, age and sex combined still barely predict anything (see the R-squared). The size of the coefficients is also there — and in the post you do a good job of explaining the substantive insignificance of the age coefficient. But I would argue that the mainstream analyst uses all of these pieces of information _and_ p-values to reach conclusions.

A counter-example is useful. Imagine that the coefficient on sex had been substantively large (e.g. 1 whole point on the 5 point scale) but had a p-value of 0.3. Then all of us would be quite upset with an analyst who told us that sex was really a big deal. We would say “it’s not distinguishable from zero”. We would not ignore the p-value.

In short, the ASA statement still seems to me much ado about very little. But I am open to being wrong!

You seem to be conflating p-values with the need to assess sampling error. The latter is absolutely crucial, but the best way to address it is to report standard errors and, if desired, confidence intervals. (The latter NOT being used in a “Does the interval contain 0?” manner.)

Even aside from the issue of p-values being potentially misleading: Once one has a point estimate and a standard error and/or CI, p-values add no new information. So why use them?
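To make the “no new information” point concrete, here is a sketch (Python, using the large-sample normal approximation rather than the exact t distribution) showing that the p-value is just a deterministic function of the point estimate and its standard error; the inputs are the age coefficient and standard error from the regression output above.

```python
import math

def ci_from_estimate_and_se(estimate, se, z_crit=1.96):
    # Approximate 95% confidence interval: conveys both size and precision.
    return (estimate - z_crit * se, estimate + z_crit * se)

def p_from_estimate_and_se(estimate, se, null_value=0.0):
    # Two-sided p-value recovered from the same two numbers -- nothing
    # beyond the estimate and SE goes into it.
    z = (estimate - null_value) / se
    return math.erfc(abs(z) / math.sqrt(2))

est, se = 0.0033891, 0.0011860  # age coefficient from the lm() output
print(ci_from_estimate_and_se(est, se))
print(p_from_estimate_and_se(est, se))  # roughly 0.004, as in the Pr(>|t|) column
```

The interval tells you the plausible effect sizes directly; the p-value collapses the same two inputs into a single number about an uninteresting point null.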

“In short, the ASA statement still seems to me much ado about very little. But I am open to being wrong!”

I think the concern is rooted in a real problem. Presumably the ASA is concerned about recent, high-profile replication failures in psychology (source: http://www.smithsonianmag.com/science-nature/scientists-replicated-100-psychology-studies-and-fewer-half-got-same-results-180956426/?no-ist) and behavioral/experimental economics (http://www.sciencemag.org/news/2016/03/about-40-economics-experiments-fail-replication-survey). Replication rates in the 40-60% range are pretty stunningly poor, and folks concerned about this issue have worked to identify the major sources of poor replication rates; misuse of p-values is one commonly targeted source of error.

Yes. In my next posting, I will speculate on what drove the ASA to make its bold statement, part of the reason being the replication issue.

Failure to replicate has some ties to p-value issues (e.g., p-hacking), but the failure to emphasize, or even recognize, the need for replication is a much bigger, more important issue than p-values.

A bit over 50 years ago, Cohen warned psychological researchers that much of their research lacked power (average power was ~0.4) and would fail to replicate, assuming that rejection was the correct decision. Cohen’s methods were used again in the mid-1980s, and those researchers found no change; power still hovered around 0.4. I have been told that another replication of this work took place in the mid/late 1990s and found much the same thing, although I have not seen the report myself.

The bottom line is even assuming all of the research is properly done, and rejection is the correct decision, we would expect a replication rate not much higher than 50%.

But it gets worse: low power has been joined by a host of bad research practices, all seeming to fall under the umbrella of “p-hacking,” that result in badly inflated false rejection rates. Unfortunately, the current political climate that dismisses replication as uninteresting and/or unimportant amplifies the problems of p-hacking by effectively shutting down science’s self-correction process.

Replication is so central to good science that Fisher warned researchers that the results of a single study were largely meaningless. It is only when you can replicate your results reliably that you could claim to have learned something.

Yes, absolutely. This comment is spot on. The regression is not a good example at all of why p-values are “bad.” In economics, we never just look at statistical significance but also economic (or maybe in this example, practical) significance. Moreover, in some settings, even small (but statistically significant) estimated coefficients may have large effects in aggregate in a population. There are just so many things to consider.

I’m not so sure that the economics profession as a whole is so careful.

And yes, the fact that “there are so many things to consider” is the point the ASA and I are making.

It’s worth noting that one of the recent moderate-to-large-scale replication studies was done in economics; the successful replication rate was 60% (although on a non-randomly selected set of just 18 studies). 60% is better than the 40% found for a comparable 100-study effort in psychology, but it’s still pretty bad, especially considering this is in experimental/behavioral economics; in areas where experiments are difficult to impossible to carry out (most of macroeconomics), it’s safe to assume that effects are even less likely to be consistently detected by standard methodologies.

You seem to have misread the ASA statement. They did not reject p-values, merely the misuse and general overemphasis of p-values. p-values are fine, but can’t be interpreted in isolation.

Yes, they did not go so far as to recommend an outright ban on the usage of p-values. But subject to that, their wording is really quite strong.

Indeed, in a statement sent to ASA members, they said, “This is the first time the ASA has spoken so publicly about a fundamental part of statistical theory and practice.” Their action is truly momentous.

I think that your example does not reveal any problem with p-values or sig. testing. It really shows that stat. sig. and practical sig. matter equally. Thus we shouldn’t merely rely on stat. sig.

I won’t comment on the Bayesian approach (not my cup of tea), but as I said in reply to several other reader comments: Once you have a point estimate and a standard error and/or confidence interval, a p-value is adding no further information. So why use it?

You are absolutely right. I would go further: forget about p-values; use confidence intervals instead. They give more information. But my point was that your problem description refers to statistical vs. practical significance.

Anyway, I love your book (From Algorithms to Z-Scores) because you are one of the few statisticians who do not repeat the same stuff — e.g., ch. 17.3.2, where you talk about confidence intervals. The Art of Programming is also wonderful, so I’m looking forward to your new book.

Thanks very much for your support!

I’m afraid prescribing SE/CIs as the cure for “p-value mania” will prove what is already known mathematically: it is a tautology. SE/CIs suffer from the same problem vis-à-vis small versus large samples. The “Taylorization” of journal editing will create a new religion around the “inclusion [or not] of zero in the interval,” and we’ll be back to square one…

What we need is a new theoretical setting which allows us to compute the probability, likelihood, plausibility — or whatever new term may be coined — of the alternative hypothesis.

The real problem is that people want statistics to make their decisions for them.

I read the ASA’s statement and agree with Clark. It makes no sense to ban a method which has been in use by many, many scientists and statisticians for decades. What do we do with all the research that has been based on it? Say it is all invalid? Throw all conclusions out the window and start afresh? Should I burn all my textbooks which instruct its use — some of them written by prominent statisticians and mathematicians? Nor do I know of any reservations, besides the usual ones (see below), accompanying their presentations.

Continuing in this vein, can it be that only now we have seen the light? Sure, there has been a lot of lively discussion about the meaning and use of p-values. But many capable people have continued to use them. Surely they have some use and, in fact, rather than intimate anything like a ban, the ASA’s statement seems to take issue with the unconditioned use of p-values; suggesting that their use be couched in further evidence and contrasted with other methods.

For those who haven’t seen it, the ASA’s conclusion, hardly radical and one which I think is very reasonable, is:

“Good statistical practice, as an essential component of good scientific practice, emphasizes principles of good study design and conduct, a variety of numerical and graphical summaries of data, understanding of the phenomenon under study, interpretation of results in context, complete reporting and proper logical and quantitative understanding of what data summaries mean. No single index should substitute for scientific reasoning.”

This is hardly doctrinaire, and it’s a good thing too. It suggests common sense: don’t put all your eggs in one basket; take a look from different perspectives; make sure you know what you’re looking at; be complete and employ proper scientific reasoning, etc. One thing to note is the emphasis on numerical and graphical summaries, i.e., on descriptive statistics. Making inferences is always going out on a limb to various degrees, which I think the statement also recognizes.

I think a “ban” on the use of p-values would be like banning the aether theory in physics or Lysenkoism. Of course, the American Physical Society or the Royal Society could ban what it regards as false theories but what would be the point? And things may change. In the case of the former, it is amazing how there seems to be a sort of resurrection of the concept. In the case of the latter, its persistent use simply reveals the conditioning of scientific conclusions by an irrelevant ideology; something pretty obvious nowadays and avoided even by the mediocre.

I’m not a statistician myself but it is amusing to see how some become almost like members of the Committee of Public Safety; wishing to root out and condemn all deviation from their viewpoint (such as the Bayesian viewpoint). There’s nothing wrong with Bayesianism, it has its merits but it is not free from criticism. Ditto other alternatives to using p-values.

I certainly would not BAN p-values; I simply urge people not to use them.

Concerning your point that p-values have been used by so many people for so long, and “thus” must be a good thing, I really like the opening paragraphs of the ASA article, which quote George Cobb pointing out that all that extensive usage of p-values has created a vicious cycle: the more they are used, the more people feel forced to join in.

Your point about excellent mathematicians writing books that use significance testing is very interesting to me, because I feel that they are the root of the problem — mathematicians living in the imaginary pristine worlds of massless, frictionless string. Very elegant stuff, to be sure, but much of it fails in the real world.

I certainly am not advocating ignoring sampling variability. See my replies to other reader comments here regarding standard errors and the like. I’ll write more on that in my Part II posting, hopefully tomorrow.

You opined: “Almost no null hypotheses are true in the real world.” Perhaps in your world. In product and food development, it is quite common to see experimental products where the null (delta <= 0) is true when compared to an existing commercial implementation. Products that look good on the bench suck in the field. Most commercial implementations have already been optimized, so beating them is quite difficult.

Give a specific example, and I will try to convince you that the null hypothesis is false there too.

Consider a product such as diapers, and a new elastic system. The new elastic system works quite well on the bench. When it’s tested in the field, it fails rather messily. The null would be that the new system is no better than the existing product, while the alternative is that it would fail (messily) less. (Notice the dividing hypothesis, not a point null. In my experience, point nulls are uncommon, mostly an academic curiosity.)

If you consider the published literature on product development, failure rates typically run 80%-90% at a project level, and are much higher at a test level. Products have multiple objectives, and a new formulation has to beat most if not all of them to survive.

This is different from the academic model, where the publication is the endpoint and the results didn’t require much, if any, predictive validity. (E.g., the MovieLens data: fitting a linear (!!!) model to a bounded, asymmetric, poorly measured response. Who would do that outside of academics?)

Actually, most people would indeed fit a linear model, and in most cases it would be good enough.

In thinking about it, the difference may be that academics get to choose their population, their null, and the publishing target, so Fisher’s methods, developed for the real-world constraints of agronomy, are easily misused. I didn’t get those choices: the population, the null, and the target were all externally specified. This is similar to the difference between Fantasy Football and the NFL.

Exactly.

WRT the linear model: a linear model can be, and often is, fit to anything that can be coerced onto the real line, but this is similar to fitting a linear model to logistic data. The predictions go out of range, the lack of fit (LOF) is high (there is a p-value for that, too!), and the parameters have little physical meaning (e.g., what does the average of a 5-point rating mean? You can’t physically pool the ratings, and they aren’t divisible the way a bushel of barley is). The responses are heteroskedastic and asymmetric, with the asymmetry varying with the mean. If I were trying to predict choices or was responsible for the outcomes, I’d use something else.

My point is that this is not a p-value problem, it is an analysis problem.

If the linear model is correct, the parameters definitely have meaning. I gave such a meaning in my example: a 10-year age difference corresponds to a mean rating difference of 0.03. That is true in spite of the heteroscedasticity and skewness.

Of course, just as null hypotheses are rarely if ever exactly correct, the same is true for linear models, and my term “correct” in the first paragraph here is meant in the approximate sense.

Your article – especially your title – is such a misrepresentation of what the article actually said that I wonder at your reason for doing so. Given that what they did say was quite strong, why would you have any need to so misrepresent it at all?

I really disagree. Granted, I could have phrased the title as “The ASA Says [Mostly] No to P-values,” but I think both my title and my comments are basically sound. The MovieLens example in particular reflects the contents of the article quite well, I think even you would agree.

What I have in my posting that is not in the ASA article — and I hope my phrasing wasn’t taken by some to mean it was part of that article — is to note that the null hypothesis is nearly always false a priori. That means the test is assessing something that is irrelevant to the analyst’s goals. Indeed, it’s too bad that the ASA piece didn’t offer a specific example of an application in which testing might be useful; if they had tried, they would have run into exactly this problem.

Merely pointing up some well-worn “no-nos” hardly counts as opposing them. The ASA might have wanted that, but a few sober participants held sway. I was a philosophical observer; it wasn’t intended as a strong statement against them. Unfortunately, we will now see the same QRPs of cherry-picking and selection bias, with priors to boot, but with no grounds for penalties, and no clear grounds to even determine whether an effect has been replicated (it’s all a matter of your prior in a Bayes factor). And absolutely no testing of model assumptions — no can do without significance-test reasoning. A few bad apples, incapable of avoiding cheating, will make it subjective for everyone. Of course, the secret practice of statistical significance will continue in secret enclaves, as on the isle of frequentists, Elba. The Bayesians will buy our results to reconstruct Bayesianly so they don’t look too bad, and then chalk it up to another Bayesian success story. Here are my invited comments: http://errorstatistics.com/2016/03/07/dont-throw-out-the-error-control-baby-with-the-bad-statistics-bathwater/

I’ll join you on Elba, Deborah, but I will have to live on the opposite side of the island. 🙂

“I’ll join you on Elba,…” Were you Able before you saw the island? 🙂

You’ll miss the powwows, and for what? An utterly arbitrary distinction.

I’ll commute. 🙂

I’d want you nearby because of your special abilities. Look, whatever you don’t like about tests can be reformulated in terms of estimation, only better.

See if you can get this link (about the consequences of only using CIs):

file:///Users/deborahmayo/Downloads/Feinstein%201988%20Scientific%20standards%20in%20epidemiologic%20studies%20of%20the%20menace%20of%20daily%20life.pdf

But, Deborah, that so-called equivalence is exactly the kind of theoretical analysis that I believe has led the profession so far astray.

I’m not sure how you intend me to get that link.

But, Norm, how did this mathematical duality lead the profession so far astray? What’s it even got to do with it?

An example is one already mentioned: checking “whether the CI contains 0.” This reduces the CI to a hypothesis test, and thus defeats the purpose of the CI, tossing out the additional information it provides. In most settings, a narrow CI near 0 has the same practical meaning as one that contains 0, but the math stat people treat them as radically different. As I’ve said, the math can be quite elegant, and my own background is in pure, highly abstract math, but the theory just isn’t consistent with the real world.

But I never advocated such a use of CIs. You seem to be arguing that in order to appreciate the duality of a CORRECT use of tests and CIs, you’re led to an INCORRECT use of CIs.

I’m bringing this conversation over to my blog. You’ll have to take the ferry, or move over here where you belong.

Aren’t you at least going to give me a token for the ferry? 🙂

I should hasten to point out that I never said (or thought) that you check CIs for containing 0. Actually, I was referring to mathematicians, not philosophers. 🙂

By the way, in light of your Frequentists in Exile theme (one of the cleverest names I’ve ever seen in academia), I contend that one of the major (though not necessarily conscious) appeals that Bayesian methods have for followers of the philosophy is the mathematical elegance.

I’m giving you 6 months of ferry tokens, and thanx for the compliment on “frequentists in exile”. You’ll have to come over and talk to me on Elba to hear my reply to your last comment.

http://errorstatistics.com/2016/03/07/dont-throw-out-the-error-control-baby-with-the-bad-statistics-bathwater/#comment-139603

Thanks for pointing to the ASA statement. Always glad to hear loud criticism of the overweighting of p-values. While mostly agreeing with what you said, I don’t think the “the null is always wrong” point holds. My background is medicine. A typical medical question might be “Is sport good or bad for people with heart failure?”. Obviously the null is wrong, as always. But in which direction? A p < 0.05 does not only say that sport makes a difference; it means that the direction of the true effect has the same sign as my sample’s. There is no advantage over a confidence interval, but it is not pointless.

Second: as wrong as p-value testing traditions may be, while we keep doing classic NHST, the small group of Bayes advocates has time to develop standards to adhere to.

I would submit that the “which direction” example really has the same problems I described.

Actually, your title is seriously misleading. They did not say “no”. They said “Don’t overinterpret”. You should modify your title.

No way, Paul! See my followup post today.

P-values were criticized before there were p-values; introduced in the mid-to-late 1920s, they have yet to be with us for 100 years, never mind the 150 in the headline. And if you look at articles in The American Statistician, it certainly does not appear that the ASA is bothered by p-values per se.

The firestorm of criticism can be linked to Cohen’s 1994 “The Earth Is Round (p < .05).”

What the ASA statement addresses is what I should do if p < .05. Students are taught that at this point you absolutely reject slope = 1 and declare that you have found some greater truth. The ASA says, and I agree, that p < .05 should be cause to look/think a bit more deeply about your results. If the slope is not precisely 1, and I know it isn’t, is it far enough away that I should be concerned? If so, I want to look more deeply at my data and procedures to determine, to the extent that I can from a single sample, what is leading to these seemingly anomalous results.

P-values are only one piece of the research puzzle, and a rather small piece at that. But p-values per se are not a problem. Rampant misunderstanding and misuse of p-values is a problem. Failure to contextualize p-values in a larger research context is a problem.

So, in your view, why are p-values of any use at all?

Because they tell you whether you have any reason to consider the possibility that your hypothesized model is wrong.

Think about it as a goodness-of-fit problem. Step 1: is there any reason to believe my model fails to fit the data? If p<.05, then yes there is reason to believe the hypothesized model fails to fit the data and we go on to step 2, assessing and interpreting the misfit.

Depending on the details of what is being studied, we may determine that the small p-value (evidence of misfit) means nothing more than that we should look again, i.e., we should replicate the research to see if the p-value is small (the misfit is detectable) another time. Or, perhaps, having looked at the observed effect size, we may determine that the degree of misfit is not concerning and no further action is required.

With replication we gain certainty that the model fails to fit or that our original study was a fluke. If the model fails to fit repeatedly by an amount anyone cares about then the model needs revision or replacement.

The problem is that Neyman & Pearson told us that after finding p<.05 we make an irrevocable decision about the truth of the world. Large swaths of researchers have bought into this flawed reasoning leading to a variety of bad practices that are justifiably criticized.

We already know that the hypothesized model is wrong.

Agreed, but it then follows that all the models are wrong, as well as their likelihoods, parameters, CIs, and posteriors. They are handy approximations which may be used to make predictions or choose actions. The p-value is a useful index of how good or bad an approximation to my data a specific working model (including the null) is.

PL Davies has an interesting discussion of this at Mayo’s blog. (http://errorstatistics.com/2016/03/19/your-chance-to-continue-the-due-to-chance-discussion-in-roomier-quarters/). He also has several papers and a book out on models as approximations.

Yes, all models are wrong. That was my point. Accordingly, I am not a fan of likelihood-based estimation and inference.

As you say, though (and George Box famously pointed out), models can be useful approximations, and CIs can work well in that light. But NO, the p-value is NOT a good measure of the closeness of that approximation.

Could you expand on that please? Suppose I have a sample, and the randomization p-value of a statistic for a model is, say, the most extreme permutation. How does that not tell me the model is a bad approximation to the observed statistic?

The p-value will go to 0 as the sample size goes to infinity — even though the degree of approximation stays the same.
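A minimal sketch of that behavior (Python; the coin that comes up heads 51% of the time is a made-up example): the fair-coin model is off by the same fixed amount at every sample size, yet the chi-square goodness-of-fit p-value marches to 0 as n grows.

```python
import math

def chi2_gof_p_fair_coin(n_heads, n):
    # Chi-square goodness-of-fit p-value (1 df) for the "fair coin" model.
    expected = n / 2
    stat = ((n_heads - expected) ** 2 + ((n - n_heads) - expected) ** 2) / expected
    # Survival function of chi-square with 1 df: P(X > x) = erfc(sqrt(x/2)).
    return math.erfc(math.sqrt(stat / 2))

# Hold the degree of misfit fixed: observed heads are always 51% of tosses.
ps = {n: chi2_gof_p_fair_coin(int(0.51 * n), n) for n in (100, 10_000, 1_000_000)}
for n, p in sorted(ps.items()):
    print(n, p)
```

Same model, same 1% misfit throughout; only the sample size changed, and with it the p-value.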

As it should. However my sample is fixed, as is my observed data, and I am trying to summarize that data, not some hypothetical data with a sliding sample size and a boat-load of other assumptions.

How do I summarize the adequacy of a simple equality model that isn’t one-to-one with a randomization p-value (for fixed sample size and the data at hand)? For instance, take a paired comparison with n=15 and a binary response, a first in-vivo trial. As before, I have a real external control. If I decide that there is a difference between the treatment and control, the follow-on will be further, more intense experiments. If not, it’s back to the bench.

The problem is that you don’t know what a small (or large) p-value means for your given sample size. Of course, you could try to remedy that with various power calculations. But that might be difficult in the goodness-of-fit assessment situation you brought up, and in any case, finding confidence intervals is much easier.

Do I know why? Not from the p-value nor from the confidence interval. The p-value simply indicates something is unusual when compared to the control product. Since the testers are specially trained, the CI is not generalizable to any population of interest, hence the blocked designs. It’s up to me and/or the scientists involved to figure out why the difference occurred (or didn’t). The p-value was never the stopping point, even for release tests.

In the case I am alluding to, we would run hundreds of those per year, so I had some experience with them. Most first-of-a-kind changes are failures. (Over 80% of new “innovations” fail somewhere in the development cycle. It’s hard to improve on an already optimized product.)

For one-offs (e.g., proteomics surveys), the scientists involved would be looking at pathways to see which ones had clusters of low p-values, which suggests products that might be of interest. Again, the particular counts were not that interesting in and of themselves, as the samples were convenience samples.

But you can see the problems in such a screening process. I don’t know your situation, but presumably different products correspond to different powers, i.e. different Type II error probabilities, in which case simply taking the k products with lowest p-values may be misleading. That also leads to issues of what cutoff values to use.

I appreciate the fact that you have a problem which is very messy and very large, for which you (or someone else) has come up with ad hoc methods. But there’s got to be a better way.

Yes it is messy. The wonder is that large CPG firms are continually upgrading performance and changing feature mixes, year by year, decade after decade. The process has many stages and loops, so that “losers” get washed out and good ideas get another implementation, if not now then in several years when the tech has improved.

The ASA doesn’t exactly say “no to p-values”. They warn about the blatant misuse and misinterpretation of p-values, especially in the social sciences, and invite people to be aware of this powerful tool. Much like saying that one has to know the rules before playing a game.

Exactly. The problem is not the use of the p-value, but the fact that 90% of the people using it do not really understand what they are doing.

Just a couple of comments:

1) In statistics, null hypotheses are NEVER true. You either “reject the null hypothesis” or “do not reject the null hypothesis”, but you never “accept the null hypothesis”. Of course, “not rejecting” the null hypothesis does not in any way mean that the null hypothesis is true. But this does not mean that testing is not useful.

2) A variable might be really significant as a predictor of others, but with a really small contribution. There is nothing wrong with that. Being “statistically significant” does not necessarily mean “having a big impact on the model”.

3) It is true that the p-value gives less information than a confidence interval. However, confidence intervals do not provide valid criteria when the problem requires a “YES-NO” answer (this is the case when we test which variables have to be included in a model and which do not). Always using p < 0.05 might be stupid, but sometimes the problem requires a threshold-type result. The p-value is useful because it tells you, in case you reject the null hypothesis, the probability of being wrong when you make the decision.

So, I agree with those that interpret the ASA document as a “a bunch of researchers that are doing statistics don’t know what they are doing, and therefore most of their scientific conclusions are not valid” message. But this does not mean p-value is useless.

You say, “The p-value is useful because it tells you, in case you reject the null hypothesis, the probability of being wrong when you make the decision.” Actually, that is incorrect, and it is one of the very misinterpretations the ASA statement warned against.

Concerning your point that in many statistical applications one must make a “YES-NO” decision: that is correct, but it does NOT imply that you should let the p-value make that decision for you.
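A small simulation makes the distinction concrete (sketched here in Python rather than R, purely for brevity; the 90/10 mix of true and false nulls, the effect size, and the sample size are all invented for illustration). Even when every test uses the p < 0.05 rule, the fraction of rejections that are wrong depends on how often the null is actually true and on the power of the test, and it can sit far above 5%:

```python
import math
import random

random.seed(1)

def p_value(sample):
    # Two-sided z-test of H0: mu = 0, with known sd = 1
    n = len(sample)
    z = (sum(sample) / n) * math.sqrt(n)
    return math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))

rejections = false_rejections = 0
for _ in range(5000):
    null_true = random.random() < 0.9   # 90% of effects are truly nil
    mu = 0.0 if null_true else 0.3      # the rest have a modest real effect
    sample = [random.gauss(mu, 1) for _ in range(50)]
    if p_value(sample) < 0.05:
        rejections += 1
        false_rejections += null_true

# Typically around 0.45 under these invented settings, far above 0.05:
print(false_rejections / rejections)
```

So “I rejected at p < 0.05, hence I have a 5% chance of being wrong” does not follow; the 5% is a statement about the long-run behavior of the procedure when the null holds, not about the probability that this particular rejection is mistaken.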

The point you’re making is not about p-values, it is about effect sizes. It is trivial that a trivial difference can be real (non-random, not due to sampling error). Trivial differences (or, more generally, effect sizes) are *usually* not interesting. But random differences are not worth mentioning at all, even if numerically large, and this is what p-values, confidence intervals, etc. are about. P-values are not perfect at telling random from non-random effects, but you have not touched this issue in your post. I think the ASA should warn people against slogans like “ban p-values” or “use p-values everywhere!”

You seem to have misinterpreted my remarks as meaning that I say people shouldn’t worry about random differences. This is not my view, and I hope my second posting clarified that.

I think my only slogan was “Confidence intervals provide more and better information than do p-values, so the latter have no value.”

Norm, it’s been 40 years since that Intro to Math Stats class where you taught me never to trust p-values, because the assumption behind them (the one the ASA is secretly warning about), that the null model actually holds, can be shown to be provably false. And for 40 years in the biopharm world, I’ve felt like an outcast for saying things like “Just what do all these superscripts and asterisks have to do with anything?”

Maybe now….

Thanks, Steve. Well, according to Professor Mayo, all outcasts should gather on Elba. 🙂

1) I don’t see where the ASA statement says that. If p < 0.05 and you reject the null hypothesis, you’ve got a 5% probability of being wrong. “Being wrong” means, in this case, “the null hypothesis is true,” and, given that the null hypothesis is true, you’ve got a 5% probability of rejecting it, i.e., of making a mistake. (Needless to say, this holds only assuming independent sampling and other key modeling aspects.)

Now, if p = 0.01, you may say that you reject the null hypothesis using 0.01 as the threshold, which means that you reject it with a 1% probability of making a mistake.

The ASA statement, in its 2nd point, says that the conclusion explained above holds only with respect to a hypothetical model, and under random sampling. Therefore, if researchers do not include further information about the modeling, and about how the sample was obtained, the p-value loses its interpretation as a probability. This is why the ASA statement says more analysis is required.

But the interpretation is correct once one grants the modeling and sampling assumptions.

2) Of course NOT. In some cases, you might feel comfortable (depending on the application, the importance of the decision, or other circumstances) using 0.1 as the threshold; in others, the decisions imply such important actions (banning tobacco, …) that you might prefer 0.001. In both cases, further analyses have to be done before you make a decision.

Your title seems to provide a gross misrepresentation of the ASA article. They argue against using a p-value as the sole measure for evaluating evidence and focus on correcting misuse of the p-value. Some useful quotes from the article: “While the p-value can be a useful statistical measure, it is commonly misused and misinterpreted… the American Statistical Association (ASA) believes that the scientific community could benefit from a formal statement clarifying several widely agreed upon principles underlying the proper use and interpretation of the p-value… No single index should substitute for scientific reasoning.” Your title implies to the reader that the ASA has thrown out the p-value as a valid statistical measure, which they have not.

Well, read my subsequent posts. I noted that the ASA article did not show a single example in which “the p-value can be a useful statistical measure,” which I find very telling.

“This reduces the CI to a hypothesis test, and thus defeats the purpose of the CI, tossing out the additional information it provides.”

Since both the CI and the p-value are merely (re-)expressions of the sample variance, where does this “additional information” in the CI come from?

Please see my election poll example.
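(Not the election poll example itself; just a toy illustration of the same point, with invented poll counts, sketched in Python. Two polls can yield the same p-value against H0: true proportion = 0.5 while their confidence intervals say very different things about the size of the lead, which is exactly the extra information the CI carries.)

```python
import math

def poll_summary(yes, n):
    """Two-sided z-test of H0: true proportion = 0.5, plus a 95% Wald CI."""
    phat = yes / n
    z = (phat - 0.5) / math.sqrt(0.25 / n)           # standard error under H0
    p = math.erfc(abs(z) / math.sqrt(2))             # 2 * (1 - Phi(|z|))
    half = 1.96 * math.sqrt(phat * (1 - phat) / n)   # estimated standard error
    return p, (phat - half, phat + half)

p1, ci1 = poll_summary(60, 100)       # small poll, 60% support
p2, ci2 = poll_summary(5100, 10000)   # big poll, 51% support

# Both polls give z = 2.0, hence the same p-value, about 0.0455 ...
print(round(p1, 4), round(p2, 4))
# ... but the CIs differ sharply: roughly (0.50, 0.70) vs. (0.50, 0.52).
print(ci1, ci2)
```

The yes/no verdict is identical, yet the first CI is consistent with anything from a dead heat to a landslide, while the second pins the lead down as tiny. A p-value alone cannot distinguish the two situations.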

Hey, hey, be careful. The ASA did not say that. Please read the manuscript again!

Actually, I believe this indeed is the point ASA was making. See my followup post for details.

Hello everyone! Just a lowly econometrics graduate student’s take on the whole mess…

I am studying to be an econometrician, so this argument has a lot of relevance to me. We care about differences in populations that are both *relevant* AND *significant*. A statistically significant difference of $1.38 in the selling price of mansions over 12 million dollars, although some “statistician types” would worship it as an absolute truism about the world, just isn’t worth the time. We can all agree on this, which was the point of the initial movie example, and it leads me to the concept of practical significance vs. statistical significance.

This seems to come down to a confusion between “statistical significance” and “practical significance.” I have taken many statistics courses in which the distinction was never mentioned, or was treated as a subjective afterthought next to the “real” value of hypothesis tests, but it is drilled into our econometrics courses starting with Econometrics 1 in undergrad. A “statistically significant difference” means nothing if the difference is not practically significant.

Generally, practical significance means at least half a standard deviation of difference between two populations. So, if the standard deviation of those movie ratings was 1 point, then to call two populations “practically different” one would have to have a mean rating at least 0.5 points greater than the other’s. This is just a rule of thumb, and if you’re doing a real study you should have at least a vague idea of what a practically significant difference would be. A 0.003% raise in your salary, for instance, is not a practically significant amount that will change your standard of living, but it may well be statistically significant depending on the amount and type of data.

When it comes down to it, practically significant differences are not always easy to find, but statistically significant ones are (if you have a big enough data set and a lenient enough threshold). It’s all about using the tools in an appropriate, logical way, not just “tank-and-spank”ing your way through, randomly declaring significance based on a program’s output.
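To put numbers on that gap (invented numbers, sketched in Python): with a huge sample, a trivial mean difference sails past p < 0.05 while falling far short of the half-standard-deviation rule of thumb.

```python
import math

# Two groups whose sample means differ by a trivial amount.
n = 100_000          # observations per group
diff = 0.02          # observed difference in means
sd = 1.0             # common standard deviation

# Two-sample z statistic and two-sided p-value
se = math.sqrt(sd**2 / n + sd**2 / n)
z = diff / se
p = math.erfc(abs(z) / math.sqrt(2))   # 2 * (1 - Phi(|z|))

# Standardized effect size (Cohen's d)
cohens_d = diff / sd

print(f"p = {p:.2e}")        # about 8e-06: "highly significant"
print(f"d = {cohens_d}")     # 0.02: nowhere near the 0.5 rule of thumb
```

Sheer sample size manufactures the tiny p-value; the effect size is what tells you the difference doesn’t matter.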

Thanks for reading!

RedFern

See my followup post.