Statisticians have long known that the use of p-values has major problems. Some of us have long called for reform, weaning the profession away from these troubling beasts. At one point, I was pleased to see Frank Harrell suggest that R should stop computing them.

That is not going to happen, but last year the ASA shocked many people by producing a manifesto stating that p-values are often wrongly used, and in any case used too much. Though the ASA statement did not go so far as to recommend abandoning the methodology altogether, to me it came pretty close. Most significant to me (pardon the pun) was the fact that, though the ASA report said p-values are appropriate in some situations, it did not give any examples. I wrote blog posts on this topic here, here and here. I noted too that the ASA report had even made news in *Bloomberg Businessweek.*

But not much seems to have changed in the profession since then, as was shown rather dramatically last Saturday. The occasion was iidata 2017, a student-run conference on data science at UC Davis. This was the second year the conference had been held, and it was very professionally managed and fun to attend. There were data analysis competitions, talks by people from industry, and several workshops on R, including one on parallel computing in R by UCD Stat PhD student Clark Fitzgerald.

Now, here is how this connects to the topic of p-values. I was a judge on one of the data analysis competitions, and my fellow judges and I were pretty shocked by the first team to present. The team members were clearly bright students, and they gave a very polished, professional talk. Indeed, we awarded them First Prize. However…

The team presented the p-values for various tests, not mentioning any problems regarding the large sample size, 20,000. During the judges’ question period, we asked them to comment on the effect of sample size, but they still missed the point, saying that n = 20,000 is enough to ensure that each cell in their chi-squared test would have enough observations! After quite a lot of prodding, one of them finally said there may be an issue of practical significance vs. statistical significance.

Once again, we cannot blame these bright, energetic students. They were simply doing what they had been taught to do — or, I should say, NOT doing what they had NOT been taught to do, which is to view p-values with care, especially with large n. The blame should instead be placed on the statistics faculty who taught them. The dangers of p-values should have been constantly drilled into them in their coursework, to the point at which a dataset with n = 20,000 should have been a red flag to them.
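To see the issue concretely, here is a minimal sketch in Python (using scipy, with made-up counts of my own, not the students' data) of how a practically trivial difference becomes wildly "significant" at n = 20,000:

```python
from scipy.stats import chi2_contingency

# Hypothetical data: 52% vs. 48% success rates in two groups of 10,000 each.
table = [[5200, 4800],
         [4800, 5200]]

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi-squared = {chi2:.1f}, p = {p:.1e}")

# The difference in proportions is a mere 0.04, yet the p-value is tiny.
diff = 5200 / 10000 - 4800 / 10000
print(f"difference in proportions: {diff:.2f}")
```

With n this large, even a 4-percentage-point difference, arguably negligible in many applications, yields a p-value on the order of 10^-8.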

On this point, I’d like to call readers’ attention to the ASA Symposium on Statistical Inference, to be held in Bethesda, MD on October 11-13, 2017. Under the leadership of ASA Executive Director Ron Wasserstein, we are in the process of putting together what promises to be a highly stimulating and relevant program, with prominent speakers, and most important, lots of interactive participation among attendees. Hopefully much of the discussion will address ways to get serious coverage of the issue into statistics curricula, rather than the present situation, in which stat instructors briefly make a one-time comment to students about practical significance vs. statistical significance.

Perhaps the problem is the null hypothesis rather than the p-value itself?

I prefer not to do tests at all.

What do you rely on then? CIs? Visual Comparisons? Shift functions? other?

The short answer is yes, CIs, but follow the links in my posting to see the details.

Indeed. “A p-value” is often used as a shortcut for “a p-value of a standard hypothesis test of a sharp null hypothesis” (i.e. a null hypothesis having form {p=p0}). And this is quite unfortunate, I think. The problem is not the p-value, it is the sharpness of the null hypothesis.

Summit Suen: Well, there are various problems with the p-value itself, and also with the way p-values are usually reported and sometimes exploited by bad research practices. A null hypothesis may or may not make sense under different circumstances (yes, it would be great to test two hypotheses derived from two distinct theories and see which one fits the data better, but you would need Bayesian priors and "strong" theories to do that).

Testing is not a good way to see which theory fits the data better.

A confidence interval is just another way of looking at the p-value. It carries the same information, just a bit expanded (giving you boundaries instead of one value that rules them all). You are still not testing your hypothesis, just your data under some assumption. You were writing about not testing at all, yet you still use a value near 0 as a criterion. This makes even less sense than using p < 0.05. I have not seen the concepts of effect size (which can be used to deal with the problem of inflated p-values in large samples) or statistical power in your posts. These are the missing links that allow proper use of frequentist inferential statistics (and you can report CIs and p-values as well).
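For instance, one cheap way to report an effect size alongside the p-value of a chi-squared test is Cramér's V. A sketch with a made-up large-n table (scipy assumed):

```python
import math
from scipy.stats import chi2_contingency

# Hypothetical 2x2 table: n = 20,000 with a tiny departure from independence.
table = [[5200, 4800],
         [4800, 5200]]

chi2, p, dof, _ = chi2_contingency(table, correction=False)
n = sum(map(sum, table))
min_dim = 2                      # min(number of rows, number of columns)
cramers_v = math.sqrt(chi2 / (n * (min_dim - 1)))

print(f"p = {p:.1e}, Cramer's V = {cramers_v:.3f}")
```

The p-value is on the order of 10^-8, yet V is only 0.04, a negligible association.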

You seem to be conflating the word “testing” with the phrase “making a decision.” I disapprove of the former but certainly support the latter.

As to checking whether a population value may be near 0, that is something a p-value simply cannot do.
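A quick numerical illustration (my own made-up numbers, using scipy): with a huge sample, a test soundly rejects H0: mu = 0, while the CI shows that the mean, though nonzero, is clearly near 0.

```python
import math
from scipy import stats

n = 1_000_000                     # hypothetical huge sample
xbar, s = 0.003, 1.0              # hypothetical sample mean and SD

se = s / math.sqrt(n)             # standard error = 0.001
z = xbar / se                     # z = 3.0
p = 2 * stats.norm.sf(abs(z))     # two-sided p, about 0.003
ci = (xbar - 1.96 * se, xbar + 1.96 * se)

print(f"p = {p:.4f}")                          # rejects H0: mu = 0
print(f"95% CI: ({ci[0]:.5f}, {ci[1]:.5f})")   # yet mu is clearly tiny
```

The p-value alone says only "not exactly 0"; the interval tells you the value is, for practical purposes, 0.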

Given that your CV is much more impressive than mine, I bet you are right in the first part of your statement (a Bayesian way of thinking :)). Can you please refer me to some articles about "not testing, but deciding"? The second part, about a population value near zero, is a bit more problematic, and I disagree with you here. This is what cannot be done with the frequentist approach, even though theory and practice differ. In theory you cannot do this, but I understand that in practice, with a large enough sample or while doing a meta-analysis, you can probably say that you are pretty sure whether the population parameter is or is not near zero. Still, it is not a logically valid statement, and you can never assess the degree of (un)certainty of this statement. I stopped reading Cumming's New Statistics because of his strong favouritism for CIs and his lack of critique of their weaknesses; this is one of the problems that was not discussed properly in the book (in my opinion). Finishing a master's in psychology, I am really interested in how research practice will change in my lifetime (it is in transition from utter mess to something else right now), and also in whether Bayesian stats will be part of the endeavour (more logically correct does not automatically mean more useful in practice, right?).

For details, I would suggest my open-source textbook on probability and statistics. heather.cs.ucdavis.edu/probstatbook

You and I don’t differ where you think we do. Actually, the difference in the way you and I view things is much more fundamental. You want some kind of Decision Theory, whereas I want statistics to give me estimates of quantities; then I make my decision based on many factors of interest to me, an informal process.

Also, all other things being equal, the smaller the p-value, the smaller the chance of a Type I error. This is something that, in my opinion, is best illustrated by reporting the p-value plus the CI graphically (rather than the CI numerically plus the CI graphically).

For those of us who believe that the concepts of Type I and II errors themselves are irrelevant to the real world, such considerations don’t matter.

— For those of us who believe that the concepts of Type I and II errors themselves are irrelevant to the real world, such considerations don’t matter.

Wow! That deserves more than a comment. Have you, or someone you respect, written a lengthy takedown of Type I and II errors?

But here is what might fit in a comment: if (and, of course, I accept the "if") the p-value/NHST is just another application of math's proof by contradiction, then what's the problem? And if the p-value/NHST isn't such an application, why not?

The problem is that the null hypothesis is in almost all cases meaningless, known to be false a priori.

I’m just starting out in Stats, and it’s definitely been drilled into me throughout undergrad so far that p-values are used in really problematic ways, and to favor confidence intervals/being very careful about making claims re: significance.

But I’m not quite sure I understand why having a large sample size specifically is a problem for the resulting p-values. Is it because the resulting p-value will have an exaggerated statistical significance thanks to such a large n, which could misrepresent its practical significance?

Or is it just generally because, even in cases where there’s a large sample, the type of hypothesis test a p-value represents doesn’t necessarily translate to meaningful conclusions about the population?

Sorry for the basic question!

It would be better for you to follow the links in my posting, but you basically have it right above.

Hi,

I need to take a deeper look at your book to fully grasp what you mean. I agree that the p-value is a problem when used as a "holy" value for everything, but I still think that in certain situations it is relevant, especially for "rejecting."

The first example coming to mind is a test of independence between populations – Fisher's exact test, for example. When you are examining a contingency table and want to reject the null hypothesis of independence between two groups, how would you do that without testing? This has real-world implications, especially in the medical field (for example, the impact of smoking or diabetes on heart attack frequency). I tend not to trust human judgement when it comes to analyzing "raw numbers." (A CI for the odds ratio is also output – which I always report, but as stated above, it's an "extension" of the p-value.)
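To make that concrete, here is a sketch of the scenario with scipy's Fisher exact test on a hypothetical 2x2 table (my own invented counts, not real medical data):

```python
from scipy.stats import fisher_exact

# Hypothetical counts: heart attack vs. none, among smokers and non-smokers.
#                 attack   no attack
table = [[25, 75],        # smokers
         [10, 90]]        # non-smokers

odds_ratio, p = fisher_exact(table)   # two-sided test by default
print(f"odds ratio = {odds_ratio:.1f}, p-value = {p:.4f}")
```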

Thank you for your thoughts.

If you believe that a CI is an extension of a p-value, there is not much I can say to you. 🙂

Yes, you can form a CI for the odds ratio, but I would prefer to simply form CIs on the proportions. A more sophisticated way would be to run a log-linear model and then form CIs on the coefficients (or at least look at their standard errors).
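As a sketch of that suggestion, here is one simple way to form CIs on the two proportions directly, using plain Wald intervals on hypothetical counts of my own (packages such as statsmodels offer better intervals, e.g. Wilson's):

```python
import math

def prop_ci(successes, n, z=1.96):
    """Approximate 95% Wald confidence interval for a binomial proportion."""
    p = successes / n
    half = z * math.sqrt(p * (1 - p) / n)
    return (p - half, p + half)

# Hypothetical counts: heart attacks among 100 smokers and 100 non-smokers.
print(prop_ci(25, 100))   # smokers' rate, roughly (0.165, 0.335)
print(prop_ci(10, 100))   # non-smokers' rate, roughly (0.041, 0.159)
```

The reader sees both rates and their uncertainty directly, in the units of the problem, rather than a single summary number.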

The problem, though, is that most people don’t want to go to this much work. A p-value is convenient and simplistic, thus appealing, sad to say.