Statisticians have long known that the use of p-values has major problems. Some of us have long called for reform, weaning the profession away from these troubling beasts. At one point, I was pleased to see Frank Harrell suggest that R should stop computing them.

That is not going to happen, but last year the ASA shocked many people by producing a manifesto stating that p-values are often wrongly used, and in any case used too much. Though the ASA statement did not go so far as to recommend abandoning the methodology altogether, to me it came pretty close. Most significant to me (pardon the pun) was the fact that, though the ASA report said p-values are appropriate in some situations, it did not state any examples. I wrote blog posts on this topic here, here and here. I noted too that the ASA report had even made news in *Bloomberg Businessweek.*

But not much seems to have changed in the profession since then, as was shown rather dramatically last Saturday. The occasion was iidata 2017, a student-run conference on data science at UC Davis. This was the second year the conference has been held; it was very professionally managed and fun to attend. There were data analysis competitions, talks by people from industry, and several workshops on R, including one on parallel computing in R by UCD Stat PhD student Clark Fitzgerald.

Now, here is how this connects to the topic of p-values. I was a judge on one of the data analysis competitions, and my fellow judges and I were pretty shocked by the first team to present. The team members were clearly bright students, and they gave a very polished, professional talk. Indeed, we awarded them First Prize. However…

The team presented the p-values for various tests, not mentioning any problems regarding the large sample size, 20,000. During the judges’ question period, we asked them to comment on the effect of sample size, but they still missed the point, saying that n = 20,000 is enough to ensure that each cell in their chi-squared test would have enough observations! After quite a lot of prodding, one of them finally said there may be an issue of practical significance vs. statistical significance.

Once again, we cannot blame these bright, energetic students. They were simply doing what they had been taught to do — or, I should say, NOT doing what they had NOT been taught to do, which is to view p-values with care, especially with large n. The blame should instead be placed on the statistics faculty who taught them. The dangers of p-values should have been constantly drilled into them in their coursework, to the point at which a dataset with n = 20,000 should have been a red flag to them.
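To make the large-n point concrete, here is a small numerical sketch (the counts are hypothetical, not from the students' dataset): with 20,000 observations split into two groups, even a practically negligible difference in proportions, 51% vs. 49%, yields a "highly significant" chi-squared p-value.

```python
import math

def chi2_2x2(a, b, c, d):
    """Pearson chi-squared statistic and p-value (df = 1) for the 2x2 table
    [[a, b], [c, d]], using only the standard library."""
    n = a + b + c + d
    # closed-form Pearson statistic for a 2x2 table
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # for 1 degree of freedom, the chi-squared survival function is erfc(sqrt(x/2))
    p = math.erfc(math.sqrt(stat / 2))
    return stat, p

# hypothetical counts: 5100/10000 successes vs. 4900/10000 --
# a 2-percentage-point difference of little practical import
stat, p = chi2_2x2(5100, 4900, 4900, 5100)
print(stat, p)  # the p-value comes out well below 0.01
```

The 2-point difference may be irrelevant in practice, yet at n = 20,000 the test "rejects" resoundingly; that is exactly the red flag the students missed.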

On this point, I’d like to call readers’ attention to the ASA Symposium on Statistical Inference, to be held in Bethesda, MD on October 11-13, 2017. Under the leadership of ASA Executive Director Ron Wasserstein, we are in the process of putting together what promises to be a highly stimulating and relevant program, with prominent speakers, and most important, lots of interactive participation among attendees. Hopefully much of the discussion will address ways to get serious coverage of the issue into statistics curricula, rather than the present situation, in which stat instructors briefly make a one-time comment to students about practical significance vs. statistical significance.

Could the problem be the null hypothesis rather than the p-value itself?

I prefer not to do tests at all.

What do you rely on then? CIs? Visual Comparisons? Shift functions? other?

The short answer is yes, CIs, but follow the links in my posting to see the details.

Indeed. “A p-value” is often used as a shortcut for “a p-value of a standard hypothesis test of a sharp null hypothesis” (i.e. a null hypothesis having form {p=p0}). And this is quite unfortunate, I think. The problem is not the p-value, it is the sharpness of the null hypothesis.

Summit Suen: Well, there are various problems with the p-value itself, and also with the way p-values are usually reported and sometimes exploited by bad research practices. A null hypothesis might or might not make sense under different circumstances (yeah, it would be great to test two hypotheses derived from two distinct theories and see which one fits the data better, but you would need a Bayesian prior and "strong" theories to do that).

Testing is not a good way to see which theory fits the data better.

A confidence interval is just another way of looking at the p-value. It carries the same information, just a bit expanded (giving you boundaries instead of one value that rules them all). You are still not testing your hypothesis, just your data under some assumption. You were writing about not testing at all, yet you still use a value near 0 as a criterion. This makes even less sense than using p < 0.05. I have not seen the concepts of effect size (which can be used to deal with the problem of inflated p-values in large samples) or statistical power in your posts. These are the missing links that allow the proper use of frequentist inferential statistics (and you can report CIs and p-values as well).

You seem to be conflating the word “testing” with the phrase “making a decision.” I disapprove of the former but certainly support the latter.

As to checking whether a population value may be near 0, that is something a p-value simply cannot do.
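For instance, a quick sketch with made-up counts: a 95% confidence interval for the difference of two population proportions shows directly how close to 0 the population value plausibly is, something a lone p-value cannot convey.

```python
import math

def diff_ci(x1, n1, x2, n2, z=1.96):
    """Normal-approximation 95% CI for the difference of two population
    proportions (illustrative sketch; counts below are hypothetical)."""
    p1, p2 = x1 / n1, x2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    d = p1 - p2
    return d - z * se, d + z * se

# 51% vs. 49% with 10,000 per group: the interval sits near 0,
# so even though a test would "reject," the effect is plainly small
lo, hi = diff_ci(5100, 10000, 4900, 10000)
print(lo, hi)
```

The interval makes the magnitude of the effect, not just its "significance," the object of discussion.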

Given that your CV is much more impressive than mine, I bet you are right in the first part of your statement (a Bayesian way of thinking :)). Can you please refer me to some articles about "not testing, but deciding"? The second part, about a population value near zero, is a bit more problematic, and I disagree with you here. This is what cannot be done with the frequentist approach, even though theory and practice will differ. In theory you cannot do this, but I understand that in practice, with a large enough sample or while doing a meta-analysis, you can probably say that you are pretty sure whether the population parameter is or is not near zero. Still, it is not a logically valid statement, and you can never assess the degree of (un)certainty of this statement. I stopped reading Cumming's New Statistics because of his strong favouritism for CIs and lack of critique of their weaknesses; this is one of the problems that was not discussed properly in the book (in my opinion). As I finish a master's in psychology, I am really interested in how research practice will change in my lifetime (it is in transition from utter mess to something else right now), and also in whether Bayesian stats will be a part of the endeavour (more logically correct does not automatically mean more useful in practice, right?).

For details, I would suggest my open-source textbook on probability and statistics. heather.cs.ucdavis.edu/probstatbook

You and I don’t differ where you think we do. Actually, the difference in the way you and I view things is much more fundamental. You want some kind of Decision Theory, whereas I want statistics to give me estimates of quantities; I then make my decision based on many factors of interest to me, an informal process.

Thank you very much. I’ll take a look at your book. It seems there is a lot of advanced stuff I am interested in but do not understand yet. Also being open-source I might use parts of your book as a teaching material for courses, which I’ll probably teach in a few years. Still struggling with more complex mathematical notations and formulas, so it is gonna take me a while.

Your idea of statistics makes sense, and actually this was the way it was used back in the old days, if I understand it correctly (though they mostly used p-values because CIs were not widespread). But it requires a level of expertise which is rather rare in my line of work. “My” way of seeing and using statistics stems from a paternalistic viewpoint based on the assumption that, with statistically dumb researchers (again, I am dumb compared to a statistician, but rather well informed compared to the average researcher in my field), it is better in the long term to rely on an arbitrary cutoff point rather than giving researchers your kind of freedom. Which kind of makes sense, but again it has resulted in not understanding (not wanting to understand) even the most basic principles behind using this cutoff point. This is something which (I think) cannot be changed by using Bayesian or whatever statistics, but only by elevating the statistical knowledge of common researchers. I am sorry I changed the topic a bit, but it goes back to my conviction that what is most correct in theory might not be optimal in a real-life setup.

Anyway, thank you for the discussion, and again big thanks for your book.

Try explaining CIs to a nontechnical person. They will immediately get it, in part because they are familiar with the concept of margin of error. Then try explaining p-values; you will hit much resistance. I rest my case.

Also, all other things being equal, the smaller the p-value, the smaller the chance of a Type I error. This is something which, in my opinion, is best illustrated by a p-value plus a graphical CI (rather than a CI plus a graphical CI).

For those of us who believe that the concepts of Type I and II errors themselves are irrelevant to the real world, such considerations don’t matter.

— For those of us who believe that the concepts of Type I and II errors themselves are irrelevant to the real world, such considerations don’t matter.

wow!!! that deserves more than a comment. have you, or someone you regard, written a lengthy takedown of Type I and II?

but what might fit in a comment: if (and, of course, I accept the if) p-value/NHST is just another application of math’s proof by contradiction, then what’s the problem? if p-value/NHST isn’t such, why not?

The problem is that the null hypothesis is in almost all cases meaningless, known to be false a priori.

— The problem is that the null hypothesis is in almost all cases meaningless, known to be false a priori.

not to PhRMA or the FDA, it sure isn’t.

I disagree. Make your case.

— I disagree. Make your case.

clinical trials compare response, some primary endpoint, between a candidate and either SoC or placebo. the null hypo is that the candidate is no different. it is not assumed that a candidate is better, a priori. why would one wish to do that? the Pure Food and Drug Act (and its successors) exists simply because before then all manner of evil compounds were foisted on the populace.

if, and only if, the candidate performs “better”, meeting the primary endpoint to a pre-specified level of significance (and be clinically meaningful) will FDA consider the candidate.

it’s also notable that neither FDA nor EMA nor drug companies seek to implement Bayes for approvals. they know there be dragons that way.

again, it’s just a specific case of proof by contradiction.

if you’ve a better way to consider new compounds, I’ll certainly listen.

I didn’t say we ASSUME H0 is false; we KNOW it is false. I disagree that the pre-specified level of significance is clinically meaningful. Glad to hear they don’t allow Bayes.

I’m just starting out in Stats, and it’s definitely been drilled into me throughout undergrad so far that p-values are used in really problematic ways, and to favor confidence intervals/being very careful about making claims re: significance.

But I’m not quite sure I understand why having a large sample size specifically is a problem for the resulting p-values. Is it because the resulting p-value will have an exaggerated statistical significance thanks to such a large n, which could misrepresent its practical significance?

Or is it just generally because, even in cases where there’s a large sample, the type of hypothesis test a p-value represents doesn’t necessarily translate to meaningful conclusions about the population?

Sorry for the basic question!

It would be better for you to follow the links in my posting, but you basically have it right above.

Hi,

I need to take a deeper look at your book to fully grasp what you mean. I agree on the problem of the p-value being used as a “holy” value for everything, but I still think that in certain situations it is relevant, especially for “rejecting”.

The first example coming to mind is an independence test between populations, Fisher’s exact test for example. When you are examining a contingency table and want to reject the null hypothesis of independence between two groups, how would you do that without testing? This has real-world implications, especially in the medical field (for example, the impact of smoking or diabetes on heart attack frequency). I tend not to trust human judgment when it comes to analyzing “raw numbers”. (A CI for the odds ratio is also output, which I always report, but as stated above, it’s an “extension” of the p-value.)

Thank you for your thoughts.

If you believe that a CI is an extension of a p-value, there is not much I can say to you. 🙂

Yes, you can form a CI for the odds ratio, but I would prefer to simply form CIs on the proportions. A more sophisticated way would be to run a log-linear model and then form CIs on the coefficients (or at least look at their standard errors).
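A minimal sketch of the simpler approach, with made-up counts for a smoking vs. heart-attack table: report each group's proportion with its own interval and let the reader compare magnitudes.

```python
import math

def prop_ci(x, n, z=1.96):
    """95% Wald (normal-approximation) CI for a single proportion.
    Illustrative only; the counts used below are hypothetical."""
    p = x / n
    se = math.sqrt(p * (1 - p) / n)
    return p - z * se, p + z * se

# hypothetical 2x2 data: heart-attack counts among smokers and nonsmokers
for label, x, n in [("smokers", 84, 1200), ("nonsmokers", 45, 1300)]:
    lo, hi = prop_ci(x, n)
    print(label, x / n, (lo, hi))
```

Two proportions with their intervals are directly interpretable quantities, whereas an odds ratio takes an extra mental step for most readers.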

The problem, though, is that most people don’t want to go to this much work. A p-value is convenient and simplistic, thus appealing, sad to say.

I definitely didn’t use the right word when saying “extension”. I probably should’ve used “linked”, as in a measure of distance between the tested distribution and the actual distribution, which could be interpreted as a measure of distance between the value we are testing and the confidence interval we find.

I usually fit log-linear models afterwards, once I know that independence has been rejected, but one might argue that it’s a form of cheating by having a first peek at the data and choosing variables before.

But don’t get me wrong, I strongly agree with what you’ve said about the simplistic nature of the p-value in a lot of your articles; I just wanted to pick your brain on that!

Thanks for the mention about the parallel R presentation. This link will be a bit better: https://github.com/clarkfitzg/junkyard/blob/master/iidata/parallelR/parallelR.ipynb