Why Are We Still Teaching t-Tests?

September 15, 2014 matloff 68 Comments

My posting about the statistics profession losing ground to computer science drew many comments, not only here in Mad (Data) Scientist, but also in the co-posting at Revolution Analytics, and in Slashdot. One of the themes in those comments was that Statistics Departments are out of touch and have failed to modernize their curricula. Though I may disagree with the commenters’ definitions of “modern,” I have in fact long felt that there are indeed serious problems in statistics curricula.

I must clarify before continuing that I do NOT advocate that, to paraphrase Shakespeare, “First thing we do, we kill all the theoreticians.” A precise mathematical understanding of the concepts is crucial to good applications. But stat curricula are not realistic.

I’ll use Student t-tests to illustrate. (This is material from my open-source book on probability and statistics.) The t-test is an exemplar for the curricular ills in three separate senses:

Significance testing has long been known to be under-informative at best, and highly misleading at worst. Yet it is the core of almost any applied stat course. Why are we still teaching — actually highlighting — a method that is recognized to be harmful?
We prescribe the use of the t-test in situations in which the sampled population has an exact normal distribution — when we know full well that there is no such animal. All real-life random variables are bounded (as opposed to the infinite-support normal distributions) and discrete (unlike the continuous normal family). [Clarification, added 9/17: I advocate skipping the t-distribution, and going directly to inference based on the Central Limit Theorem. Same for regression. See my book.]
Going hand-in-hand with the t-test is the sample variance. The classic quantity s² is an unbiased estimate of the population variance σ², with s² defined as 1/(n-1) times the sum of squares of our data relative to the sample mean. The concept of unbiasedness does have a place, yes, but in this case there really is no point to dividing by n-1 rather than n. Indeed, even if we do divide by n-1, it is easily shown that the quantity that we actually need, s rather than s², is a BIASED (downward) estimate of σ. So that n-1 factor is much ado about nothing.
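
The downward bias of s is easy to check by simulation. The following is a minimal sketch, not code from the book: the sample size, the number of replications, the seed, and the function name `average_sample_sd` are all arbitrary choices made for illustration, and the data are simulated as normal only for convenience.

```python
import math
import random

def average_sample_sd(n, sigma, reps, seed=0):
    """Average the n-1 sample standard deviation s over many simulated normal samples."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        sample = [rng.gauss(0.0, sigma) for _ in range(n)]
        mean = sum(sample) / n
        s_squared = sum((x - mean) ** 2 for x in sample) / (n - 1)  # unbiased for sigma^2
        total += math.sqrt(s_squared)  # s itself, the quantity actually used
    return total / reps

sigma = 1.0
avg_s = average_sample_sd(n=5, sigma=sigma, reps=20000)
print(f"sigma = {sigma}, average of s = {avg_s:.3f}")
```

With samples this small the average of s should land visibly below σ, even though s² is computed with the n-1 divisor.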

Right from the beginning, then, in the very first course a student takes in statistics, the star of the show, the t-test, has three major problems.

Sadly, the R language largely caters to this old-fashioned, unwarranted thinking. The var() and sd() functions use that 1/(n-1) factor, for example — a bit of a shock to unwary students who wish to find the variance of a random variable uniformly distributed on, say, 1,2,…,10.
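
To make the surprise concrete, here is a small sketch in Python rather than R (an assumption of convenience; the standard `statistics` module draws the same distinction). The variance of a random variable uniform on 1,2,…,10 is 8.25, which is what the divisor-n version gives; the divisor-(n-1) version, the one R's var() reports, gives about 9.17.

```python
import statistics

values = list(range(1, 11))  # the support 1, 2, ..., 10

pop_var = statistics.pvariance(values)   # divisor n: the variance of the uniform distribution
samp_var = statistics.variance(values)   # divisor n-1: the analogue of R's var()

print(f"divisor n:   {pop_var}")
print(f"divisor n-1: {samp_var:.4f}")
```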

Much more importantly, R’s statistical procedures are centered far too much on significance testing. Take ks.test(), for instance; all one can do is a significance test, when it would be nice to be able to obtain a confidence band for the true cdf. Or consider log-linear models: The loglin() function is so centered on testing that the user must proactively request parameter estimates, never mind standard errors. (One can get the latter by using glm() as a workaround, but one shouldn’t have to do this.)
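
One way to get such a band, sketched here in Python, uses the Dvoretzky-Kiefer-Wolfowitz inequality. This is a swapped-in technique, not something ks.test() provides, and the function name `dkw_band`, the simulated data, and the seed are hypothetical. The band is the empirical cdf plus or minus a fixed half-width, clipped to [0, 1]; the sketch assumes continuous data, so ties are not handled specially.

```python
import math
import random

def dkw_band(sample, alpha=0.05):
    """Return the half-width and (x, lower, upper) triples of a 1-alpha band for the cdf."""
    n = len(sample)
    # DKW: P(sup |F_n - F| > eps) <= 2 exp(-2 n eps^2); solve for eps at level alpha
    eps = math.sqrt(math.log(2.0 / alpha) / (2.0 * n))
    band = []
    for i, x in enumerate(sorted(sample), start=1):
        ecdf = i / n  # empirical cdf at the i-th order statistic
        band.append((x, max(0.0, ecdf - eps), min(1.0, ecdf + eps)))
    return eps, band

rng = random.Random(5)
sample = [rng.gauss(0.0, 1.0) for _ in range(100)]  # hypothetical data
eps, band = dkw_band(sample)
print(f"half-width of the 95% band: {eps:.4f}")
```

The result says how far the true cdf could plausibly be from the empirical one everywhere at once, which is more informative than a single accept/reject decision.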

I loved the suggestion by Frank Harrell in r-devel to at least remove the “star system” (asterisks of varying numbers for different p-values) from R output. A Quixotic action on Frank’s part (so of course I chimed in, in support of his point); sadly, no way would such a change be made. To be sure, R in fact is modern in many ways, but there are some problems nevertheless.

In my blog posting cited above, I was especially worried that the stat field is not attracting enough of the “best and brightest” students. Well, any thoughtful student can see the folly of claiming the t-test to be “exact.” And if a sharp student looks closely, he/she will notice the hypocrisy of using the 1/(n-1) factor in estimating variance for comparing two general means, but NOT doing so when comparing two proportions. If unbiasedness is so vital, why not use 1/(n-1) in the proportions case, a skeptical student might ask?
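
The inconsistency shows up directly in the arithmetic. For 0/1 data, the usual plug-in variance p̂(1-p̂) is exactly the divisor-n sample variance, not the divisor-(n-1) one. A small sketch with made-up data (the ten outcomes below are hypothetical):

```python
import statistics

data = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]  # hypothetical 0/1 outcomes
n = len(data)

p_hat = sum(data) / n
plug_in = p_hat * (1 - p_hat)        # what the standard proportions formulas use
var_n = statistics.pvariance(data)   # divisor n
var_n1 = statistics.variance(data)   # divisor n-1

print(f"p_hat(1 - p_hat) = {plug_in:.4f}")
print(f"divisor n        = {var_n:.4f}")
print(f"divisor n-1      = {var_n1:.4f}")
```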

Some years ago, an Israeli statistician, upon hearing me kvetch like this, said I would enjoy a book written by one of his countrymen, titled What’s Not What in Statistics. Unfortunately, I’ve never been able to find it. But a good cleanup along those lines of the way statistics is taught is long overdue.

68 thoughts on “Why Are We Still Teaching t-Tests?”

Robert says:

September 16, 2014 at 5:25 am

I don’t mind t-tests per se, but there are a lot of crappy little courses where there’s only time for one stat procedure, so it’s the t-test. Then they come to do some project and find it doesn’t work. And they can tell when we’re fudging the issue just to avoid explaining the real grown-up stuff to them. Much better to teach them bootstrapping or MCMC as a generally useful tool, which I contend would not take much longer, especially to the poor level of understanding achieved on the t-test.

1. matloff says:
  
  September 16, 2014 at 10:47 am
  
  I’m not a fan of MCMC, but I fully agree with your point: Teaching useful, meaningful procedures can be done successfully even in a short, introductory course.
  
DR says:

September 16, 2014 at 6:08 am

Here you go: http://www.jstor.org/stable/2987957

don caldwell says:

September 16, 2014 at 6:45 am

what is not what in statistics, the paper maybe?
http://www.jstor.org/discover/10.2307/2987957?uid=3739808&uid=2&uid=4&uid=3739256&sid=21104634936627

Mark says:

September 16, 2014 at 7:27 am

Why does everybody complain about the Normality assumption in tests when (1) the test is leveraging the CLT since it’s testing means and (2) it’s just an assumption to increase power? Sure, we could all go to non-parametric tests, but if your mean is sufficiently normally distributed, why not leverage that?

Testing for significance and computing CI are two sides of the same coin – they rely on some underlying distribution of the data and they inform you of what is likely given that distribution. Why knock one and ask for the other?

Certainly statistics is a dated field – most of its common core came from an age when lots of assumptions were necessary because the math was too difficult to work out all the time. The fact that anyone is taught to use a table of critical values is absurd. But what is statistics if not a rigorous method for testing a hypothesis? In the end, you need to make a decision about your hypothesis and testing is one way of doing that.

1. matloff says:
  
  September 16, 2014 at 10:45 am
  
  As I wrote in response to Joe’s comment, you are confusing the t-test with the Z-test, the latter being the one using the CLT.
  Since I advocate not using testing, power issues are irrelevant.
  Significance testing and confidence intervals are NOT equivalent. For example, if a CI for the difference of two means does not contain 0 but is near 0, there is an indication that the two means are not different in any practical sense. (We know a priori that they are not absolutely equal, to infinitely many decimal places.) As you point out, one must make a decision, but it should be a FULLY INFORMED one.
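
A sketch of the point about confidence intervals, with invented numbers: two groups of 200,000 each, standard deviation 15, means 100.3 and 100.0. The summary statistics and the function name `diff_ci` are hypothetical; the interval is the ordinary CLT-based one.

```python
import math
from statistics import NormalDist

def diff_ci(mean1, sd1, n1, mean2, sd2, n2, level=0.95):
    """CLT-based interval for mean1 - mean2, plus the corresponding Z statistic."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    diff = mean1 - mean2
    se = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    return diff - z * se, diff + z * se, diff / se

# Hypothetical summary statistics, chosen only to illustrate the argument
lo, hi, z_stat = diff_ci(100.3, 15.0, 200000, 100.0, 15.0, 200000)
print(f"95% CI for the difference: ({lo:.3f}, {hi:.3f})")
print(f"Z statistic: {z_stat:.2f}")
```

A test would report only a very large Z statistic. The interval shows in addition that the difference is a fraction of a point on a scale whose standard deviation is 15, which is the information needed for a practical decision.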
  
  1. jimmy says:
    
    May 24, 2015 at 3:07 pm
    
    http://andrewgelman.com/2014/12/11/fallacy-placing-confidence-confidence-intervals/#comment-202572
    http://andrewgelman.com/2014/12/11/fallacy-placing-confidence-confidence-intervals/#comment-202603
    
    “All confidence intervals are obtained by inverting tests
    and vice versa
    There is a 1-1 correspondence between confidence intervals and tests”
    
    hi dr matloff, i just wanted to provide a pointer to these comments by larry wasserman.
    
    1. matloff says:
      
      May 24, 2015 at 8:31 pm
      
      I don’t see your point. It’s a standard statement.
      
Jonathan Spencer says:

September 16, 2014 at 7:27 am

Thanks for the post, a good read as always.
That book was presumably this article:
What is Not What in Statistics
Louis Guttman Journal of the Royal Statistical Society. Series D (The Statistician) Vol. 26, No. 2 (Jun., 1977), pp. 81-107 Published by: Wiley (http://www.jstor.org/stable/2987957)

1. matloff says:
  
  September 16, 2014 at 10:38 am
  
  Thanks to you and others who pointed this out. No wonder I couldn’t find it — I thought it was a book, but it was actually a paper.
  
  1. Ellie Kesselman says:
    
    October 3, 2014 at 4:41 am
    
    The key to finding it was the correct title: “What is Not What in Statistics” rather than the catchier “What’s Not What in Statistics”. The author, statistician Louis Guttman, apparently had an entire series of “What is Not What” papers! Excerpt of book review, “Applied Psychological Measurement”, 1994, pp. 293-297 by Peter H. Schönemann, Purdue University via http://www.schonemann.de/LGREV.htm
    
    “Guttman’s What is not What in Statistics paper. In spite of its wide circulation, it does not yet seem to have had much impact on how statistics is taught to unsuspecting psychology majors.”
    
    1. matloff says:
      
      October 5, 2014 at 8:19 pm
      
      Thanks, many pointed this out, and yes, great title, regardless of details.
      
Joe Liebig (@JochenLiebig) says:

September 16, 2014 at 8:34 am

‘We prescribe the use of the t-test in situations in which the sampled population has an exact normal distribution —’

Nope, that is inaccurate. The population can be any distribution. That’s the great thing about the limit theorem. Since the t-test is about the mean, it’s quite trivial to show that the sampling distribution is always Normal.

1. matloff says:
  
  September 16, 2014 at 10:37 am
  
  Using the Central Limit Theorem is NOT the t-test. The test statistic using CLT is compared to the normal distribution, not the Student-t distribution. And of course you mean that the sample mean is approximately normal.
  
  I advocate skipping the t-test, going directly to the CLT version (though here I am talking about confidence intervals, not tests). This is what I do in my book. (See link in my posting.)
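
As a rough illustration of the CLT version, the sketch below builds the interval from the sample mean and a divisor-n variance, then checks by simulation how often it covers the true mean when the population is clearly non-normal (exponential with mean 1). The sample size, replication count, seed, and function names are arbitrary assumptions, not taken from the book.

```python
import math
import random
from statistics import NormalDist

def clt_interval(sample, level=0.95):
    """Interval for the population mean using the normal approximation to the sample mean."""
    n = len(sample)
    mean = sum(sample) / n
    var = sum((x - mean) ** 2 for x in sample) / n  # divisor n
    half = NormalDist().inv_cdf(0.5 + level / 2) * math.sqrt(var / n)
    return mean - half, mean + half

def coverage(n, reps, seed=1):
    """Fraction of simulated exponential(1) samples whose interval contains the true mean 1."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        sample = [rng.expovariate(1.0) for _ in range(n)]
        lo, hi = clt_interval(sample)
        hits += lo <= 1.0 <= hi
    return hits / reps

cov = coverage(n=200, reps=2000)
print(f"estimated coverage of the nominal 95% interval: {cov:.3f}")
```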
  
  1. Ellie Kesselman says:
    
    October 6, 2014 at 1:30 am
    
    You’re a sweetie! That’s in regard to your prior reply; you have such good manners.
    
Philip Branning says:

September 16, 2014 at 8:57 am

I was exposed to Judd and McClelland’s model comparison approach to teaching statistics, and I found it to be extremely lucid and easy to understand. I’m always surprised that this isn’t taught in more places, though I’d guess it’s because those guys are psychologists, rather than statisticians. In this approach, the regression model is the primitive, and all statistical questions are posed as comparisons between models. Within this framework, you can develop many statistical tests “from first principles”, and you can easily adapt a test to your problem. Though it’s taught as the graduate statistics course for psychology and other social scientist PhD candidates at CU Boulder, it isn’t mathematically demanding in any way that would prevent it from being taught at a secondary level.

1. matloff says:
  
  September 16, 2014 at 10:49 am
  
  This is really interesting. I myself view regression as the primitive (though my book doesn’t present things that way). Do you have some good links to their work?
  
  1. Philip Branning says:
    
    September 16, 2014 at 1:35 pm
    
    They have a website for the book here: http://www.dataanalysisbook.com/about.html
    Unfortunately, it’s pretty skeletal. Perhaps this is why the word isn’t out there! If you are interested, I’d suggest emailing or calling Chick (Charles) Judd. He taught the class the year I took it, and he would probably explain the approach and perhaps send you a copy of the book if you asked.
    
  2. Philip Branning says:
    
    September 16, 2014 at 1:45 pm
    
    They describe the approach in some detail in an AnnRevPsych paper here:
    Judd, C. M., McClelland, G. H., & Culhane, S. E. (1995). Data analysis: Continuing issues in the everyday analysis of psychological data. Annual review of psychology, 46(1), 433-465.
    Google Scholar link:
    http://scholar.google.com/scholar?cluster=14481702516523332591
    
  3. pneumatico says:
    
    March 26, 2018 at 5:55 pm
    
    Here’s a PDF of that Annual Review of Psychology paper.
    
    Click to access Data-Analysis-Continuing-Issues-in-the-Everyday-Analysis-of-Psychological-Data.pdf
    
Robert Young says:

September 16, 2014 at 11:53 am

I certainly don’t disagree, but I don’t see a clear alternative path. The pedagogical point of t or Z or (elementary) probability or … You know, all that “foundational” stuff needed to intellectually justify regression or MCMC ( 🙂 ) or factor analysis (any old psychometricians in the audience?) or …

Just as we teach kids arithmetic addition before calculus, we teach simple sample tests before the “hard” stuff. So, and I’ve been wrestling with this question too much lately, what to do?

If we take the “learn them how to do Excel” approach of skipping both the theory and the underlying analytics, we get quant/stat fools like Li and The London Whale. They know how to click the buttons, but haven’t a clue what’s going on inside that black box.

OTOH, we can continue to insist on “foundation” before edifice, thus repelling some, possibly bright, folks.

Years ago I taught, to working analysts of a sort, one week stat/quant courses (no stat background required). We built the materials ourselves, and stressed the mechanics of using the command set available (mostly mainframe packages) and the measure assumptions being made. “Here’s your bunch of data. Here’s how to do linear regression or PCA or … Here’s where the booby traps are. Be careful where you step”. We skipped the balls and urns and only mentioned Student in passing because of the beer.

1. Norm Matloff says:
  
  September 16, 2014 at 12:33 pm
  
  There are really only two basic concepts in statistics — bias and variance. I mean bias in the broad sense, including the problems of not taking into account important covariates.
  If the background of the students permits, one should make good use of calculus and linear algebra. (This is all I mean when I talk of math; I’m not referring to theorems and proofs.) But for students without this background, one can do quite well by repeated-sampling arguments and the like. (Sorry for the frequentist view. 🙂 ) And this can be done quickly; I teach on a quarter system, so know how to gauge such things. 🙂
  
Jason Liao says:

September 16, 2014 at 2:23 pm

Dividing by n-1 instead of n is primarily motivated by the concept of degrees of freedom. I consider this a fundamental concept, especially when you deal with complex models.

1. matloff says:
  
  September 16, 2014 at 5:05 pm
  
  That’s a really interesting remark, thanks. But could you elaborate? I’m not sure what specifically you have in mind. Certainly one should be aware of how many “free” parameters there are in one’s model, but that doesn’t imply that we have to divide by n minus that number.
  
  1. Jason Liao says:
    
    September 17, 2014 at 2:35 pm
    
    The most compelling argument for dividing by n-1 is as follows. Let $x_1, \ldots, x_n \sim N(\mu, \sigma^2)$. If you know $\mu$, then the MLE is $\frac{1}{n} \sum_i (x_i - \mu)^2$. If you do not know $\mu$, you use the mean of x, $\bar{x}$, in place of $\mu$. It turns out, however, that $\sum_i (x_i - \bar{x})^2$ has the same distribution as $\sum_i (x_i - \mu)^2$ summed over only n-1 terms.
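
The degrees-of-freedom claim can be spelled out with a standard decomposition. The block below is a sketch of the textbook argument under the comment's assumption of i.i.d. normal data; the expectation line needs only i.i.d. data with finite variance, while the chi-square line does use normality.

```latex
% Split the sum of squares about the true mean \mu at the sample mean \bar{x};
% the cross term vanishes because \sum_i (x_i - \bar{x}) = 0.
\sum_{i=1}^{n} (x_i - \mu)^2
  = \sum_{i=1}^{n} (x_i - \bar{x})^2 + n(\bar{x} - \mu)^2

% Take expectations, using E(x_i - \mu)^2 = \sigma^2 and E(\bar{x} - \mu)^2 = \sigma^2 / n:
n\sigma^2 = E\Big[\sum_{i=1}^{n} (x_i - \bar{x})^2\Big] + \sigma^2
\quad\Longrightarrow\quad
E\Big[\sum_{i=1}^{n} (x_i - \bar{x})^2\Big] = (n-1)\sigma^2

% Under normality the distributional statement is
\frac{1}{\sigma^2} \sum_{i=1}^{n} (x_i - \bar{x})^2 \sim \chi^2_{n-1}
```

So replacing μ by the sample mean uses up one term's worth of variation, which is where the n-1 comes from.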
    
    This is related to the so called restricted MLE (reml) for variance components.
    
  2. Norm Matloff says:
    
    September 17, 2014 at 4:40 pm
    
    (In reference to Jason’s Sept. 17 reply.)
    
    That’s interesting stuff, thanks. But I still don’t get it. It seems to me that the relevance of your argument to dividing by n-1 (as opposed to relevance to other issues) really boils down to saying we need to divide by n-1 to achieve an unbiased estimate of σ² — which I dismissed as being of no apparent use, especially when one recognizes that even with a divisor of n-1, the quantity that really counts, s, is STILL biased (simple application of Jensen’s Inequality). In other words, back to Square 1.
    
    The one thing that changes, it seems to me, is that in your context of variance components, we really ARE interested in σ² rather than σ. But that’s very different from the t-test situation I brought up.
    
    Also, I think even your variance components model would clash with another one of the issues I brought up in my blog posting. If I recall correctly, variance component models are quite nonrobust to the assumption of normal populations. I would assume that some research may take care of that to some degree, but my point is that I believe that statistics courses shouldn’t teach methods that depend heavily on assumptions of exact continuous population distribution models.
    
Mervyn Thomas says:

September 16, 2014 at 4:15 pm

I have run statistics operations in quite large public and private sector organisations, and directly supervised many masters and PhD level statisticians. The biggest problem I had with new statisticians was helping them to understand that nobody else cares about the statistics.

Of course the statistics is important, but only in so far as it helps produce solid and reliable answers to problems – or reveals that no such answers are available with current data. Nearly everybody is focussed on their own problems. The trick is producing results and reports which address those problems in a rigorous and defensible way.

In a sense, I see applied statistics as more of an engineering discipline – but one that makes careful use of rigorous analysis.

I believe that statistics departments have largely missed the boat with data science (except for a few stand out examples like Stanford), and that the reason is that many academic statisticians have failed to engage with other disciplines properly. Of course, there are very significant exceptions to that – Terry Speed for example.

One of the most telling examples of that for me is the number of times academic statisticians have asked if I or my life science collaborators could provide them with data to test an approach, without actually wanting to engage with the problem that generated the data.

Relevance comes from engagement, not from rarefied brilliance. There is no better example of that than Fisher.

Does it matter? Yes because I see other disciplines reinventing the statistical wheel – and doing it badly.

1. matloff says:
  
  September 16, 2014 at 5:01 pm
  
  Very interesting comments. I largely agree.
  
  Sadly, my own campus, the University of California at Davis, illustrates your point. To me, a big issue is joint academic appointments, and to my knowledge the Statistics Dept. has none. This is especially surprising in light of the longtime (several decades) commitment of UCD to interdisciplinary research. The Stat. Dept. has even gone in the opposite direction: The Stat grad program used to be administered by a Graduate Group, a unique UCD entity in which faculty from many departments run the graduate program in a given field; yet a few years ago, the Stat. Dept. disbanded its Graduate Group. I must hasten to add that there IS good interdisciplinary work being done by Stat faculty with researchers in other fields, but still the structure is too narrow, in my view.
  (My own department, Computer Science, has several appointments with other disciplines, and more important, has actually expanded the membership of its Graduate Group.)
  I would say, though, that I think the biggest reason Stat (in general, not just UCD) has been losing ground to CS and other fields is not because of disinterest in applications, but rather a failure to tackle the complex, large-scale, “messy” problems that the Machine Learning crowd addresses routinely.
  
  1. Mervyn Thomas says:
    
    September 16, 2014 at 5:19 pm
    
    “a failure to tackle the complex, large-scale, “messy” problems that the Machine Learning crowd addresses routinely.” Good point! I have often struggled with junior statisticians wanting to know whether or not an analysis is ‘right’ rather than fit for purpose. That’s a strange preoccupation, because in 40 years as a professional statistician I have never done a ‘correct’ analysis. Everything is predicated on assumptions which are approximations at best.
    
Ilya Kipnis says:

September 17, 2014 at 5:33 pm

I feel similarly about the T-test. For one, it’s really unintuitive, and in this age of easily collectible data, makes less and less sense. Essentially, the way I view the interpretation of the P-value is “if there were no relationship whatsoever, what is the probability that we would simply observe the relationship by chance?”, which makes little sense, because, unless there’s absolutely no relationship between the predictor and the predicted, (x and y, independent and dependent), of *course* you’re going to have significance given enough data.

P value not less than .05? Collect more data. Still not less than .05? Collect more data still. Eventually, you’ll get enough data. It’s a way to cheat the test, and thus the test only makes sense if data is hard to come by–which it was in the early 20th century when this outdated piece of work was invented.
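
The "collect more data until it's significant" strategy can be simulated. The sketch below is a deliberately simple setup, all of it assumed for illustration: the null hypothesis is exactly true (mean 0, known σ = 1), a Z-test is run after every batch of 20 observations, and sampling stops at the first |Z| > 1.96 or after 20 batches. The function name and seed are hypothetical.

```python
import math
import random

def peeking_rejection_rate(batch, max_batches, reps, seed=2, z_crit=1.96):
    """Fraction of null-true experiments that reach |Z| > z_crit at some interim look."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        total, n = 0.0, 0
        for _ in range(max_batches):
            for _ in range(batch):
                total += rng.gauss(0.0, 1.0)  # data generated under the null
            n += batch
            z = total / math.sqrt(n)  # Z statistic for mean 0 with known sigma = 1
            if abs(z) > z_crit:
                rejections += 1
                break  # stop collecting as soon as the result is "significant"
    return rejections / reps

rate = peeking_rejection_rate(batch=20, max_batches=20, reps=2000)
print(f"rejection rate with repeated looks: {rate:.3f} (nominal level 0.05)")
```

The rejection rate comes out well above the nominal 5%, so an unadjusted test applied this way is not the test it claims to be.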

And another thing, why do we even need to assume the distribution of the data? Normal distribution, F-distribution, does it matter? If you have enough data, do some EDA on it. Plot it, run some quantile tests (which IMO work a lot better than confidence intervals), do some bootstrapping and see what your 5th percentile or 2.5th percentile is, etc…
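
For readers who have not seen it, the bootstrap percentile idea takes only a few lines. This is a minimal sketch of the plain percentile method, with a simulated skewed sample standing in for real data; the function name, the seeds, and the number of resamples are arbitrary choices.

```python
import random
import statistics

def bootstrap_percentile_ci(sample, stat, reps=2000, level=0.95, seed=3):
    """Percentile interval: resample with replacement, recompute stat, read off the quantiles."""
    rng = random.Random(seed)
    n = len(sample)
    stats = sorted(stat(rng.choices(sample, k=n)) for _ in range(reps))
    lo_idx = round((1 - level) / 2 * reps)        # e.g. the 2.5th percentile
    hi_idx = round((1 + level) / 2 * reps) - 1    # e.g. the 97.5th percentile
    return stats[lo_idx], stats[hi_idx]

rng = random.Random(7)
sample = [rng.expovariate(1.0) for _ in range(60)]  # hypothetical skewed data

mean_ci = bootstrap_percentile_ci(sample, statistics.fmean)
median_ci = bootstrap_percentile_ci(sample, statistics.median)
print(f"mean:   ({mean_ci[0]:.3f}, {mean_ci[1]:.3f})")
print(f"median: ({median_ci[0]:.3f}, {median_ci[1]:.3f})")
```

No distributional family is assumed here, though the percentile method is itself an approximation that can be poor for very small samples.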

1. Mervyn Thomas says:
  
  September 18, 2014 at 12:28 am
  
  Ilya, there is a huge literature on sequential problems – see for example the great textbook by Jennison and Turnbull “Group Sequential methods with application to clinical trials”. There have, of course, been equally satisfying developments in adaptive and sequential designs from a Bayesian perspective. As I am sure you know, what you describe is a group sequential procedure, and an appropriate significance test has to take account of both the dependence between successive tests, and the multiplicity of repeated testing.
  
  In other words, the cheat you describe is simply wrong analysis (from a frequentist perspective).
  
  What you are saying seems to me to be “it’s not appropriate to use a t test when it’s not appropriate to use a t test”. That’s incontrovertible – but not really very useful.
  
Pradeep Mishra says:

September 17, 2014 at 11:18 pm

It cannot be denied that most of the statistics we know today was taught/invented 100 years back by prodigies like Francis Galton.
Not only is it important now to understand what is outdated, there is also a need to come up with new ideas.

http://www.praddy.in/are-computers-blinding-mathematicians/

1. matloff says:
  
  September 17, 2014 at 11:25 pm
  
  In my view, most of the old is still valid, while most of the new is actually repackaging of the old.
  
  1. Mayo says:
    
    October 5, 2014 at 9:27 pm
    
    agree
    
2. Mervyn Thomas says:
  
  September 17, 2014 at 11:52 pm
  
  I really disagree, Pradeep; what about Fisher “Statistical Methods for Research workers” 1925, The design of Experiments 1935, Nelder and Wedderburn 1972, Efron’s bootstrap 1979, the whole flowering of computationally intensive inference and Bayesian methods, Hastie and Tibshirani on generalised additive models (1983 onwards), generalised linear mixed models from Breslow and Clayton in 1993 (and very substantial later research in this topic from other authors)? In the span of my career, about 40 years, the face of statistical practice has changed enormously. This is a discipline which, despite its problems, remains dynamic, exciting and challenging.
  
Mayo says:

September 18, 2014 at 8:50 pm

Norm: Thank goodness your advice has yet to be taken or they never would be able to carry out fraudbusting:

Who ya gonna call for statistical Fraudbusting? R.A. Fisher, P-values, and error statistics (again)

Seriously, Norm, I have to say that the most bizarre part of your book and statistical philosophy is your cavalier rejection of statistical tests, especially while embracing confidence intervals (CIs are fine but I can show you they demand testing reasoning to avoid glaring fallacies).

Anything Tests Can do, CIs do Better; CIs Do Anything Better than Tests?* (reforming the reformers cont.)

Do CIs Avoid Fallacies of Tests? Reforming the Reformers

Being scared off by a bunch of misuses and misinterpretations of tests* rather than teaching their correct use and interpretation, is the most damaging portrayal of general frequentist statistics.

*and extreme views about assuming models must hold exactly in tests. Presumably tests of assumptions, being significance tests, would also be banished.

1. matloff says:
  
  September 18, 2014 at 11:45 pm
  
  Thanks for the provocative response to my hopefully-provocative posting, Deborah. 🙂 I’ll keep my comments brief here, but feel free to reply again here, or offline.
  
  It’s ironic that you use fraud detection as a counterexample, because I have a network intrusion example in that very same book of mine. We’re talking about two very different contexts here.
  
  As to tests vs. CIs, I’ll be interested, as always, in your philosopher’s take on the issue, but one thing is undeniable: CIs are more informative than significance tests; note my term “under-informative” in my post. I believe it’s also generally agreed upon that CIs enable one to avoid the pitfalls of testing (provided, of course, that they are not used as tests!).
  
  Finally, the “advice” on testing is hardly mine. As I wrote, it’s been recognized for a long time (there is even a mini-bibliography in my book), and I think one would be hard pressed to find many statisticians who disagree. The problem is not lack of consensus, but rather a convenient hypocrisy.
  
  1. Mervyn Thomas says:
    
    September 19, 2014 at 12:28 am
    
    You are absolutely right, in general confidence intervals are more informative. Of course there is a strong relationship between confidence intervals and significance tests – we can generate the CI by inverting the test. But in practical application the only times I regularly use tests are:-
    
    1) in massively multivariate contexts like gene expression or spectroscopy where I need to screen a large number of variables and discard those that are not interesting.
    2) In situations where there are composite tests around multiple parameters (typically model selection) and no single CI captures the important features of the problem.
    
    Confidence intervals are not without problems. Hacking (Logic of statistical inference pp 95-102, 159) points out that they are fundamentally before-trials rules. That is, before looking at the data we set up a rule which will generate an interval, then calculate the interval from the data. This can lead to some odd situations. The classic example is a sample of size two drawn from a distribution uniform on (\alpha-1,\alpha+1). Now the minimum and the maximum form a 50% confidence interval (they will include \alpha 50% of the times they are calculated, irrespective of the value of \alpha). But if the range is >1, the interval *must* contain \alpha. Here we have a 50% confidence interval which, in some realisations, contains the parameter with probability 1.
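
The uniform example is easy to verify numerically. In the sketch below the value α = 3, the seed, and the function name are arbitrary; the code draws pairs from the uniform distribution on (α-1, α+1) and records how often the (min, max) interval covers α, overall and among the draws whose range exceeds 1.

```python
import random

def simulate_uniform_pair(alpha, reps, seed=4):
    """Coverage of (min, max) for alpha: overall, and restricted to pairs with range > 1."""
    rng = random.Random(seed)
    covered = wide = wide_covered = 0
    for _ in range(reps):
        a = rng.uniform(alpha - 1, alpha + 1)
        b = rng.uniform(alpha - 1, alpha + 1)
        lo, hi = min(a, b), max(a, b)
        hit = lo <= alpha <= hi
        covered += hit
        if hi - lo > 1:  # the two points are too far apart to sit on one side of alpha
            wide += 1
            wide_covered += hit
    return covered / reps, wide_covered / wide

overall, given_wide = simulate_uniform_pair(alpha=3.0, reps=20000)
print(f"overall coverage:        {overall:.3f}")
print(f"coverage when range > 1: {given_wide:.3f}")
```

The overall figure sits near one half, while the conditional figure is exactly 1, which is the before-trials oddity being described.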
    
    Fisher insisted that we condition on ancillary statistics. In this case the range is ancillary. Indeed, if we do that, we derive another interval which does not have this silly property. But in Neyman Pearson terms, there is no real reason to do this.
    
    Most times people want something approximating a Bayesian credible interval – which, of course, never suffers from this before trials problem.
    
rockclimber112358 says:

September 19, 2014 at 5:10 am

I agree with most of your complaints, in particular that hypothesis testing is becoming outdated and the silliness of 1/(n-1) with the population variance. However, I strongly disagree with your argument about the t-test not being appropriate when we do not have exactly normal data. While this is an assumption of the test, the test is also very robust to violations of this assumption. Now, certainly the test may not work well with highly skewed data, but it can easily handle symmetric distributions with outliers. In particular, for a fixed degrees of freedom, the influence function for the center and scale are both bounded, which is a very nice property!

1. matloff says:
  
  September 19, 2014 at 8:34 am
  
  My original post was not clear enough on what I advocate using instead of the t-distribution. I did clarify that in replying to the comments, and later added a clarifying sentence to the original post:
  
  [Clarification, added 9/17: I advocate skipping the t-distribution, and going directly to inference based on the Central Limit Theorem. Same for regression. See my book.]
  
  You are correct that inference based on the t-distribution is robust to the normal-population assumption, i.e. a reasonable approximation, but the same is true for the CLT-based approach. Might as well just always use the latter.
  
matloff says:

September 19, 2014 at 8:58 am

Reply to Mervyn’s Sept. 19 comment:

Using tests for variable selection is risky. There is old research, for instance, showing that if one does this, one should use a very large α level. It would be interesting to know how well this works for groups of variables. I really like a comment Alan Miller makes in his second edition of Subset Selection in Regression, which is that he felt no progress had been made in the area since his first edition had been published. 🙂

The before-trials issue is unsolved too (at least for us frequentists).

1. Mervyn Thomas says:
  
  September 19, 2014 at 1:27 pm
  
  Re your comments about variable selection, absolutely. I use t test in a generic sense. Normally we use empirical Bayes moderated t tests with alpha controlled using either the Benjamini Hochberg approach (or similar) or Holm’s method.
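
For readers unfamiliar with the two adjustments named above, here is a minimal sketch of the Benjamini-Hochberg step-up rule and Holm's step-down rule applied to a list of p-values. The p-values are invented for illustration; in the setting described they would come from moderated t tests, which are not reproduced here.

```python
def benjamini_hochberg(pvals, q=0.05):
    """Indices rejected by the BH step-up procedure at false discovery rate q."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank * q / m:
            k = rank  # remember the largest rank that meets its threshold
    return sorted(order[:k])

def holm(pvals, alpha=0.05):
    """Indices rejected by Holm's step-down procedure at familywise level alpha."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    rejected = []
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= alpha / (m - rank + 1):
            rejected.append(i)
        else:
            break  # stop at the first p-value that fails its threshold
    return sorted(rejected)

# Hypothetical p-values from ten screened variables
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
bh_rejected = benjamini_hochberg(pvals)
holm_rejected = holm(pvals)
print("BH rejects:  ", bh_rejected)
print("Holm rejects:", holm_rejected)
```

Holm controls the familywise error rate and is the stricter of the two here; BH controls the false discovery rate and lets one more variable through.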
  
  This of course, hammers power. But fortunately in the gene expression world we often have very large effect sizes.
  
  You comment that the before trials issue isn’t resolved in the frequentist world. I mostly agree (is the fiducial approach frequentist? is Fisher doing anything other than making a mistake we would discard in a lesser genius?). I don’t think any of our philosophies of inference are complete. It is possible to find test cases that reveal problems with each of them. That is why statisticians were the original post modernists. We pick up and use a philosophy depending on the context. We would have more success searching for the philosopher’s stone than in trying to find a universally applicable logic of statistical inference.
  
joel rubinson says:

September 20, 2014 at 9:47 am

My perspective is based on marketing research. I say the following at conferences and in various blog posts (blog.joelrubinson.net): 80% of new products fail, 50% of ad campaigns show no lift in sales, yet market research tests things at the 90% confidence interval. What is wrong with this picture?

1. matloff says:
  
  September 20, 2014 at 10:48 pm
  
  This actually is related to a big discussion in Statistics these days — reproducibility. Lots of published papers turn out not to replicate. UCB’s Phil Stark organized a great session on this at this year’s JSM. A big problem is that the population one is sampling from is different in the original study from a subsequent one, due to differing lab conditions, for example. I suspect something like that is at work in your marketing examples.
  
  Reply
  1. joel rubinson says:
    
    September 21, 2014 at 2:01 pm
    
    There is another problem: asking questions where people are generating opinions on the fly. When you segment consumers on attitudinal questions, and then re-administer the same survey to the same people, say, 3 months later, you will only get 50-60% of respondents being classified into the same groups. Also, if you conduct the same survey across many online panels, the answers on attitudinal questions will vary by much more than sampling error, although fact-like questions (e.g. do you pay a monthly mortgage, do you smoke at least 1 cigarette per week) do not vary.
    
    Reply
2. Mayo says:
  
  September 26, 2014 at 8:36 pm
  
  If you’re testing then you’re not rejecting testing as Matloff recommends.
  But let me mention just a few of the problems with merely giving a confidence interval (with a testing supplement): the estimate treats all values in the interval on par, whereas in fact the members are warranted to extremely different extents. (We often see reports like “no information” when a null and value of interest are in a CI, but there’s a lot of info!) They are just as dichotomous as tests: plausible vs implausible values, UNLESS they are supplemented with testing reasoning which leads to a series of confidence intervals, or limits (which happen to correspond to severity assessments). You also require testing reasoning to form the upper CI limit when doing a one-sided lower CI, else your inference is of form (CI-lower, infinity). (It’s quite useless as well to have a report of a one-sided lower limit when your result fails to be statistically significant and what you need is an upper bound.) These points are written less hastily in the links I posted but which remain unaddressed.
  
  Reply
  1. Mervyn Thomas says:
    
    September 26, 2014 at 9:33 pm
    
    I agree it is as unreasonable to reject a confidence interval because it treats all results within the interval as equivalent, as it is to reject a test because it reduces all results to a spurious dichotomy – and vice versa.
    
    Testing and CIs are strongly linked; both are highly condensed summaries of the information in the likelihood function (or posterior distribution if you are Bayesian) taking due account of (or polluted by) sample space considerations. I think this is what you mean by a series of intervals or limits.
    
    We only get into problems when we use these tools slavishly and uncritically.
    
    I have seen many more problems caused by uncritical use of significance tests than by uncritical use of confidence intervals. But this could be a sample size issue – I’ve seen much more work unduly reliant on significance tests than on confidence intervals.
    
    We have to be careful of polemics because none of our systems of inference are entirely satisfactory.
    
    Reply
  2. matloff says:
    
    September 26, 2014 at 11:11 pm
    
    I certainly agree that CIs are often misused, such as your “no information” example clearly shows. But that is certainly not the way I view it. I’d suggest in particular that you read the example in my book in which the reader is asked to imagine serving as a campaign consultant. As I said earlier here, using a CI to check whether it contains a value of interest is defeating the purpose of the CI.
    
    The real problem is that many people want statistics to make their decisions for them.
    
    Reply
    1. Mayo says:
      
      September 26, 2014 at 11:23 pm
      
      It is not misusing them–they are not intended to distinguish points or give warrants for specified discrepancies from values of interest (as tests are). One needs to add a rationale for distinguishing the points, and that rationale is a testing one. Long run coverage won’t do.
      
      Reply
matloff says:

September 26, 2014 at 11:15 pm

Replying to Mervyn: Why must there be a likelihood function in the first place? Inference based on the CLT doesn’t use it, say for means (including regression functions). In more complex situations, maybe MLEs and the like are “necessary evils,” but I believe that people tend to reach that conclusion too quickly.
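To make the claim concrete, here is a minimal sketch of a CLT-based confidence interval for a mean. It uses only the sample mean, a variance estimate and a standard normal quantile; no density or likelihood is written down. The function name and the simulated die rolls are illustrative assumptions, and the divisor n (rather than n-1) follows the argument in the original post.

```python
import math
import random
from statistics import NormalDist, fmean


def clt_mean_ci(data, level=0.95):
    """Approximate CI for a population mean, justified by the CLT alone."""
    n = len(data)
    xbar = fmean(data)
    # Divisor n rather than n - 1, as the post argues; it hardly matters for large n.
    s2 = sum((x - xbar) ** 2 for x in data) / n
    z = NormalDist().inv_cdf(0.5 + level / 2)  # standard normal quantile
    half_width = z * math.sqrt(s2 / n)
    return xbar - half_width, xbar + half_width


# Hypothetical data: 500 die rolls, bounded and discrete, so certainly not normal.
rng = random.Random(0)
rolls = [rng.randint(1, 6) for _ in range(500)]

for level in (0.80, 0.95, 0.99):
    lo, hi = clt_mean_ci(rolls, level)
    print(f"{level:.0%} interval for the mean: ({lo:.3f}, {hi:.3f})")
```

Reporting the interval at several levels, as the loop does, is one simple way to show more than a single yes/no cutoff.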

Reply
1. matloff says:
  
  September 26, 2014 at 11:33 pm
  
  Deborah, you’ve lost me here. Who is “distinguishing points,” and in what sense?
  
  In terms of misuse, I was merely agreeing with your point that statements like “No information” are not warranted.
  
  Reply
  1. Mayo says:
    
    October 5, 2014 at 9:32 pm
    
    I was saying CIs weren’t intended/presented to serve the goal of distinguishing points. One needs a rationale; that’s what SEV gives.
    Can you explain CLT inference w/o likelihoods?
    
    Reply
  2. matloff says:
    
    October 6, 2014 at 8:23 pm
    
    In response to Deborah’s question, “Can you explain CLT inference w/o likelihoods?”, see my discussion with Mervyn.
    
    Reply
Mervyn Thomas says:

September 27, 2014 at 12:06 am

Surely inference based on the CLT is making use of the fact that the relevant statistic converges in distribution to a Normal, hence giving us an asymptotically normal likelihood – on which we base our asymptotically normal CIs and tests.

A likelihood function is not used simply to obtain an MLE, it is used (in a sense) to represent the sample information about the parameters conditional on the data.

Reply
1. matloff says:
  
  September 27, 2014 at 9:17 am
  
  Inference based on the CLT makes use of convergence in distribution, as you note. So our inference is based on the normal cdf, not the normal density, hence not on likelihood.
  
  And it’s even worse than that, because the CLT does NOT say that the density of the sample statistic converges to a normal density, and indeed it might not do so.
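One way to see the distinction is with a standardized binomial count. It is discrete, so it has no density at any sample size, yet its cdf approaches the standard normal cdf. The sketch below computes the largest gap between the two cdfs for a few values of n; the function names and the choice p = 0.3 are illustrative assumptions.

```python
import math
from statistics import NormalDist


def binom_pmf(n, k, p):
    """Binomial(n, p) probability of k successes, computed on the log scale."""
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log(1 - p))
    return math.exp(log_pmf)


def cdf_gap(n, p=0.3):
    """Largest gap between the cdf of a standardized Binomial(n, p) and the N(0,1) cdf."""
    phi = NormalDist().cdf
    mean, sd = n * p, math.sqrt(n * p * (1 - p))
    gap, cdf_left = 0.0, 0.0
    for k in range(n + 1):
        cdf_right = cdf_left + binom_pmf(n, k, p)  # cdf just after the jump at k
        target = phi((k - mean) / sd)
        # The binomial cdf is a step function, so check both sides of each jump.
        gap = max(gap, abs(cdf_left - target), abs(cdf_right - target))
        cdf_left = cdf_right
    return gap


for n in (10, 100, 1000):
    print(f"n = {n:4d}: largest cdf gap = {cdf_gap(n):.4f}")
```

The gap shrinks as n grows, which is all that CLT-based intervals and tests rely on. At no point does the binomial acquire a density that could converge to the normal one.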
  
  Reply
Mervyn Thomas says:

September 27, 2014 at 3:36 pm

Well yes, but we do have proofs of L1 convergence of the density and of Kullback Leibler convergence of the density, and this seems to me to be enough to motivate the idea of an asymptotic likelihood.
I am sure it is possible to dream up pathological examples where this breaks down. Imagining pathological cases in which a competing system fails has been one of the minor vices of statisticians for generations. The question is: does it matter in practice?

You do have an interesting blog. It challenges me and makes me reflect – and that’s something that happens too rarely in the hurly burly of statistical practice 🙂

Reply
1. matloff says:
  
  September 27, 2014 at 4:05 pm
  
  Those density convergence theorems, when applied to something like MLE, still assume the model is correct, and more important, assume that there actually is such a thing as a density. 🙂 Since in practice all data is discrete (e.g. due to finite precision of measurement), there are as many densities as there are unicorns. 🙂
  
  At any rate, we still have the issue that CLT-based inference only uses the normal cdf, not the normal density, so that likelihood really doesn’t enter into the picture.
  
  Reply
Mervyn Thomas says:

September 27, 2014 at 4:26 pm

I think it’s downright ungentlemanly to remind us that continuous distributional models are a (convenient) fiction 🙂 and in even worse taste to remind us that all models are based in assumptions which are, in an absolute sense, wrong 🙂

Suggesting that a response is in bad taste is, of course, proof positive that I can’t muster a coherent argument against you 🙂

Reply
1. Norm Matloff says:
  
  September 27, 2014 at 5:39 pm
  
  Well, that’s what we rabblerousers are for. 🙂
  
  Reply
Mayo says:

September 27, 2014 at 9:27 pm

I reviewed the part in your book on tests vs CIs. It was quite as extreme as I’d remembered it. I’m so used to interpreting significance levels and p-values in terms of discrepancies warranted or not that I automatically have those (severity) interpretations in mind when I consider tests. Fallacies of rejection and acceptance, relativity to sample size–all dealt with, and the issues about CIs requiring testing supplements remain (especially in one-sided testing which is common). This paper covers 13 central problems with hypothesis tests, and how error statistics deals with them.

Error_Statistics_2011.pdf

I remember many of the things I like A LOT about Matloff’s book. I’m glad he sees CIs as the way to go for variable choice (on prediction grounds) because it means that severity is relevant there too.

Reply
1. Norm Matloff says:
  
  September 28, 2014 at 11:46 pm
  
  Looks like a very interesting paper, Deborah (as I would have expected). I look forward to reading it. Just skimming through, though, it looks like I’ll probably have comments similar to the ones I made on Mervyn’s points.
  
  Going back to my original post, do you at least agree that CIs are more informative than tests?
  
  Reply
Laura Chihara says:

October 1, 2014 at 2:32 am

There are some newer texts introducing inference via resampling rather than the classical z/t-tests:
at the Intro Stats level:
Lock, Lock, Lock, Lock, Lock: “Statistics: Unlocking the Power of Data”

http://www.lock5stat.com/

and at the Math Stats level
Chihara, Hesterberg: “Mathematical Statistics with Resampling and R”

http://www.wiley.com/WileyCDA/WileyTitle/productCd-1118029852.html

Also, I know of another Intro Stats text starting with the resampling approach in the works (being published by Wiley).
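For readers unfamiliar with the resampling approach, here is a minimal sketch of its two basic tools: a bootstrap percentile interval for a mean and a two-sample permutation test. It is not taken from either textbook; the function names, the crude percentile indexing and the data are illustrative assumptions.

```python
import random
from statistics import fmean


def bootstrap_ci(data, stat=fmean, level=0.95, reps=2000, seed=0):
    """Percentile bootstrap interval: resample with replacement, take quantiles."""
    rng = random.Random(seed)
    n = len(data)
    stats = sorted(stat(rng.choices(data, k=n)) for _ in range(reps))
    lo_idx = int((1 - level) / 2 * reps)
    hi_idx = int((1 + level) / 2 * reps) - 1
    return stats[lo_idx], stats[hi_idx]


def permutation_pvalue(x, y, reps=2000, seed=0):
    """Two-sided permutation p-value for a difference in means."""
    rng = random.Random(seed)
    observed = abs(fmean(x) - fmean(y))
    pooled = list(x) + list(y)
    hits = 0
    for _ in range(reps):
        rng.shuffle(pooled)  # random relabelling of the two groups
        diff = abs(fmean(pooled[:len(x)]) - fmean(pooled[len(x):]))
        hits += diff >= observed
    return (hits + 1) / (reps + 1)  # add-one form, so the p-value is never zero


# Hypothetical measurements from two groups.
group_a = [12.1, 11.4, 13.0, 12.7, 11.9, 12.5, 13.3, 12.0]
group_b = [14.2, 13.8, 15.1, 14.6, 13.9, 14.9, 15.4, 14.0]

print("bootstrap interval for mean of A:", bootstrap_ci(group_a))
print("permutation p-value, A vs B:", permutation_pvalue(group_a, group_b))
```

Neither function refers to a t or normal distribution; the reference distribution comes from the data themselves.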

Reply
Mayo says:

October 5, 2014 at 9:39 pm

NYT article with slight ref to me

my post in relation to it.

Oy Faye! What are the odds of not conflating simple conditional probability and likelihood with Bayesian success stories?

Reply
1. matloff says:
  
  October 6, 2014 at 8:35 pm
  
  I agree that that NYT article was a terrible mischaracterization of the Bayesian approach. (As always, when I say Bayesian, I am referring only to use of subjective priors.)
  
  I see this phrasing, “Bayesians update their information,” a lot these days. If I didn’t know better, I’d guess that some group of Bayesians retained a public relations expert to come up with this slogan.
  
  Reply
  1. Mayo says:
    
    October 6, 2014 at 10:03 pm
    
    Matloff: they also downdate, and will happily change their prior to arrange to get the posterior they want after looking at the data. There’s nothing to stop this.
    
    Reply
    1. Mervyn Thomas says:
      
      October 6, 2014 at 10:47 pm
      
      @ Deborah – many Bayesians will use different priors in an analysis, I have done so myself – but only to demonstrate that the posterior is robust to the use of different priors. One approach that I use when working with informative priors is to look at the effective sample size (See Morita et al, Bayesian Anal. Volume 7, Number 3 (2012), 591-614). If the prior effective sample size is not small with respect to the study sample size, then there is a problem. Essentially, it says the inference is dominated by my preconceptions rather than the data.
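A minimal sketch of this kind of check, under a strong simplifying assumption: a conjugate Beta(a, b) prior with binomial data, where the prior is commonly treated as worth a + b pseudo-observations. This is not the general method of the cited paper. The function name, the priors and the data (30 successes in 100 trials) are hypothetical.

```python
def beta_binomial_summary(successes, n, a, b):
    """Conjugate update of a Beta(a, b) prior with binomial data."""
    post_a = a + successes
    post_b = b + (n - successes)
    return {
        "prior_ess": a + b,        # prior acts like a + b pseudo-observations
        "ess_ratio": (a + b) / n,  # prior ESS relative to the study sample size
        "posterior_mean": post_a / (post_a + post_b),
    }


# Hypothetical study: 30 successes in 100 trials, analysed under three priors.
successes, n = 30, 100
priors = {
    "flat Beta(1, 1)": (1, 1),
    "mild Beta(2, 2)": (2, 2),
    "strong Beta(50, 50)": (50, 50),
}
for label, (a, b) in priors.items():
    s = beta_binomial_summary(successes, n, a, b)
    print(f"{label:20s} ESS ratio = {s['ess_ratio']:.2f}, "
          f"posterior mean = {s['posterior_mean']:.3f}")
```

With the flat and mild priors the posterior mean stays close to the observed proportion. With the strong prior the ESS ratio is 1, so the prior carries as much weight as the data and pulls the posterior mean well away from it.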
      
      Of course, as you point out, there is no requirement to do this from a Bayesian perspective. It is, however, common practice. I think that there may well be more opportunities to fudge the result from a Bayesian analysis than from a frequentist one. In part that is because people are less familiar with the technical details of Bayesian methods, and therefore less able to be critical.
      
      Most applied statisticians are pretty eclectic; we will use Bayesian methods when they provide a computationally feasible way of solving a problem when other methods are less readily applicable. The choice is usually driven by practicality rather than a commitment to the Bayesian philosophy.
      
      Reply
