The Notion of “Double Descent”

I tend to be blase’ about breathless claims of “new” methods and concepts in statistics and machine learning. Most are “variations on a theme.” However, the notion of double descent, which has come into prominence in the last few years, is something I regard as genuinely new and very relevant, shedding important light on the central issue of model overfitting.

In this post, I’ll explain the idea, and illustrate it using code from my regtools R package.  (See also a related discussion, with a spline example by Daniela Witten.)

The idea itself is not new, but is taking special importance these days, in its possibly shedding light on deep learning. It seems to suggest an answer to the question, “Why do DL networks often work well, in spite of being drastically overparameterized?”

Classical statistical thinking, including in my own books, is that the graph of loss L e.g. misclassification rate, on new data against model complexity C should be U-shaped. As C first moves away from 0, bias is greatly reduced while variance increases only slightly. The curve is in descent. But once C passes a minimum point, the curve will go back up, as variance eventually overwhelms bias.

(Technical note: Bias here is the expected value of the difference between the predicted values of the richer and coarser models, under the richer one, and taken over the distribution of the predictors.)

As C increases, we will eventually reach the point of saturation (nowadays termed interpolation), in which we have 0 training error. A common example is linear regression with a polynomial model in a single predictor variable. Here C is the degree of the polynomial. If we have n = 100 data points, a 99th-degree polynomial will fit the training data exactly — but will be terrible in predicting new data.

But what if we dare to fit a polynomial of higher degree than 99 anyway, i.e. what if we deliberately overfit? The problem now becomes indeterminate — there is now no unique solution to the problem of minimizing the error sum of squares. There are in fact infinitely many solutions. But actually, that can be a virtue; among those infinitely many solutions, we may be able to choose one that is really good.

“Good” would of course mean that it is able to predict new cases well. How might we find such a solution?

I’ll continue with the linear regression example, though not assume a polynomial model. First, some notation. Let β denote our population coefficient vector, and b its estimate. Let ||b|| denote the vector norm of b. Let p denote the number of predictor variables. Again, if we have p predictor variables, we will get a perfect fit in the training set if p = n-1.

If you are familiar with the LASSO, you know that we may be able to do better than computing b to be the OLS estimate; a shrunken version of OLS may do better. Well, which shrunken version? How about the minimum-norm solution?

Before the interpolation point, our unique OLS solution is the famous

b = (X’X)-1 X’Y

This can also be obtained as b = X Y where X is a generalized inverse of X. It’s the same as the classic formula before interpolation, but it can be used after the interpolation point as well. And the key point is then that one implementation of this, the Moore-Penrose inverse, gives the minimum-norm solution.

This minimum-norm property reduces variance. So, not only will MP allow us to go past the interpolation point, it will also reduce variance, possibly causing the L-C curve to descend a second time! We now have a double U-shape (there could be more).

And if we’re lucky, in this second U-shape, we may reach a lower minimum than we had in the original one. If so, it will have paid to overfit!

There some subtleties here. The minimum norm means, minimum among all solutions for fixed p. As p increases, the space over which we are minimizing grows richer, thus more opportunities for minimizing the norm. On the other hand, the variance of that minimum norm solution is increasing with p. Meanwhile the bias is presumably staying constant, since all solutions have a training set error of 0. So the dynamics underlying this second U would seem to be different from those of the first.

Now, what about the implication for deep learning? DL computation typically uses stochastic gradient descent, and there has been theoretical and empirical work indicating that SGD tends to settle on a minimum-norm solution. This may explain the examples of success of DL in spite of overfitting.

Empirical illustration:

Here I’ll work with the Million Song dataset. It consists of 90 audio measurements made on about 500,000 songs (not a million) from 1922 to 2011. The goal is to predict the year of release from the audio.

I took the first p predictors, varying p from 2 to 30, and fit a quadratic model, with O(p2) predictors resulting.

One of the newer features of regtools is its qe*-series (“Quick and Easy”) of functions. They are extremely simple to use, all having the call format qeX(d,’yname’), to predict the specified Y in the data frame d. Again, the emphasis is on simplicity; the above call is all one needs, so for example there is no preliminary code for defining a model. They are all wrappers for standard R package functiona, and are paired with predict() and plot() wrappers.

Currently there are 9 different machine learning algorithms available, e.g qeSVM() and qeRF(), for SVM and random forests. Here we will use qePoly(), which wraps our polyreg package. Note by the way that the latter correctly handles dummy variables (e.g. no powers of a dummy are formed). Note too that qePoly() computes the Moore-Penrose inverse if we are past interpolation, using the function ginv() from the MASS package.

Again to make this experiment on a meaningful scale, I generated random training sets of size n = 250, and took the rest of the data as the test set. I used from 2 to 30 predictor variables, and used Mean Absolute Prediction Error as my accuracy criterion. Here are the results,

Well, sure enough, there it is, the second U. The interpolation point is between 22 and 23 predictors (there is no “in-between” configuration), where there are 265 parameters, overfitting our n = 250.

Alas, the minimum in the second U is not lower than that of the first, so overfitting hasn’t paid off here. But it does illustrate the concept of double-descent.

Code:

 

overfit <-  function(nreps,n,maxP)
 {
    load('YearData.save') 
    nas <- rep(NA,nreps*(maxP-1))
    outdf <- data.frame(p=nas,mape=nas)
    rownum <- 0
    for (i in 1:nreps) {
       idxs <- sample(1:nrow(yr),n)
       trn <- yr[idxs,] 
       tst <- yr[-idxs,]           
       for (p in 2:maxP) {
          rownum <- rownum + 1
          out<-qePoly(trn[,1:p+1)],
'V1',2,holdout=NULL) preds <- predict(out,tst[,-1]) mape <- mean(abs(preds - tst[,1])) outdf[rownum,1] <- p outdf[rownum,2] <- mape print(outdf[rownum,]) } } outdf #run through tapply() for the graph }

Efron Updates breiman’s “two cultures” essay

The June 2020 issue of JASA features a highly insightful essay by Brad Efron, dean of the world’s statisticians. The article is accompanied by commentary by a number of statistical luminaries. Important, indeed central questions are raised. I would hope JASA runs more pieces of this nature.

The essay may be viewed as an update of the classic 2001 article by Leo Breiman, regarding two largely separate cultures, statistics and machine learning (ML), updating not only in the sense of progress made since 2001, but also in the divide between Prediction and Description (“Attribution”) goals of regression analysis. (I of course use the latter term to mean estimation of E(Y | X), whether parametrically or not.)

I’ll provide my own commentary here, and, this being a statistics/R blog, bring in some related comments on R. I’ll have two main themes:

  • The gap between the two cultures is as wide today as ever.
  • There are fundamental problems with both cultural approaches, with a lot of “the emperor has no clothes” issues.

Parametric vs. Nonparametric Methods

Note carefully that the cultural difference is NOT one of parametric vs. nonparametric methods. For instance, statisticians were working on k-Nearest Neighbors as far back as the 1950s, if not earlier. Tree-based methods, notably random forests, were developed largely by statisticians in the 1970s, and arguably the best implementations are in R, the statisticians’ language, rather than Python, the MLers’ language.

Sampling Variation

Instead, I believe the widest gap between the two cultures involves sampling variation. It forms the very core of statistics, while ML mostly ignores it. I’ve observed this over the years in writings and statements by ML people, and in conversation with them.

This was exemplified in a recent Twitter exchange I had with a prominent ML researcher. I’d lamented that ML people don’t care about sampling variation, and the ML person responded, “What about causal models? Sampling variation is not an issue there.” But of course it IS an issue, a major one. Is the causal effect one seems to have found real, or just a sampling accident?

By the way, this extends to the research level, resulting in statistical work often focusing on asymptotics, while in ML the focus is on proving finite inequalities about the data itself, no connection to a population.

Frankly, I find that ML people just don’t seem to find statistics relevant to their work, and often don’t know much about stat. For example, in a book written by another prominent MLer, he recommends standardizing data to mean-0, variance-1 form before performing neural network analysis. Indeed, for reasons that seem to be unknown, this is mandatory, as the method may not converge otherwise. But I was shocked to see the MLer write that standardization presumes a Gaussian distribution, which is certainly not true.

R vs. Python (and R vs. R)

As noted, statisticians tend to use R, while ML people almost universally use Python. This in turn stems from the fact that ML has basically become a computer science field. Python is common, virtually standard in CS coursework and much of the CS research, so it is natural for MLers to gravitate to that language. R, of course, has been the lingua franca of stat.

That’s all fine, but it is adding to the cultural gap, to the detriment of both camps. Years ago I wrote, “R is written by statisticians, for statisticians” and that R is Statistically Correct. Sadly, in my view the Python ML software doesn’t meet that standard, which again stems from the general lack of statistical expertise in ML. On the other hand, there are a number of “full pipeline” packages in Python, notably for image classification, with nothing comparable in R.

To their credit, JJ Allaire of RStudio and Wes McKinney of the Python world have worked on tools to “translate” between the two languages. But I’m told by a prominent R developer that there is resistance on the Python side. Hopefully this will change.

Meanwhile, R itself is undergoing its own cultural split(s). On the language level, there is the base-R vs. Tidyverse split. (I consider the latter to be taking R in the wrong direction). But beyond that, a large and growing segment of R users see the language only as a tool for data wrangling (e.g. case filtering) rather than statistics. It should be viewed as both, and the lack of such a view will be problematic for stat users in the coming years, I believe.

“The Emperor Has No Clothes”

There are serious problems within each of the two cultures. So far, analysts have succeeded in blithely ignoring them, but I worry that this can’t continue.

Heavy use of cross-validation:

Efron notes, “A crucial ingredient of modern prediction methods is the training/test set paradigm…” This is combined with grid search to find the “best” combination of tuning parameters. (Called hyperparameters in the ML field. ML has its own terminology, e.g. inference for what stat people call prediction. Yet another cultural divide.)

For some ML methods, the number of tuning parameters can be large, and when combined in Cartesian product with many values for each one, we have a very large search space. That in turn raises a simultaneous inference problem, colloquially known today as p-hacking. The parameter combination that appears best may be far from best.

The fineTuning() function in my regtools package computes Bonferroni (Dunn) intervals for the various combinations as a way to partially deal with the problem. It also features a graphical exploration tool.

Similarly, some of those dazzling results from ML methods in competitions need to be taken with a grain of salt. There presumably is a p-hacking problem here as well; different contestants keep trying various parameter values and algorithm tweaks until they get a better value — whether by genuinely better technique or by random accident. A probability record-values analysis would be interesting here. Also see the impressive empirical findings in Rebecca Roelofs’ dissertation.

Smoothness assumptions:

To me, the most interesting, thought-provoking part of Efron’s essay is the material on smoothness. But it brings to mind another “emperor has no clothes” issue.

The fact is that in real life, there is no such thing as smoothness, since all measurements are discrete rather than continuous. At best, human height is measured, say, only to 0.2 centimeter and air temperature to, say, 0.1 degree. Hence there are not even first derivatives, let alone higher-order ones. This calls into question stat theory (“Assume a continuous second derivative…”), and more importantly, any formulaic (as opposed to aesthetic) method for choosing the smoothness of a fitted curve.

Overfitting:

Classic parametric theory has results along the lines of needing p = o(sqrt(n)) to obtain consistent estimates. There are various nonparametric results of this nature. There are more recent results regarding p > n, but with conditions that are typically unverifiable.

On the ML side, adherents have shown a number of successes in which p >> n. A typical “optimal” neural network may, after all the dust clears, have tens of millions of parameters (hidden-layer weights) with n only in the tens of thousands. Some NN specialists have even said overfitting is actually desirable. Yet, to my knowledge, there is no theoretical justification for this. (Efron notes this too.) Is it true? Or is there something else going on?

Neural networks as black boxes:

For that matter, what makes NNs work anyway? Efron describes them as “essentially elaborate logistic regression programs.” With a sigmoid activation function, and with the understanding that logits are composed together rather than, say added, that is true. However, in our work we show something more basic: NNs are essentially performing good old-fashioned polynomial regression. This has practical implications, e.g. that multicollinearity among the layer outputs increases from layer to layer. (We also have a corresponding R package, polyreg.)

“Rectangular” data:

In recent years, it has been found that for most applications, NNs will not do especially well, unlike their succeses with image classification for instance. Out of this has come Conventional Wisdom statements like, “For rectangular data, you may do better with non-NN methods.”

But upon closer inspection, one sees this particular “emperor” to be especially unclothed. Consider the famous MNIST data, consisting of 65,000 28×28 images. One can store this is a 65000 x 784 matrix (2ix28 = 784). In other words, n = 65000 and p = 784. Isn’t that rectangular? One prominent MLer uses the term perceptive data instead. That is not a satisfying explanation at all, nor, as noted, is the “rectangular” one.

So, are image and natural language data, putative areas of excellence for NNs, really different from all other areas? This seems unlikely. Instead, the likely explanation is that enormous efforts were expended in these areas, and NNs happened to be the tool that was used. Indeed, some researchers have taken the outputs of convolutional layers (which are just classic image operations), and run them through SVMs, obtaining results equal to or better than NNs.


So, what do we have? This discussion brings to mind an old paper, delightfully titled “What’s Not What What in Statistics.” There is so much more than could be added today.

New Book on Machine Learning

I’m nearing completion of writing my new book, The Art of Machine Learning: Algorithms+Data+R, to be published by the whimsically named No Starch Press. I’m making a rough, partial draft available, and welcome corrections, suggestions and comments.

I’ve been considering doing such a project for some time, intending to write a book that would on the one hand serve as “machine learning for the masses” while ardently avoiding being of a “cookbook” nature. In other words, the book has two goals:

  • The math content is kept to a minimum. Readers need only be able to understand scatter plots and the like, and know the concept of the slope of a line. (For readers who wish to delve into the math, a Math Companion document will be available.)
  • There is strong emphasis on building a solid intuitive understanding of the methods, empowering the reader to conduct effective, penetrating ML analysis.

As I write in the preface (“To the Reader”),

“Those dazzling ML successes you’ve heard about come only after careful, lengthy tuning and thought on the analyst’s part, requiring real insight. This book aims to develop that insight.”

The language of instruction is R, using standard CRAN packages. But as I also write,

“…this is a book on ML, not a book on using R in ML. True, R plays a major supporting role and we use prominent R packages for ML throughout the book, with code on almost every page. But in order to be able to use ML well, the reader should focus on the structure and interpretation of the ML models themselves; R is just a tool toward that end.”

So, take a look, and let me know what you think!

FOCI: a new method for feature selection

It’s been a while since I’ve posted here, not for lack of material but just having too long a TO DO list. I’ve added some important features to my regtools package, for instance, and hope to discuss them here soon.

For the present post, though, I will discuss FOCI, a highly promising new approach to feature selection, developed by Mona Azadkia and Sourav Chatterjee of the Stanford Stat Dept. There is now an R package of the same name on CRAN.

I am one of the authors of the package, but was not involved in developing the method or the associated theory. My connection here was accidental. A few months ago, Sourav gave a talk on FOCI at Stanford. I must admit that I had been skeptical when I read the abstract. I had seen so many proposals on this topic, and had considered them impractical, full of unverifiable assumptions and rather opaque indications regarding asymptotic behavior. So, I almost didn’t attend, even though I was on the Stanford campus that day.

But I did attend, and was won over. The method has only minimal assumptions, explicit asymptotic behavior, and lo and behold, no tuning parameters.

I tried FOCI on some favorite datasets that evening, and was pleased with the results, but the algorithm was slow even on moderate-sized datasets. So I wrote to Sourav, praising the method but suggesting that they parallelize the software. He then asked if I might get involved in that, which I did.

You can find Mona and Sourav’s arXiv paper at https://arxiv.org/pdf/1910.12327.pdf, which extends earlier work by Sourav. The theory is highly technical, I’d say very intricate even by math stat standards, thus not for the faint of heart. Actually, it was amusing during Sourav’s talk when one of the faculty said, “Wait, don’t leave that slide yet!”, as the main estimator (T_n, p.3) shown on that slide, is a challenge to interpret.

The population-level quantity, Eqn. (2.1), is more intuitive. Say we are predicting Y from a vector-valued X, and are considering extending the feature set to (X,Z). The expression is still a bit complex — and please direct your queries to Mona and Sourav for details, not to me 🙂 — but at its core it is comparing Var(Y | X,Z) to Var(Y | X). How much smaller is the former than the latter? Of course, there are expectations applied etc., but that is the intuitive heart of the matter.

So here is a concrete example, the African Soil remote sensor dataset from Kaggle. After removing ID, changing a character column to R factor then to dummies, I had a data frame with n = 1157 and p = 3578. So p > n, with p being almost triple the size of n, a nice test for FOCI. For Y, there are various outcome variables; I chose pH.

The actual call is simple:

fout <- foci(afrsoil[,3597],afrsoil[,1:3578])

This specifies the outcome variable and features. There are some other options available. The main part of the output is the list of selected variables, in order of importance. After that, the cumulative FOCI correlations are shown:

> fout$selectedVar
    index    names
 1:  3024 m1668.14
 2:  3078    m1564
 3:  2413 m2846.45
 4:  3173  m1380.8
 5:  3267 m1199.52
 6:  1542 m4526.16
 7:  2546 m2589.96
 8:  3537 m678.828
 9:  3469 m809.965
10:  2624 m2439.54
11:  1874  m3885.9
12:  2770 m2157.98
> fout$stepT[1:12]
 [1] 0.1592940 0.3273176 0.4201255 0.5271957 0.5642336 0.5917753 0.6040584
 [8] 0.6132360 0.6139810 0.6238901 0.6300294 0.6303839

Of course, those feature names, e.g. ‘m1564’, are meaningful only to a soil scientist, but you get the idea. Out of 3578 candidate features, FOCI chose 12, attaining very strong dimension reduction. Actually, the first 8 or so form the bulk of the selected feature set, with the remainder yielding only marginal improvements.

But is it a GOOD feature set? We don’t know in this case, but Mona and Sourav have a very impressive example in the paper, in which FOCI outperforms random forests, LASSO, SCAD and so on. That, combined with the intuitive simplicity of the estimator etc., makes FOCI a very promising addition to anyone’s statistical toolkit. I’m sure more empirical comparisons are forthcoming.

And what about the computation speed? Two types of parallelism are available, with best results generally coming from using both in conjunction: 1. At any step, the testing of the remaining candidate features can be done simultaneously, on multiple threads on the host machine. 2. On a cluster or in the cloud, one can ask that the data be partitioned in chunks of rows, with each cluster node applying FOCI to its chunk. The union of the results is then formed, and fed through FOCI one more time to adjust the discrepancies. The idea is that that last step will not be too lengthy, as the number of candidate variables has already been reduced. A cluster size of r may actually produce a speedup factor of more than r (Matloff 2016).

Approach (2) also has the advantage that it gives the
user an idea of the degree of sampling variability in the FOCI
results.

So there you have it. Give FOCI a try!

R Journal July Issue

As the current Editor-in-Chief of the R Journal, I must apologize for the delay in getting the July issue online, due to technical and other matters. In the meantime, though, please take a look at the many interesting articles slated for publication in this and upcoming issues.

Various improvements in technical documentation, as well as the pending hire of the journal’s first-ever editorial assistant, should shorten the review and publication processes in the future.

By the way, I’ve made a couple of tweaks to the Instructions for Authors. First, I note that the journal’s production software really does require following the instructions carefully. For instance, the \author field in your .tex file must start with “by” in order to work properly; it’s not merely a matter of, say, aesthetics. And your submission package must not have more than one .bib file or more than one .tex file other than RJwrapper.tex.

Finally, a reminder to those whom we ask to review article submissions: Your service would be greatly appreciated, a valuable contribution to R. If you must decline, though, please respond to our e-mail query, stating so, so that we may quickly search for other possible reviewers.

Prob/Stat for Data Sci: Math + R + Data

My new book, Probability and Statistics for Data Science: Math + R + Data, pub. by the CRC Press, was released on June 24!

This book arose from an open-source text I wrote and have been teaching from. The open source version will still be available, though rather different from the published one.

This is a math stat book, but different from all others, as the subtitle states: Math + R + Data. Even the topic ordering is rather different, with a goal of bringing in data science-relevant material as early as possible.

I’ve placed an excerpt from the book at http://tinyurl.com/y6hf7x66. I believe it epitomizes the intent and style of the book. Also, I’ve placed the front matter at http://tinyurl.com/yy9hx6db 

Musings, useful code etc. on R and data science