The Notion of “Double Descent”

I tend to be blase’ about breathless claims of “new” methods and concepts in statistics and machine learning. Most are “variations on a theme.” However, the notion of double descent, which has come into prominence in the last few years, is something I regard as genuinely new and very relevant, shedding important light on the central issue of model overfitting.

In this post, I’ll explain the idea, and illustrate it using code from my regtools R package.  (See also a related discussion, with a spline example by Daniela Witten.)

The idea itself is not new, but is taking special importance these days, in its possibly shedding light on deep learning. It seems to suggest an answer to the question, “Why do DL networks often work well, in spite of being drastically overparameterized?”

Classical statistical thinking, including in my own books, is that the graph of loss L e.g. misclassification rate, on new data against model complexity C should be U-shaped. As C first moves away from 0, bias is greatly reduced while variance increases only slightly. The curve is in descent. But once C passes a minimum point, the curve will go back up, as variance eventually overwhelms bias.

(Technical note: Bias here is the expected value of the difference between the predicted values of the richer and coarser models, under the richer one, and taken over the distribution of the predictors.)

As C increases, we will eventually reach the point of saturation (nowadays termed interpolation), in which we have 0 training error. A common example is linear regression with a polynomial model in a single predictor variable. Here C is the degree of the polynomial. If we have n = 100 data points, a 99th-degree polynomial will fit the training data exactly — but will be terrible in predicting new data.

But what if we dare to fit a polynomial of higher degree than 99 anyway, i.e. what if we deliberately overfit? The problem now becomes indeterminate — there is now no unique solution to the problem of minimizing the error sum of squares. There are in fact infinitely many solutions. But actually, that can be a virtue; among those infinitely many solutions, we may be able to choose one that is really good.

“Good” would of course mean that it is able to predict new cases well. How might we find such a solution?

I’ll continue with the linear regression example, though not assume a polynomial model. First, some notation. Let β denote our population coefficient vector, and b its estimate. Let ||b|| denote the vector norm of b. Let p denote the number of predictor variables. Again, if we have p predictor variables, we will get a perfect fit in the training set if p = n-1.

If you are familiar with the LASSO, you know that we may be able to do better than computing b to be the OLS estimate; a shrunken version of OLS may do better. Well, which shrunken version? How about the minimum-norm solution?

Before the interpolation point, our unique OLS solution is the famous

b = (X’X)-1 X’Y

This can also be obtained as b = X Y where X is a generalized inverse of X. It’s the same as the classic formula before interpolation, but it can be used after the interpolation point as well. And the key point is then that one implementation of this, the Moore-Penrose inverse, gives the minimum-norm solution.

This minimum-norm property reduces variance. So, not only will MP allow us to go past the interpolation point, it will also reduce variance, possibly causing the L-C curve to descend a second time! We now have a double U-shape (there could be more).

And if we’re lucky, in this second U-shape, we may reach a lower minimum than we had in the original one. If so, it will have paid to overfit!

There some subtleties here. The minimum norm means, minimum among all solutions for fixed p. As p increases, the space over which we are minimizing grows richer, thus more opportunities for minimizing the norm. On the other hand, the variance of that minimum norm solution is increasing with p. Meanwhile the bias is presumably staying constant, since all solutions have a training set error of 0. So the dynamics underlying this second U would seem to be different from those of the first.

Now, what about the implication for deep learning? DL computation typically uses stochastic gradient descent, and there has been theoretical and empirical work indicating that SGD tends to settle on a minimum-norm solution. This may explain the examples of success of DL in spite of overfitting.

Empirical illustration:

Here I’ll work with the Million Song dataset. It consists of 90 audio measurements made on about 500,000 songs (not a million) from 1922 to 2011. The goal is to predict the year of release from the audio.

I took the first p predictors, varying p from 2 to 30, and fit a quadratic model, with O(p2) predictors resulting.

One of the newer features of regtools is its qe*-series (“Quick and Easy”) of functions. They are extremely simple to use, all having the call format qeX(d,’yname’), to predict the specified Y in the data frame d. Again, the emphasis is on simplicity; the above call is all one needs, so for example there is no preliminary code for defining a model. They are all wrappers for standard R package functiona, and are paired with predict() and plot() wrappers.

Currently there are 9 different machine learning algorithms available, e.g qeSVM() and qeRF(), for SVM and random forests. Here we will use qePoly(), which wraps our polyreg package. Note by the way that the latter correctly handles dummy variables (e.g. no powers of a dummy are formed). Note too that qePoly() computes the Moore-Penrose inverse if we are past interpolation, using the function ginv() from the MASS package.

Again to make this experiment on a meaningful scale, I generated random training sets of size n = 250, and took the rest of the data as the test set. I used from 2 to 30 predictor variables, and used Mean Absolute Prediction Error as my accuracy criterion. Here are the results,

Well, sure enough, there it is, the second U. The interpolation point is between 22 and 23 predictors (there is no “in-between” configuration), where there are 265 parameters, overfitting our n = 250.

Alas, the minimum in the second U is not lower than that of the first, so overfitting hasn’t paid off here. But it does illustrate the concept of double-descent.

Code:

 

overfit <-  function(nreps,n,maxP)
 {
    load('YearData.save') 
    nas <- rep(NA,nreps*(maxP-1))
    outdf <- data.frame(p=nas,mape=nas)
    rownum <- 0
    for (i in 1:nreps) {
       idxs <- sample(1:nrow(yr),n)
       trn <- yr[idxs,] 
       tst <- yr[-idxs,]           
       for (p in 2:maxP) {
          rownum <- rownum + 1
          out<-qePoly(trn[,1:p+1)],
'V1',2,holdout=NULL) preds <- predict(out,tst[,-1]) mape <- mean(abs(preds - tst[,1])) outdf[rownum,1] <- p outdf[rownum,2] <- mape print(outdf[rownum,]) } } outdf #run through tapply() for the graph }

2 thoughts on “The Notion of “Double Descent””

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.