Base-R Is Alive and Well

As many readers of this blog know, I strongly believe that R learners should be taught base-R, not the tidyverse. Eventually the students may settle on using a mix of the two paradigms, but at the learning stage they will benefit from the fact that base-R is simpler and more powerful. I’ve laid out my reasoning in a detailed essay.

One of the most powerful tools in base-R is tapply(), a real workhorse. My essay gives several examples in which it is much simpler and easier to use that function than the tidyverse equivalents.
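
For readers unfamiliar with the function, a minimal example using the built-in mtcars data frame:

```r
# Mean miles-per-gallon, grouped by number of cylinders,
# in a single call -- no group_by()/summarize() pipeline needed.
mpgByCyl <- tapply(mtcars$mpg, mtcars$cyl, mean)
mpgByCyl  # a named vector, one mean per cylinder count
```

The tidyverse version of this requires a group_by() call piped into a summarize() call.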

Yet somehow there is a disdain for tapply() among many who use and teach Tidy. To them, the function is the epitome of “what’s wrong with” base-R. The latest example of this attitude arose on Twitter a few days ago, in which two Tidy supporters were mocking tapply(), treating it as a highly niche function with no value in ordinary daily usage of R. They strongly disagreed with my “workhorse” claim, until I showed them that in the code of ggplot2, Hadley has 7 calls to tapply().

So I did a little investigation of well-known R packages by RStudio and others. The results, which I’ve added as a new section in my essay, are excerpted below.


All the breathless claims that Tidy is more modern and clearer, while base-R is old-fashioned and unclear, fly in the face of the fact that RStudio developers, and authors of other prominent R packages, tend to write in base-R, not Tidy. And all of them use some base-R instead of the corresponding Tidy constructs.

package      *apply() calls   mutate() calls
brms                    333                0
broom                    38               58
datapasta                31                0
forecast                 82                0
future                   71                0
ggplot2                  78                0
glmnet                   92                0
gt                      112               87
knitr                    73                0
naniar                    3               44
parsnip                  45               33
purrr                    10                0
rmarkdown                 0                0
RSQLite                  14                0
tensorflow               32                0
tidymodels                8                0
tidytext                  5                6
tsibble                   8               19
VIM                     117               19

These are striking numbers for anyone who learned R via a tidyverse course. In particular, mutate() is one of the very first verbs one learns in a Tidy course, yet mutate() is used 0 times in most of the above packages. And even the packages in which this function is called a lot also have plenty of calls to base-R *apply(), functions which Tidy is supposed to replace.
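
Counts like these can be obtained with simple text searches of a package’s source code. Here is a rough sketch; the demo directory below is a throwaway stand-in, and a real run would point at a package’s downloaded source tree.

```r
# Count occurrences of a call pattern in a directory of R source files.
countCalls <- function(dir, pattern) {
   fls <- list.files(dir, pattern = "\\.R$", full.names = TRUE,
                     recursive = TRUE)
   total <- 0
   for (f in fls) {
      code <- readLines(f, warn = FALSE)
      m <- gregexpr(pattern, code)  # match positions per line; -1 if none
      total <- total + sum(sapply(m, function(g) sum(g > 0)))
   }
   total
}

# Demo on a throwaway directory; for real counts, point 'dir' at
# a package's R/ directory.
d <- tempfile()
dir.create(d)
writeLines(c("y <- sapply(x, f)", "z <- lapply(x, g)"),
           file.path(d, "demo.R"))
countCalls(d, "apply\\(")  # counts both the sapply() and lapply() calls
```

A cruder count could of course also be done with grep at the shell level.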

Now, why do these prominent R developers often use base-R, rather than the allegedly “modern and clearer” Tidy? Because base-R is easier.

And if it’s easier for them, it’s even easier for R learners. In fact, an article discussed later in this essay, aggressively promoting Tidy, actually accuses students who use base-R instead of Tidy of taking the easy way out. Easier, indeed!

Comments on the New R OOP System, R7

Object-Oriented Programming (OOP) is more than just a programming style; it’s a philosophy. R has offered various forms of OOP, starting with S3, then (among others) S4, reference classes, and R6, and now R7. The latter has been under development by a team broadly drawn from the R community leadership, not only the “directors” of R development, the R Core Team, but also the prominent R services firm RStudio and so on.

I’ll start this report with a summary, followed by details (definition of OOP, my “safety” concerns etc.). The reader need not have an OOP background for this material; an overview will be given here (though I dare say some readers who have this background may learn something too).

This will not be a tutorial on how to use R7, nor an evaluation of its specific features. Instead, I’ll first discuss the goals of the S3 and S4 OOP systems, which R7 replaces, especially in terms of whether OOP is the best way to meet those goals. These comments then apply to R7 as well.

Summary

Simply put, R7 does a very nice job of implementing something I’ve never liked very much. I do like two of the main OOP tenets, encapsulation and polymorphism, but S3 offers those and is good enough for me. And though I agree in principle with another OOP selling point, “safety,” I fear that it often results in a net LOSS of safety. R7 does a good job of combining S3 and S4 (3 + 4 = 7), but my concerns about complexity and a net loss of safety regarding S4 remain in full.


The first OOP language in wide use was C++, an extension of C that was originally called C with Classes. One of the most widely used languages designed to be OOP “from the ground up” is Python. R’s OOP offerings have been limited.

Encapsulation

This simply means organizing several related variables into one convenient package. R’s list structure has always done that. S3 classes then tack on a class name as attribute.

Polymorphism

The term here, meaning “many forms,” simply means that the same function will take different actions when it is applied to different kinds of objects.

For example, consider a sorting operation. We would like this function to do a numeric sort if it is applied to a vector of (real) numbers, but do an alphabetical sort on character vectors. Meanwhile, we would like to use the same function name, say ‘sort’.

S3 accomplishes this via generic functions. Even beginning R users have likely made calls to generic functions without knowing it. For instance, consider the (seemingly) ordinary plot() function. Say we call this function on a vector x; a graph of x will be displayed. But if we call lm() on some data, then call plot() on the output lmout, R will display some graphs depicting that output:

mtc <- mtcars
plot(mtc$mpg) # plots mpg[i] against i
lmout <- lm(mpg ~ .,data=mtc)
plot(lmout)  # plots several graphs, e.g. residuals

The “magic” behind this is dispatch. The R interpreter will route a nominal call to plot() to a class-specific function. In the lm() example, for instance, lm() returns an S3 object of class ‘lm’, so the call plot(lmout) will actually be passed on to another function, plot.lm().

Other well-known generics are print(), summary(), predict() and coef().
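
Writing one’s own generic is equally simple. Here is a sketch; describe() and its methods are names invented for illustration.

```r
# Define a generic plus class-specific methods; dispatch is automatic,
# mirroring how plot() routes to plot.lm().
describe <- function(obj) UseMethod("describe")
describe.default <- function(obj) "some object"
describe.lm <- function(obj) "a fitted linear model"

lmout <- lm(mpg ~ ., data = mtcars)
describe(lmout)  # dispatches to describe.lm()
describe(1:5)    # falls back to describe.default()
```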

Note that the fact that R and Python are dynamically typed made polymorphism easy to implement. C++, on the other hand, is statically typed, and the programmer will likely need to use templates, which can be quite painful.

By the way, I always tell beginning and intermediate R users that a good way to learn about functions written by others (including in R itself) is to run the function through R’s debug() function. In our case here, they may find it instructive to run debug(plot) and then plot(lmout) to see dispatch up close.

Inheritance

Say the domain is pets. We might have dogs named Norm, Frank and Hadley, cats named JJ, Joe, Yihui and Susan, and more anticipated in the future.

To keep track of them, we might construct a class ‘pets’, with fields for name and birthdate. But we could then also construct subclasses ‘dogs’ and ‘cats’. Each subclass would have all the fields of the top class, plus others specific to dogs or cats. We might then also construct a sub-subclass, ‘gender.’
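
In S3, inheritance amounts to a vector of class names, with dispatch trying each in turn. A sketch, with all names invented for illustration:

```r
# An S3 object whose class vector says 'dogs' inherits from 'pets'.
norm <- list(name = "Norm", birthdate = "2015-03-01", barks = TRUE)
class(norm) <- c("dogs", "pets")

speak <- function(x) UseMethod("speak")
speak.pets <- function(x) paste(x$name, "makes a sound")
speak.dogs <- function(x) paste(x$name, "barks")

speak(norm)             # dispatch finds speak.dogs() first
inherits(norm, "pets")  # TRUE; the subclass relation is just the vector
```

Had there been no speak.dogs() method, dispatch would have fallen through to speak.pets().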

Safety

Say you have a generic function defined for the class, with two numeric arguments, returning TRUE if the first is less than the second:

f <- function(x,y) x < y

But you accidentally call the function with two character strings as arguments. This should produce an error, but won’t; R will silently do a lexicographic comparison.

In a mission-critical setting, this could be costly. If the app processes incoming sales orders, say, there would be downtime while restarting the app, possibly lost orders, etc.

If you are worried about this, you could add error-checking code, e.g.

f <- function(x,y) {
   if (!is.numeric(x) || !is.numeric(y))
      stop('non-numeric arguments')
   x < y
}
More sophisticated OOP systems such as S4 can catch such errors for you. There is no free lunch, though–the machinery to even set up your function becomes more complex and then you still have to tell S4 that x and y above must be numeric–but arguably S4 is cleaner-looking than having a stop() call etc.
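
Here is a sketch of how S4 would catch the error at dispatch time, assuming we are willing to set up the machinery:

```r
library(methods)  # S4 machinery

# Declare the generic, then a method restricted to numeric arguments.
setGeneric("f", function(x, y) standardGeneric("f"))
setMethod("f", signature(x = "numeric", y = "numeric"),
          function(x, y) x < y)

f(3, 5)        # TRUE, as before
# f("a", "b")  # now an error: no method for character arguments
```

Note the tradeoff mentioned above: the type check comes for free at each call, but only after the extra setGeneric()/setMethod() setup.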

Consider another type of calamity: As noted, S3 objects are R lists. Say one of the list elements has the name partNumber, but later, in assigning a new value to that element, you misspell it as partnumber:

myS3object$partnumber <- newValue  # oops; silently creates a new element

Here we would seem to have no function within which to check for misspelling etc. Thus S4 or some other “safe” OOP system would seem to be a must–unless we create functions to read or write elements of our object. And it turns out that that is exactly what OOP people advocate anyway (e.g. even in S4 etc.), in the form of getters and setters.

In the above example, for instance, say we have a class ‘Orders’, one of whose fields is partNumber. In S3, the getter in the above example would be named partNumber, and for a particular sales order thisOrder, one would fetch the part number via


pn <- partNumber(thisOrder)

rather than the more direct way of accessing an R list:

pn <- thisOrder$partNumber

The reader may think it’s silly to write special functions for list read and write, and many would agree. But the OOP philosophy is that we don’t touch objects directly, and instead have functions to act as intermediaries. At any rate, we could place our error-checking code in the getters and setters. (Although there still would be no way under S3 to prevent direct access.)
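
In S3, a getter/setter pair with our error checks folded in might look like this sketch, using invented names:

```r
# Getter: just reads the list element.
partNumber <- function(ord) ord$partNumber

# Setter: an S3 replacement function, with validation built in.
`partNumber<-` <- function(ord, value) {
   if (!is.character(value) || length(value) != 1)
      stop("part number must be a single string")
   ord$partNumber <- value
   ord
}

thisOrder <- list(partNumber = "A-100", qty = 3)
pn <- partNumber(thisOrder)       # getter
partNumber(thisOrder) <- "B-207"  # setter; a typo here would error out
```

A misspelled field name can no longer slip through the setter, though as noted, S3 cannot stop someone from bypassing it with ‘$’.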


I use OOP rather sparingly in R: S3 in my own code, and S4, reference classes or R6 when needed for a package that I obtain from CRAN (e.g. EBImage for S4). In Python, I use OOP again for library access, e.g. threading, and to some degree just for fun, as I like Python’s class structure.

But mostly, I have never been a fan of OOP. In particular, I never have been impressed by the “safety” argument. Here’s why:

Safety vs. Complexity

Of course, OOP does not do anything to prevent code logic errors, which are far more prevalent than, say, misspellings. And, most important:

  • There is a direct relation between safety and code complexity.
  • There is a direct relation between code logic errors and code complexity.

One of my favorite R people is John Chambers, “Father of the S Language” and thus the “Grandfather of R.” In his book, Software for Data Analysis, p.335, he warns that “Defining [an S4] class is a more serious piece of programming …than in previous chapters…[even though] the number of lines is not large…” He warns that things are even more difficult for the user of a class than it was for the author in designing it, with “advance contemplation” of what problems users may encounter. And, “You may want to try several different versions [of the class] before committing to one.”

In other words, safety in terms of misspellings etc. comes at possibly major expense in logic errors. There is no avoiding this.

There Are Other Ways to Achieve Safety:

As noted above, we do have alternatives to OOP in this regard, in the form of inserting our own error-checking code. (Note too that error-checking may be important in the middle of your code, using stopifnot().) Indeed, this can be superior to using OOP, as one has much more flexibility, allowing for more sophisticated checks.
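
For example (computeRatio() is a name invented for illustration):

```r
# Error checks need not live in a class definition; plain assertions
# can go wherever they matter, with arbitrarily rich conditions.
computeRatio <- function(x, y) {
   stopifnot(is.numeric(x), is.numeric(y), all(y != 0))
   x / y
}

computeRatio(10, 2)       # fine
# computeRatio("a", 2)    # error, caught by stopifnot()
```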

Why the Push for R7 Now?

Very few of the most prominent developers of R packages use S4 as of now. One must conclude that there is no general urgency for safety, and/or that authors find safety is more easily and effectively achieved through alternative means, per the above discussion.

As to encapsulation and inheritance, S3 already does a reasonably good job there. Why, then, push for R7?

The impetus seems to be a desire to modernize/professionalize R, moving closer to status as a general-purpose language. Arguably, OOP had been a weak point of R in that sense, and now R can hold its head high in the community of languages.

That’s great, but as usual I am concerned about the impact on the teaching of R to learners without prior programming experience. I’ve been a major critic of the tidyverse in that regard, as Tidy emphasizes “modern” functional programming/loop avoidance to students who barely know what a function is. Will R beginners be taught R7? That would be a big mistake, and I hope those who tend to be enthralled with The New, New Thing resist such a temptation.

Me, well as mentioned, I’m not much of an OOP fan, and don’t anticipate using R7. But the development team has done a bang-up job in creating R7, and for those who feel the need for a strong OOP paradigm, I strongly recommend it.

A Major Contribution to Learning R

Prominent statistician Frank Harrell has come out with a radically new R tutorial, rflow. The name is short for “R workflow,” but I call it “R in a box”: everything one needs for beginning serious usage of R, starting from little or no background.

By serious usage I mean real applications in which the user has a substantial computational need. This could be a grad student researcher, a person who needs to write data reports for her job, or simply a person who is doing personal analysis such as stock picking.

Like other tutorials/books, rflow covers data manipulation, generation of tables and graphics, etc. But UNLIKE many others, rflow empowers the user to handle general issues as they inevitably pop up, as opposed to just teaching a few basic, largely ungeneralizable operations. I’ve criticized the tidyverse in particular for that latter problem, but really no tutorial, including my own, has this key “R in a box” quality.

The tutorial is arranged into 19 short “chapters,” beginning with R Basics, all the way through such advanced topics as Manipulating Longitudinal Data and Parallel Computing. The exciting new Quarto presentation tool by RStudio is featured, as is the data.table package, essential for practical use of large datasets.

Note carefully that this tutorial is the product of Frank’s long experience “in the trenches,” conducting intensive data analysis in biomedical applications. (This specific field of application is irrelevant; rflow is just as useful to, say, marketing analysts, as it is for medicine.) His famous monograph, Regression Modeling Strategies, is a standard reference in the field. Even I, as the author of my own regression book, often find myself checking out what Frank has to say in his book about various topics.

This point about rflow arising from Frank’s long experience dealing with real data is absolutely key, in my view. And his choice of topics, and especially their ordering, reflects that. For instance, he brings in the topic of missing data early in the tutorial.

Anyone who teaches R, or is learning R, should check out rflow.

Greatly Revised Edition of Tidyverse Skeptic

As a longtime R user and someone with a passionate interest in how people learn, I continue to be greatly concerned about the use of the Tidyverse in teaching noncoder learners of R. Accordingly, I have now thoroughly revised my Tidyverse Skeptic essay. It is greatly reorganized with a focus on teaching R, with a number of new examples, and some material on the historical context of the rise of Tidy. On the one hand, I continue to thank RStudio for its overall contribution to the R community; on the other, I continue to believe that using Tidy to teach beginners is actually an obstacle to learning for that group.

I close the essay by first noting that RStudio is now a Public Interest Corporation, thus with much broader public responsibility. I then renew a request I made to RStudio founder/CEO JJ Allaire when he met with me in 2019: “Please encourage R instructors to use a mixture of Tidy and base-R in their teaching.”

Please read the revised essay at the above link. Its Overview section is reproduced below.

  • Again, my focus here is on teaching R to those with little or no coding background. I am not discussing teaching Computer Science students.
  • Tidy was consciously designed to equip learners with just a small set of R tools. The students learn a few dplyr verbs well, but that leaves them able to do much less with R than “graduates” of standard base-R courses, and thus less equipped to put R to real use.
  • Thus the “testimonials” in which Tidy teachers of R claim great success are misleading. The “success” is due to watering down the material (and false conflation with ggplot2). The students learn to mimic a few example patterns, but are not equipped to go further.
  • The refusal to teach ‘$’, and the de-emphasis of, or even complete lack of coverage of, R vectors, is a major handicap for Tidy “graduates” in making use of most of R’s statistical functions and statistical packages.
  • Tidy is too abstract for beginners, due to the philosophy of functional programming (FP). The latter is popular with many sophisticated computer scientists, but is difficult even for computer science students. Tidy is thus unsuited as the initial basis of instruction for nonprogrammer students of R. FP should be limited and brought in gradually. The same statement applies to base-R’s own FP functions.
  • The FP philosophy replaces straightforward loops with abstract use of functions. Since functions are the most difficult aspect for noncoder R learners, FP is clearly not the right path for such learners. Indeed, even many Tidy advocates concede that it is in various senses often more difficult to write Tidy code than base-R. Hadley says, for instance, “it may take a while to wrap your head around [FP].”
  • A major problem with Tidy for R beginners is cognitive overload: The basic operations contain myriad variants. Though of course one need not learn them all, one needs some variants even for simple operations, e.g. pipes on functions of more than one argument.
  • The obsession among many Tidyers that one must avoid writing loops, the ‘$’ operator, brackets and so on often results in obfuscated code. Once one goes beyond the simple mutate/select/filter/summarize level, Tidy programming can be of low readability.
  • Tidy advocates also concede that debugging Tidy code is difficult, especially in the case of pipes. Yet noncoder learners are the ones who make the most mistakes, so it makes no sense to have them use a coding style that makes it difficult to track down their errors.
  • Note once again, that in discussing teaching, I am taking the target audience here to be nonprogrammers who wish to use R for data analysis. Eventually, they may wish to make use of FP, but at the crucial beginning stage, keep it simple, little or no fancy stuff.

Issues in Differential Privacy

Differential privacy (DP) is an approach to maintaining the privacy of individual records in a database, while still allowing  statistical analysis. It is now perceived as the go-to method in the data privacy area, enjoying adoption by the US Census Bureau and several major firms in industry, as well as a highly visible media presence. DP has developed a vast research literature. On the other hand, it is also the subject of controversy, and now, of lawsuits.

Some preparatory remarks:

I’ve been doing research in the data privacy area off and on for many years, e.g. IEEE Symposium on Security and Privacy; ACM Trans. on Database Systems; several book chapters; and current work in progress, arXiv. I was an appointed member of the IFIP Working Group 11.3 on Database Security in the 1990s. The ACM TODS paper was funded in part by the Census Bureau.

Notation: the database consists of n records on p variables.

The R package diffpriv makes standard DP methods easy to apply, and is recommended for any reader who wishes to investigate these issues further.

Here is a classical example of the privacy issue. Say we have an employee database, and an intruder knows that there is just one female electrical engineer in the firm. The intruder submits a query on the mean salary of all female electrical engineers, and thus illicitly obtains her salary.

What is DP?

Technically DP is just a criterion, not a method, but the term generally is taken to mean methods whose derivation is motivated by that criterion.

DP is actually based on a very old and widely-used approach to data privacy, random noise perturbation. It’s quite simple. Say we have a database that includes a salary variable, which is considered confidential. We add random mean-0 noise to hide a person’s real salary from intruders. DP differs from other noise-based methods, though, in that it claims a quantitative measure of privacy.

Why add noise?

The motivation is that, since the added noise has mean 0, researchers doing legitimate statistical analysis can still do their work. They work with averages, and the average salary in the database in noisy form should be pretty close to that of the original data, with the noise mostly “canceling out.” (We will see below, though, that this view is overly simple, whether in DP or classical noise-based methods.)
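
A quick simulation illustrates the cancellation; all numbers here are made up for illustration.

```r
set.seed(9999)
# 1000 hypothetical salaries, plus mean-0 noise for privacy.
salaries <- rnorm(1000, mean = 60000, sd = 10000)
noisy <- salaries + rnorm(1000, mean = 0, sd = 5000)

mean(salaries)
mean(noisy)  # close to the true mean; the noise mostly cancels
```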

In DP methods, the noise is typically added to the final statistic, e.g. to a mean of interest, rather than directly to the variables.

One issue is whether to add a different noise value each time a query arrives at the data server, vs. adding noise just once and then making that perturbed data open to public use. DP methods tend to do the former, while classically the latter approach is used.


A related problem is that for most DP settings, a separate DP version needs to be developed for every statistical method. If a user wants, say, to perform quantile regression, she must check whether a DP version has been developed and code made available for it. With classical privacy methods, once the dataset has been perturbed, users can apply any statistical method they wish. I liken it to an amusement park. Classical methods give one a “day pass” which allows one to enjoy any ride; DP requires a separate ticket for each ride.

Privacy/Accuracy Tradeoffs:

With any data privacy method, DP or classical, there is no perfect solution. One can only choose a “dial setting” in a range of tradeoffs. The latter come in two main types:

  • There is a tradeoff between protecting individual privacy on the one hand, and preserving the statistical accuracy for researchers. The larger the variance of added noise, the greater the privacy but the larger the standard errors in statistical quantities computed from the perturbed data.
  • Equally important, though rarely mentioned, there is the problem of attenuation of relationships between the variables. This is the core of most types of data analysis, finding and quantifying relationships; yet the more noise we add to the data, the weaker the reported relationships will be. This problem arises in classical noise addition, and occurs in some DP methods, such as ones that add noise to counts in contingency tables. So here we have not only a variance problem but also a bias problem; the absolute values of correlations, regression coefficients and so on are biased downward. A partial solution is to set the noise correlation structure equal to that of the data, but that doesn’t apply to categorical variables (where the noise addition approach doesn’t make much sense anyway).
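
The attenuation effect in the second point is easy to see in a small simulation; all numbers here are invented for illustration.

```r
set.seed(2022)
# Two related variables, then independent noise added to each.
x <- rnorm(10000)
y <- x + rnorm(10000)           # true correlation about 0.71
xn <- x + rnorm(10000, sd = 2)  # heavily perturbed versions
yn <- y + rnorm(10000, sd = 2)

cor(x, y)
cor(xn, yn)  # attenuated: much smaller in magnitude
```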

Other classical statistical disclosure control  methods:

Two other major data privacy methods should be mentioned here.

  1.  Cell suppression: Any query whose conditions are satisfied by just one record in the database is disallowed. In the example of the female EE above, for instance, that intruder’s query simply would not be answered. One problem with this approach is that it is vulnerable to set-differencing attacks. The intruder could query the total salaries of all EEs, then query the male EEs, and then subtract to illicitly obtain the female EE’s salary. Elaborate methods have been developed to counter such attacks.
  2. Data swapping: For a certain subset of the data — either randomly chosen, or chosen according to a record’s vulnerability to attack — some of the data for one record is swapped with that of a similar record. In the female EE example, we might swap occupation or salaries, say.

Note that neither of these methods avoids the problem of privacy/accuracy tradeoffs. In cell suppression, the more suppression we impose, the greater the problems of variance and bias in stat analyses. Data swapping essentially adds noise, again causing variance and bias.

The DP privacy criterion:

Since DP adds random noise, the DP criterion is couched in probabilistic terms. Consider two datasets, D and D’, with the same variables and the same number of records n, but differing in 1 record. Consider a given query Q. Denote the responses by Q(D) and Q(D’). Then the DP criterion is, for any set S in the image of Q,

P(Q(D) in S) ≤ exp(ε) P(Q(D’) in S)

for all possible (D,D’) pairs and for a small tuning parameter ε. The smaller ε, the greater the privacy.

Note that the definition involves all possible (D,D’) pairs; D here is NOT just the real database at hand (though there is a concept of local sensitivity in which D is indeed our actual database). On the other hand, in processing a query, we ARE using the database at hand, and we compute the noise level for the query based on n for this D.

DP-compliant methods have been developed for various statistical quantities, producing formulas for the noise variance as a function of ε and an anticipated upper bound on Δ = |Q(D) – Q(D’)|. Again, that upper bound must apply to all possible (D,D’) pairs. For human height, say, we know that no one has height 300 cm, so we could take that value for Δ. If our query Q() is for the mean height, we divide Δ by n; it’s a rather sloppy bound, but it would work.
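
As a sketch of the idea (not any particular library’s implementation), here is the Laplace mechanism for a mean query; rlaplace() is a small helper written for illustration, using the inverse-CDF method.

```r
# Draw from the Laplace (double exponential) distribution.
rlaplace <- function(n, scale) {
   u <- runif(n) - 0.5
   -scale * sign(u) * log(1 - 2 * abs(u))
}

# DP mean query: sensitivity of a mean is Delta/n, so the Laplace
# noise scale is Delta / (n * eps).
dpMean <- function(x, Delta, eps) {
   mean(x) + rlaplace(1, scale = Delta / (length(x) * eps))
}

set.seed(2022)
heights <- rnorm(1000, mean = 170, sd = 10)  # made-up data
dpMean(heights, Delta = 300, eps = 1)  # near the true mean for this n
```

Note how the noise scale shrinks with n, and grows as ε (more privacy) shrinks.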

Problems with DP’s claims of quantifiable guaranteed privacy:

(Many flaws have been claimed for DP, but to my knowledge, this analysis is new.)

Again consider the female EE example. One problem that arises right away is that, since this is a conditional mean, Q(D) and/or Q(D’) will often be undefined.

It’s not clear that existing implementations of DP are prepared to handle this. E.g., consider the diffpriv package; it appears to do nothing to deal with this problem. And with the Google Differential Privacy Library, which uses SQL, we have trouble:

If database access is via SQL, the problem would result in returning NULL if the conditioning set is empty for some D or D’. Since such queries are central to data analysis, this would be a serious problem. 

It’s not clear that the Census Bureau’s TopDown algorithm handles the problem either. They treat the data as a giant contingency table,  adding noise to each cell. All queries are then processed as functions of the cell counts, with no further noise added.

A major issue appears to be that many cells that had non-0 counts originally will now be 0s in the noisy version of the data. The added random noise will produce negative numbers in many cases, and though the Bureau’s procedure changes those to at least 0, many will stay at 0. One then has the “0 denominator” issue described above.

Another issue is queries for totals, say the total salaries for all female workers. The noise that would be added, say based on maximum salary, would be the same regardless of whether there are 10 women in the firm or 1000. While DP would give mathematically correct noise here, having the noise be the same regardless of the overall size of the total seems unwarranted.

Bias problems:

As noted, DP methods that work on contingency tables by adding independent noise values to cell counts can attenuate correlation and thus produce bias. The bias will be substantial for small tables.

Another issue, also in DP contingency table settings, is that bias may occur from post-processing. If Laplacian noise is added to counts, some counts may be negative. As shown in Zhu et al, post-processing to achieve nonnegativity can result in bias.

The US Census Bureau’s adoption of DP:

The Census Bureau’s DP methodology replaces the swapping-based approach used in past census reports. Though my goal in this essay has been mainly to discuss DP in general, I will make a few comments.

First, what does the Bureau intend to do? They will take a 2-phase approach. They view the database as one extremely large contingency table (“histogram,” in their terminology). Then they add noise to the cell counts. Next, they modify the cell counts to satisfy nonnegativity and certain other constraints, e.g. taking the total number of residents in a census block to be invariant. The final perturbed histogram is released to the public.

Why are they doing this? The Bureau’s simulations indicate that, with very large computing resources and possibly external data, an intruder could reconstruct much of the original, unperturbed data.

Criticisms of the Census Bureau’s DP Plan:

In addition to the general concerns about DP, there are also ones specific to the Bureau’s DP methods.

The Bureau concedes that the product is synthetic data. Well, isn’t any perturbed data synthetic? Yes, but here ALL of the data is perturbed, as opposed to swapping, where only a small fraction of the data changes.

Needless to say, then, use of synthetic data has many researchers up in arms. They don’t trust it, and have offered examples of undesirable outcomes, substantial distortions that could badly affect research work in business, industry and science. There is also concern that there will be serious impacts on next year’s congressional redistricting, which strongly relies on census data, though one analysis is more optimistic.

There has already been one lawsuit against the Bureau’s use of DP. Expect a flurry more, after the Bureau releases its data — and after redistricting is done based on that data. So it once again boils down to the privacy/accuracy tradeoff. Critics say the Bureau’s reconstruction scenarios are unlikely and overblown. Again, add to that the illusory nature of DP’s privacy guarantees, and the problem gets even worse.

Final comments:

How did we get here? DP has some serious flaws. Yet it has largely become entrenched in the data privacy field. In addition to being chosen as the basis of the census data, it is used somewhat in industry. Apple, for instance, uses classic noise addition, applied to raw data, but with a DP privacy budget.

As noted, early DP development was done mainly by CS researchers.  CS people view the world in terms of algorithms, so that for example they feel very comfortable with investigating the data reconstruction problem and applying mathematical optimization techniques. But the CS people tend to have poor insight into what problems the users of statistical databases pursue in their day-to-day data analysis activities.

Some mathematical statisticians entered the picture later, but by then DP had acquired great momentum, and some intriguing theoretical problems had arisen for the math stat people to work on. Major practical issues, such as that of conditional quantities, were overlooked.

In any case, the major players in DP continue to be in CS. For example, all 10 of the signatories in a document defending the Census Bureau DP plan are in CS.

In other words, in my view, the statistician input into the development of DP came too little, too late. Also, to be frank, the DP community has not always been open to criticism, such as the skeptical material in Bambauer, Muralidhar and Sarathy.

Statistical disclosure control is arguably one of the most important data science issues we are facing today.  Bambauer et al in the above link sum up the situation quite well:

The legal community has been misled into thinking that differential privacy can offer the benefits of data research without sacrificing privacy. In fact, differential privacy will usually produce either very wrong research results or very useless privacy protections. Policymakers and data stewards will have to rely on a mix of approaches: perhaps differential privacy where it is well-suited to the task, and other disclosure prevention techniques in the great majority of situations where it isn’t.

A careful reassessment of the situation is urgently needed.


The Notion of “Double Descent”

I tend to be blasé about breathless claims of “new” methods and concepts in statistics and machine learning. Most are “variations on a theme.” However, the notion of double descent, which has come into prominence in the last few years, is something I regard as genuinely new and very relevant, shedding important light on the central issue of model overfitting.

In this post, I’ll explain the idea, and illustrate it using code from my regtools R package.  (See also a related discussion, with a spline example by Daniela Witten.)

The idea itself is not new, but is taking special importance these days, in its possibly shedding light on deep learning. It seems to suggest an answer to the question, “Why do DL networks often work well, in spite of being drastically overparameterized?”

Classical statistical thinking, including in my own books, is that the graph of loss L (e.g. misclassification rate) on new data against model complexity C should be U-shaped. As C first moves away from 0, bias is greatly reduced while variance increases only slightly. The curve is in descent. But once C passes a minimum point, the curve will go back up, as variance eventually overwhelms bias.

(Technical note: Bias here is the expected value of the difference between the predicted values of the richer and coarser models, under the richer one, and taken over the distribution of the predictors.)

As C increases, we will eventually reach the point of saturation (nowadays termed interpolation), in which we have 0 training error. A common example is linear regression with a polynomial model in a single predictor variable. Here C is the degree of the polynomial. If we have n = 100 data points, a 99th-degree polynomial will fit the training data exactly — but will be terrible in predicting new data.
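To see the interpolation phenomenon concretely, here is a small simulation of my own (not from the original analysis): with n = 20 points, a degree-19 polynomial drives the training error to essentially 0, while its error on new data balloons.

```r
# Toy illustration: with n points, a degree n-1 polynomial interpolates
# the training data, but predicts new data terribly.
set.seed(1)
n <- 20
x <- runif(n)
y <- sin(2*pi*x) + rnorm(n, sd = 0.1)
xnew <- runif(1000)                       # "new data"
ynew <- sin(2*pi*xnew) + rnorm(1000, sd = 0.1)

fit3  <- lm(y ~ poly(x, 3))               # modest model
fit19 <- lm(y ~ poly(x, n - 1))           # interpolation: 20 coefs, 20 points

mean(fit3$residuals^2)     # small but nonzero training error
mean(fit19$residuals^2)    # essentially 0 -- perfect fit to the training set
mean((predict(fit3,  data.frame(x = xnew)) - ynew)^2)
mean((predict(fit19, data.frame(x = xnew)) - ynew)^2)   # far larger
```

Note that poly() uses orthogonal polynomials, which keeps even the degree-19 fit numerically stable.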

But what if we dare to fit a polynomial of higher degree than 99 anyway, i.e. what if we deliberately overfit? The problem now becomes indeterminate — there is now no unique solution to the problem of minimizing the error sum of squares. There are in fact infinitely many solutions. But actually, that can be a virtue; among those infinitely many solutions, we may be able to choose one that is really good.

“Good” would of course mean that it is able to predict new cases well. How might we find such a solution?

I’ll continue with the linear regression example, though not assume a polynomial model. First, some notation. Let β denote our population coefficient vector, and b its estimate. Let ||b|| denote the vector norm of b. Let p denote the number of predictor variables. Again, if we have p predictor variables, we will get a perfect fit in the training set if p = n-1.

If you are familiar with the LASSO, you know that we may be able to do better than computing b to be the OLS estimate; a shrunken version of OLS may do better. Well, which shrunken version? How about the minimum-norm solution?

Before the interpolation point, our unique OLS solution is the famous

b = (X’X)^(-1) X’Y

This can also be obtained as b = X^(-) Y, where X^(-) is a generalized inverse of X. It agrees with the classic formula before the interpolation point, but it can be used after the interpolation point as well. And the key point is then that one choice of generalized inverse, the Moore-Penrose inverse, gives the minimum-norm solution.
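Here is a small sketch of my own (simulated data, not from the post) showing the minimum-norm property: with p > n, ginv() from the MASS package still produces an exact fit to the training data, and any other exact solution has a larger norm.

```r
# Sketch: past interpolation (p > n), infinitely many b fit the training
# data exactly; the Moore-Penrose inverse picks the one of minimum norm.
library(MASS)
set.seed(1)
n <- 50; p <- 100
X <- matrix(rnorm(n*p), n, p)
b0 <- c(rep(1, 5), rep(0, p - 5))      # true coefficients, mostly 0
Y <- X %*% b0 + rnorm(n, sd = 0.1)

bMP <- ginv(X) %*% Y                   # minimum-norm solution
max(abs(X %*% bMP - Y))                # essentially 0: exact fit

# perturb bMP by a null-space vector: still an exact fit, but larger norm
ns <- Null(t(X))                       # columns v satisfy X %*% v = 0
bAlt <- bMP + ns %*% rnorm(ncol(ns))
max(abs(X %*% bAlt - Y))               # still essentially 0
c(sum(bMP^2), sum(bAlt^2))             # bMP has the smaller squared norm
```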

This minimum-norm property reduces variance. So, not only will MP allow us to go past the interpolation point, it will also reduce variance, possibly causing the L-C curve to descend a second time! We now have a double U-shape (there could be more).

And if we’re lucky, in this second U-shape, we may reach a lower minimum than we had in the original one. If so, it will have paid to overfit!

There are some subtleties here. The minimum norm means minimum among all solutions for fixed p. As p increases, the space over which we are minimizing grows richer, thus more opportunities for minimizing the norm. On the other hand, the variance of that minimum-norm solution is increasing with p. Meanwhile the bias is presumably staying constant, since all solutions have a training set error of 0. So the dynamics underlying this second U would seem to be different from those of the first.

Now, what about the implication for deep learning? DL computation typically uses stochastic gradient descent, and there has been theoretical and empirical work indicating that SGD tends to settle on a minimum-norm solution. This may explain the examples of success of DL in spite of overfitting.

Empirical illustration:

Here I’ll work with the Million Song dataset. It consists of 90 audio measurements made on about 500,000 songs (not a million) from 1922 to 2011. The goal is to predict the year of release from the audio.

I took the first p predictors, varying p from 2 to 30, and fit a quadratic model, with O(p^2) predictors resulting.

One of the newer features of regtools is its qe*-series (“Quick and Easy”) of functions. They are extremely simple to use, all having the call format qeX(d,’yname’), to predict the specified Y in the data frame d. Again, the emphasis is on simplicity; the above call is all one needs, so for example there is no preliminary code for defining a model. They are all wrappers for standard R package functions, and are paired with predict() and plot() wrappers.

Currently there are 9 different machine learning algorithms available, e.g. qeSVM() and qeRF(), for SVM and random forests. Here we will use qePoly(), which wraps our polyreg package. Note by the way that the latter correctly handles dummy variables (e.g. no powers of a dummy are formed). Note too that qePoly() computes the Moore-Penrose inverse if we are past interpolation, using the function ginv() from the MASS package.

To put this experiment on a meaningful scale, I generated random training sets of size n = 250, and took the rest of the data as the test set. I used from 2 to 30 predictor variables, with Mean Absolute Prediction Error as my accuracy criterion. Here are the results:

Well, sure enough, there it is, the second U. The interpolation point is between 22 and 23 predictors (there is no “in-between” configuration), where there are 265 parameters, overfitting our n = 250.

Alas, the minimum in the second U is not lower than that of the first, so overfitting hasn’t paid off here. But it does illustrate the concept of double-descent.



overfit <- function(nreps,n,maxP) {
   nas <- rep(NA,nreps*(maxP-1))
   outdf <- data.frame(p=nas,mape=nas)
   rownum <- 0
   for (i in 1:nreps) {
      idxs <- sample(1:nrow(yr),n)
      trn <- yr[idxs,]
      tst <- yr[-idxs,]
      for (p in 2:maxP) {
         rownum <- rownum + 1
         # fit a degree-2 model on the first p predictors; Y is column 1, 'V1'
         # (call reconstructed from a garbled line in the original)
         out <- qePoly(trn[,1:(p+1)],'V1',2,holdout=NULL)
         preds <- predict(out,tst[,-1])
         mape <- mean(abs(preds - tst[,1]))
         outdf[rownum,1] <- p
         outdf[rownum,2] <- mape
         print(outdf[rownum,])
      }
   }
   outdf  # run through tapply() for the graph
}

Efron Updates Breiman’s “Two Cultures” Essay

The June 2020 issue of JASA features a highly insightful essay by Brad Efron, dean of the world’s statisticians. The article is accompanied by commentary by a number of statistical luminaries. Important, indeed central questions are raised. I would hope JASA runs more pieces of this nature.

The essay may be viewed as an update of the classic 2001 article by Leo Breiman, regarding two largely separate cultures, statistics and machine learning (ML), updating not only in the sense of progress made since 2001, but also in the divide between Prediction and Description (“Attribution”) goals of regression analysis. (I of course use the latter term to mean estimation of E(Y | X), whether parametrically or not.)

I’ll provide my own commentary here, and, this being a statistics/R blog, bring in some related comments on R. I’ll have two main themes:

  • The gap between the two cultures is as wide today as ever.
  • There are fundamental problems with both cultural approaches, with a lot of “the emperor has no clothes” issues.

Parametric vs. Nonparametric Methods

Note carefully that the cultural difference is NOT one of parametric vs. nonparametric methods. For instance, statisticians were working on k-Nearest Neighbors as far back as the 1950s, if not earlier. Tree-based methods, notably random forests, were developed largely by statisticians in the 1970s, and arguably the best implementations are in R, the statisticians’ language, rather than Python, the MLers’ language.

Sampling Variation

Instead, I believe the widest gap between the two cultures involves sampling variation. It forms the very core of statistics, while ML mostly ignores it. I’ve observed this over the years in writings and statements by ML people, and in conversation with them.

This was exemplified in a recent Twitter exchange I had with a prominent ML researcher. I’d lamented that ML people don’t care about sampling variation, and the ML person responded, “What about causal models? Sampling variation is not an issue there.” But of course it IS an issue, a major one. Is the causal effect one seems to have found real, or just a sampling accident?

By the way, this extends to the research level, resulting in statistical work often focusing on asymptotics, while in ML the focus is on proving finite-sample inequalities about the data itself, with no connection to a population.

Frankly, I find that ML people just don’t seem to find statistics relevant to their work, and often don’t know much about stat. For example, in a book written by another prominent MLer, he recommends standardizing data to mean-0, variance-1 form before performing neural network analysis. Indeed, for reasons that seem to be unknown, this is mandatory, as the method may not converge otherwise. But I was shocked to see the MLer write that standardization presumes a Gaussian distribution, which is certainly not true.
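The claim is easy to debunk with a couple of lines of base-R (my own toy check): scale() produces mean 0 and variance 1 for any distribution, Gaussian or not, and leaves the shape of the distribution untouched.

```r
# Standardization neither presumes nor produces a Gaussian distribution;
# scale() merely subtracts the mean and divides by the standard deviation.
set.seed(1)
x <- rexp(100000)    # strongly right-skewed, decidedly non-Gaussian
z <- scale(x)
mean(z)              # essentially 0
sd(z)                # exactly 1
# skewness is invariant under a linear transformation:
mean(z^3)            # still about 2, the skewness of the exponential
```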

R vs. Python (and R vs. R)

As noted, statisticians tend to use R, while ML people almost universally use Python. This in turn stems from the fact that ML has basically become a computer science field. Python is common, virtually standard in CS coursework and much of the CS research, so it is natural for MLers to gravitate to that language. R, of course, has been the lingua franca of stat.

That’s all fine, but it is adding to the cultural gap, to the detriment of both camps. Years ago I wrote, “R is written by statisticians, for statisticians” and that R is Statistically Correct. Sadly, in my view the Python ML software doesn’t meet that standard, which again stems from the general lack of statistical expertise in ML. On the other hand, there are a number of “full pipeline” packages in Python, notably for image classification, with nothing comparable in R.

To their credit, JJ Allaire of RStudio and Wes McKinney of the Python world have worked on tools to “translate” between the two languages. But I’m told by a prominent R developer that there is resistance on the Python side. Hopefully this will change.

Meanwhile, R itself is undergoing its own cultural split(s). On the language level, there is the base-R vs. Tidyverse split. (I consider the latter to be taking R in the wrong direction). But beyond that, a large and growing segment of R users see the language only as a tool for data wrangling (e.g. case filtering) rather than statistics. It should be viewed as both, and the lack of such a view will be problematic for stat users in the coming years, I believe.

“The Emperor Has No Clothes”

There are serious problems within each of the two cultures. So far, analysts have succeeded in blithely ignoring them, but I worry that this can’t continue.

Heavy use of cross-validation:

Efron notes, “A crucial ingredient of modern prediction methods is the training/test set paradigm…” This is combined with grid search to find the “best” combination of tuning parameters. (Called hyperparameters in the ML field. ML has its own terminology, e.g. inference for what stat people call prediction. Yet another cultural divide.)

For some ML methods, the number of tuning parameters can be large, and when combined in Cartesian product with many values for each one, we have a very large search space. That in turn raises a simultaneous inference problem, colloquially known today as p-hacking. The parameter combination that appears best may be far from best.
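A quick simulation of my own makes the point: evaluate 200 tuning-parameter combinations that are all exactly equally good, and the grid-search “winner” still looks substantially better than the truth.

```r
# Simulated grid search: 200 hyperparameter combinations, each with the SAME
# true accuracy 0.70, each assessed on a test set of 500 cases. The apparent
# winner beats 0.70 by chance alone -- p-hacking in action.
set.seed(1)
nCombos <- 200; nTest <- 500; trueAcc <- 0.70
observedAcc <- rbinom(nCombos, nTest, trueAcc) / nTest
max(observedAcc)    # noticeably above 0.70, yet purely illusory
mean(observedAcc)   # close to the true 0.70
```

Simultaneous-inference adjustments are one guard against taking that maximum at face value.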

The fineTuning() function in my regtools package computes Bonferroni (Dunn) intervals for the various combinations as a way to partially deal with the problem. It also features a graphical exploration tool.

Similarly, some of those dazzling results from ML methods in competitions need to be taken with a grain of salt. There presumably is a p-hacking problem here as well; different contestants keep trying various parameter values and algorithm tweaks until they get a better value — whether by genuinely better technique or by random accident. A probability record-values analysis would be interesting here. Also see the impressive empirical findings in Rebecca Roelofs’ dissertation.

Smoothness assumptions:

To me, the most interesting, thought-provoking part of Efron’s essay is the material on smoothness. But it brings to mind another “emperor has no clothes” issue.

The fact is that in real life, there is no such thing as smoothness, since all measurements are discrete rather than continuous. At best, human height is measured, say, only to 0.2 centimeter and air temperature to, say, 0.1 degree. Hence there are not even first derivatives, let alone higher-order ones. This calls into question stat theory (“Assume a continuous second derivative…”), and more importantly, any formulaic (as opposed to aesthetic) method for choosing the smoothness of a fitted curve.


Overparameterization:

Classic parametric theory has results along the lines of needing p = o(sqrt(n)) to obtain consistent estimates. There are various nonparametric results of this nature. There are more recent results regarding p > n, but with conditions that are typically unverifiable.

On the ML side, adherents have shown a number of successes in which p >> n. A typical “optimal” neural network may, after all the dust clears, have tens of millions of parameters (hidden-layer weights) with n only in the tens of thousands. Some NN specialists have even said overfitting is actually desirable. Yet, to my knowledge, there is no theoretical justification for this. (Efron notes this too.) Is it true? Or is there something else going on?

Neural networks as black boxes:

For that matter, what makes NNs work anyway? Efron describes them as “essentially elaborate logistic regression programs.” With a sigmoid activation function, and with the understanding that logits are composed together rather than, say, added, that is true. However, in our work we show something more basic: NNs are essentially performing good old-fashioned polynomial regression. This has practical implications, e.g. that multicollinearity among the layer outputs increases from layer to layer. (We also have a corresponding R package, polyreg.)

“Rectangular” data:

In recent years, it has been found that for most applications, NNs will not do especially well, unlike their successes with image classification for instance. Out of this has come Conventional Wisdom statements like, “For rectangular data, you may do better with non-NN methods.”

But upon closer inspection, one sees this particular “emperor” to be especially unclothed. Consider the famous MNIST data, consisting of 65,000 28×28 images. One can store this in a 65000 x 784 matrix (28×28 = 784). In other words, n = 65000 and p = 784. Isn’t that rectangular? One prominent MLer uses the term perceptive data instead. That is not a satisfying explanation at all, nor, as noted, is the “rectangular” one.

So, are image and natural language data, putative areas of excellence for NNs, really different from all other areas? This seems unlikely. Instead, the likely explanation is that enormous efforts were expended in these areas, and NNs happened to be the tool that was used. Indeed, some researchers have taken the outputs of convolutional layers (which are just classic image operations), and run them through SVMs, obtaining results equal to or better than NNs.

So, what do we have? This discussion brings to mind an old paper, delightfully titled “What’s Not What in Statistics.” There is so much more that could be added today.

New Book on Machine Learning

I’m nearing completion of writing my new book, The Art of Machine Learning: Algorithms+Data+R, to be published by the whimsically named No Starch Press. I’m making a rough, partial draft available, and welcome corrections, suggestions and comments.

I’ve been considering doing such a project for some time, intending to write a book that would on the one hand serve as “machine learning for the masses” while ardently avoiding being of a “cookbook” nature. In other words, the book has two goals:

  • The math content is kept to a minimum. Readers need only be able to understand scatter plots and the like, and know the concept of the slope of a line. (For readers who wish to delve into the math, a Math Companion document will be available.)
  • There is strong emphasis on building a solid intuitive understanding of the methods, empowering the reader to conduct effective, penetrating ML analysis.

As I write in the preface (“To the Reader”),

“Those dazzling ML successes you’ve heard about come only after careful, lengthy tuning and thought on the analyst’s part, requiring real insight. This book aims to develop that insight.”

The language of instruction is R, using standard CRAN packages. But as I also write,

“…this is a book on ML, not a book on using R in ML. True, R plays a major supporting role and we use prominent R packages for ML throughout the book, with code on almost every page. But in order to be able to use ML well, the reader should focus on the structure and interpretation of the ML models themselves; R is just a tool toward that end.”

So, take a look, and let me know what you think!

FOCI: a new method for feature selection

It’s been a while since I’ve posted here, not for lack of material but just having too long a TO DO list. I’ve added some important features to my regtools package, for instance, and hope to discuss them here soon.

For the present post, though, I will discuss FOCI, a highly promising new approach to feature selection, developed by Mona Azadkia and Sourav Chatterjee of the Stanford Stat Dept. There is now an R package of the same name on CRAN.

I am one of the authors of the package, but was not involved in developing the method or the associated theory. My connection here was accidental. A few months ago, Sourav gave a talk on FOCI at Stanford. I must admit that I had been skeptical when I read the abstract. I had seen so many proposals on this topic, and had considered them impractical, full of unverifiable assumptions and rather opaque indications regarding asymptotic behavior. So, I almost didn’t attend, even though I was on the Stanford campus that day.

But I did attend, and was won over. The method has only minimal assumptions, explicit asymptotic behavior, and lo and behold, no tuning parameters.

I tried FOCI on some favorite datasets that evening, and was pleased with the results, but the algorithm was slow even on moderate-sized datasets. So I wrote to Sourav, praising the method but suggesting that they parallelize the software. He then asked if I might get involved in that, which I did.

Mona and Sourav’s arXiv paper, which extends earlier work by Sourav, presents the theory. It is highly technical, I’d say very intricate even by math stat standards, thus not for the faint of heart. Actually, it was amusing during Sourav’s talk when one of the faculty said, “Wait, don’t leave that slide yet!”, as the main estimator (T_n, p.3) shown on that slide is a challenge to interpret.

The population-level quantity, Eqn. (2.1), is more intuitive. Say we are predicting Y from a vector-valued X, and are considering extending the feature set to (X,Z). The expression is still a bit complex — and please direct your queries to Mona and Sourav for details, not to me 🙂 — but at its core it is comparing Var(Y | X,Z) to Var(Y | X). How much smaller is the former than the latter? Of course, there are expectations applied etc., but that is the intuitive heart of the matter.
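Here is a toy numerical version of that comparison (my own illustration, using base-R’s tapply(), no less): when Z genuinely carries information about Y beyond X, the estimated conditional variance drops.

```r
# Crude, binning-based estimates of E[Var(Y|X)] vs. E[Var(Y|X,Z)],
# the comparison at the heart of FOCI's population quantity.
set.seed(1)
n <- 100000
x <- rnorm(n); z <- rnorm(n)
y <- x + z + rnorm(n, sd = 0.5)        # Z matters, even given X

xb <- cut(x, quantile(x, 0:50/50), include.lowest = TRUE)   # 50 X bins
zb <- cut(z, quantile(z, 0:10/10), include.lowest = TRUE)   # 10 Z bins

vYgivenX  <- mean(tapply(y, xb, var), na.rm = TRUE)         # approx. Var(Z) + 0.25
vYgivenXZ <- mean(tapply(y, list(xb, zb), var), na.rm = TRUE)
c(vYgivenX, vYgivenXZ)    # the second is much smaller
```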

So here is a concrete example, the African Soil remote sensor dataset from Kaggle. After removing the ID column and converting a character column to an R factor and then to dummies, I had a data frame with n = 1157 and p = 3578. So p > n, with p almost triple n, a nice test for FOCI. There are various outcome variables; I chose pH for Y.

The actual call is simple:

fout <- foci(afrsoil[,3597],afrsoil[,1:3578])

This specifies the outcome variable and features. There are some other options available. The main part of the output is the list of selected variables, in order of importance. After that, the cumulative FOCI correlations are shown:

> fout$selectedVar
    index    names
 1:  3024 m1668.14
 2:  3078    m1564
 3:  2413 m2846.45
 4:  3173  m1380.8
 5:  3267 m1199.52
 6:  1542 m4526.16
 7:  2546 m2589.96
 8:  3537 m678.828
 9:  3469 m809.965
10:  2624 m2439.54
11:  1874  m3885.9
12:  2770 m2157.98
> fout$stepT[1:12]
 [1] 0.1592940 0.3273176 0.4201255 0.5271957 0.5642336 0.5917753 0.6040584
 [8] 0.6132360 0.6139810 0.6238901 0.6300294 0.6303839

Of course, those feature names, e.g. ‘m1564’, are meaningful only to a soil scientist, but you get the idea. Out of 3578 candidate features, FOCI chose 12, attaining very strong dimension reduction. Actually, the first 8 or so form the bulk of the selected feature set, with the remainder yielding only marginal improvements.

But is it a GOOD feature set? We don’t know in this case, but Mona and Sourav have a very impressive example in the paper, in which FOCI outperforms random forests, LASSO, SCAD and so on. That, combined with the intuitive simplicity of the estimator etc., makes FOCI a very promising addition to anyone’s statistical toolkit. I’m sure more empirical comparisons are forthcoming.

And what about the computation speed? Two types of parallelism are available, with best results generally coming from using both in conjunction:

  1. At any step, the testing of the remaining candidate features can be done simultaneously, on multiple threads on the host machine.
  2. On a cluster or in the cloud, one can ask that the data be partitioned in chunks of rows, with each cluster node applying FOCI to its chunk. The union of the results is then formed, and fed through FOCI one more time to adjust the discrepancies. The idea is that that last step will not be too lengthy, as the number of candidate variables has already been reduced. A cluster size of r may actually produce a speedup factor of more than r (Matloff 2016).
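Here is a rough sketch of approach (2) (my own sketch, not the package’s actual implementation; the foci() call format is assumed to be as in the example above). The selector is passed in as a function, so the scheme works with any feature-selection method:

```r
# Chunked feature selection: select within each chunk of rows in
# parallel, then re-run the selector on the union of the chunks' choices.
library(parallel)

chunkedSelect <- function(y, x, nChunks, selector) {
   rowChunks <- split(seq_along(y), cut(seq_along(y), nChunks, labels = FALSE))
   perChunk <- mclapply(rowChunks, function(idxs)
      selector(y[idxs], x[idxs, , drop = FALSE]), mc.cores = nChunks)
   candidates <- sort(unique(unlist(perChunk)))    # union of chunk choices
   # final adjustment pass on the full data, now over far fewer candidates
   candidates[selector(y, x[, candidates, drop = FALSE])]
}

# with FOCI (call format assumed from the example above):
#   chunkedSelect(y, x, 4, function(y, x) foci(y, x)$selectedVar$index)
```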

Approach (2) also has the advantage that it gives the user an idea of the degree of sampling variability in the FOCI output.

So there you have it. Give FOCI a try!
