Use of Differential Privacy in the US Census–All for Nothing?

The field of data privacy has long been of broad interest. In a medical database, for instance, how can administrators enable statistical analysis by medical researchers, while at the same time protecting the privacy of individual patients? Over the years, many methods have been proposed and used. I’ve done some work in the area myself.

But in 2006, an approach known as differential privacy (DP) was proposed by a group of prominent cryptography researchers. With its catchy name and theoretical underpinnings, DP immediately attracted lots of attention. Because it is more mathematical than many other statistical disclosure control methods, and thus good fodder for theoretical research, it quickly led to a flurry of research papers showing how to apply DP in various settings.

DP was also adopted by some firms in industry, notably Apple. But what really gave DP a boost was the decision by the US Census Bureau to use DP for their publicly available data, beginning with the most recent census, 2020. On the other hand, that really intensified the opposition to DP. I have my own concerns about the method.

The Bureau, though, had what it considered a compelling reason to abandon their existing privacy methods: Their extensive computer simulations showed that current methods were vulnerable to attack, in such a manner as to exactly reconstruct large portions of the “private” version of the census database. This of course must be avoided at all costs, and DP was implemented.

But now…it turns out that the Bureau’s claim of reconstructivity was incorrect, according to a recent paper by Krishna Muralidhar, who writes,

“This study shows that there are a practically infinite number of possible reconstructions, and each reconstruction leads to assigning a different identity to the respondents in the reconstructed data. The results reported by the Census Bureau researchers are based on just one of these infinite possible reconstructions and is easily refuted by an alternate reconstruction.”

This is one of the most startling statements I’ve seen in my many years in academia. It would appear that the Bureau committed a “rush to judgment” on a massive scale, which is just mind-boggling, and in addition (much less momentous but still very concerning) gave its imprimatur to a methodology that many believe has serious flaws.
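To make the non-uniqueness point concrete, here is a toy sketch in R. It is not the analysis from Muralidhar's paper or from the Bureau; the four-person "block" and the variable names are made up for illustration. It only shows that two different sets of records can reproduce the same published one-way counts.

```r
# Two hypothetical candidate "reconstructions" of a four-person block.
# The data are invented purely for illustration.
recon_a <- data.frame(sex = c("F", "F", "M", "M"),
                      age = c("young", "old", "young", "old"))
recon_b <- data.frame(sex = c("F", "F", "M", "M"),
                      age = c("young", "young", "old", "old"))

# Both candidates reproduce the same one-way counts by sex and by age ...
same_sex_counts <- identical(table(recon_a$sex), table(recon_b$sex))
same_age_counts <- identical(table(recon_a$age), table(recon_b$age))

# ... yet they disagree on the joint sex-by-age records.
same_joint <- identical(table(recon_a$sex, recon_a$age),
                        table(recon_b$sex, recon_b$age))

print(c(sex = same_sex_counts, age = same_age_counts, joint = same_joint))
```

Real census tables are far richer than two one-way margins, so this says nothing about how many reconstructions fit the actual published data; it merely shows why matching the published tables does not by itself pin down the individual records.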


Base-R and Tidyverse Code, Side-by-Side

I have a new short writeup, showing common R design patterns, implemented side-by-side in base-R and Tidy.

As readers of this blog know, I strongly believe that Tidy is a poor tool for teaching R learners who have no coding background. Relative to learning in a base-R environment, learners using Tidy take longer to become proficient, and once proficient, find that they are only equipped to work in a very narrow range of operations. As a result, we see a flurry of online questions from Tidy users asking “How do I do such-and-such,” when a base-R solution would be simple and straightforward.

I believe the examples here illustrate that base-R solutions tend to be simpler, and thus that base-R is a better vehicle for R learners. However, another use of this document would be as a tutorial for base-R users who want to learn Tidy, and vice versa.

A New Approach to Fairness in Machine Learning

During the last year or so, I’ve been quite interested in the issue of fairness in machine learning. This area is more personal for me, as it is the confluence of several interests of mine:

  • My lifelong activity in probability theory, math stat and stat methodology (in which I include ML).
  • My lifelong activism aimed at achieving social justice.
  • My extensive service as an expert witness in litigation involving discrimination (including a landmark age discrimination case, Reid v. Google).

(Further details in my bio.) I hope I will be able to make valued contributions.

My first of two papers in the Fair ML area is now on arXiv. The second should be ready in a couple of weeks.

The present paper, with my former student Wenxi Zhang, is titled, A Novel Regularization Approach to Fair ML. It’s applicable to linear models, random forests and k-NN, and could be adapted to other ML models.

Wenxi and I have a ready-to-use R package for the method, EDFfair. It uses my qeML machine learning library. Both are on GitHub for now, but will go onto CRAN in the next few weeks.

Please try the package out on your favorite fair ML datasets. Feedback, both on the method and the software, would be greatly appreciated.

Base-R Is Alive and Well

As many readers of this blog know, I strongly believe that R learners should be taught base-R, not the tidyverse. Eventually the students may settle on using a mix of the two paradigms, but at the learning stage they will benefit from the fact that base-R is simpler and more powerful. I’ve written my thoughts in a detailed essay.

One of the most powerful tools in base-R is tapply(), a real workhorse. I give several examples in my essay in which that function is much simpler and easier to use than the tidyverse equivalent.
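A minimal sketch of the pattern, using the built-in mtcars data rather than one of the essay's examples. The Tidy counterpart is shown only as a comment, and it assumes dplyr is installed.

```r
# Mean miles-per-gallon within each cylinder group, in one base-R call.
# tapply(X, INDEX, FUN) splits X by INDEX and applies FUN to each piece.
mpg_by_cyl <- tapply(mtcars$mpg, mtcars$cyl, mean)
print(mpg_by_cyl)

# A commonly used Tidy counterpart (assumes dplyr is installed; commented out
# so that this block runs in base-R alone):
# library(dplyr)
# mtcars %>% group_by(cyl) %>% summarize(mean_mpg = mean(mpg))

# Two grouping factors give a two-way table of group means;
# empty cells (no cars with that combination) come back as NA.
mpg_by_cyl_gear <- tapply(mtcars$mpg, list(mtcars$cyl, mtcars$gear), mean)
print(mpg_by_cyl_gear)
```

The result of the first call is a named vector, so a single group can be picked out directly, e.g. mpg_by_cyl["4"].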

Yet somehow there is a disdain for tapply() among many who use and teach Tidy. To them, the function is the epitome of “what’s wrong with” base-R. The latest example of this attitude arose on Twitter a few days ago, where two Tidy supporters were mocking tapply(), treating it as a highly niche function with no value in ordinary daily usage of R. They strongly disagreed with my “workhorse” claim, until I showed them that in the code of ggplot2, Hadley has 7 calls to tapply().

So I did a little investigation of well-known R packages by RStudio and others. The results, which I’ve added as a new section in my essay, are excerpted below.


All the breathless claims that Tidy is more modern and clearer, while base-R is old-fashioned and unclear, fly in the face of the fact that RStudio developers, and authors of other prominent R packages, tend to write in base-R, not Tidy. And all of them use some base-R instead of the corresponding Tidy constructs.

package     *apply() calls  mutate() calls
brms        333             0
broom       38              58
datapasta   31              0
forecast    82              0
future      71              0
ggplot2     78              0
glmnet      92              0
gt          112             87
knitr       73              0
naniar      3               44
parsnip     45              33
purrr       10              0
rmarkdown   0               0
RSQLite     14              0
tensorflow  32              0
tidymodels  8               0
tidytext    5               6
tsibble     8               19
VIM         117             19

These are striking numbers to those who learned R via a tidyverse course. In particular, mutate() is one of the very first verbs one learns in a Tidy course, yet mutate() is used 0 times in most of the above packages. And even the packages that call this function a lot also have plenty of calls to the base-R *apply() functions, which Tidy is supposed to replace.
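For readers who want to run a similar tally on a package's source tree, here is one crude way it could be done in base-R. This is only a sketch and not necessarily how the counts above were obtained; count_calls() is a hypothetical helper, and the regular expressions are simplifying assumptions that will also match text inside comments and strings.

```r
# Hypothetical helper: count regex matches across all .R files under a directory.
count_calls <- function(src_dir, pattern) {
  files <- list.files(src_dir, pattern = "\\.[Rr]$",
                      recursive = TRUE, full.names = TRUE)
  total <- 0
  for (f in files) {
    lines <- readLines(f, warn = FALSE)
    hits <- gregexpr(pattern, lines, perl = TRUE)
    # gregexpr() returns -1 for a line with no match, so count positions > 0.
    total <- total + sum(vapply(hits, function(h) sum(h > 0), numeric(1)))
  }
  total
}

# Tiny stand-in for a package source directory, so the sketch is self-contained.
demo_dir <- tempfile("pkgsrc")
dir.create(demo_dir)
writeLines(c(
  "f <- function(x, g) tapply(x, g, mean)",
  "h <- function(l) sapply(l, length)",
  "k <- function(d) dplyr::mutate(d, z = 1)"
), file.path(demo_dir, "demo.R"))

# Crude patterns: any name ending in 'apply(' (apply, lapply, sapply, tapply, ...)
# and any call to 'mutate('.
n_apply  <- count_calls(demo_dir, "\\b[a-z]*apply\\(")
n_mutate <- count_calls(demo_dir, "\\bmutate\\(")
print(c(apply = n_apply, mutate = n_mutate))
```

To try it on a real package, point src_dir at the R/ directory of an unpacked source tarball or a cloned repository.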

Now, why do these prominent R developers often use base-R, rather than the allegedly “modern and clearer” Tidy? Because base-R is easier.

And if it’s easier for them, it’s even easier for R learners. In fact, an article discussed later in this essay, aggressively promoting Tidy, actually accuses students who use base-R instead of Tidy of taking the easy way out. Easier, indeed!