All posts by matloff

See bio at http://heather.cs.ucdavis.edu/matloff.html

New Statistics Tutorial

I’ve recently completed fastStat, https://github.com/matloff/fastStat, a quick introduction to statistics for those who’ve had a calculus-based probability course. Many such people later need to do statistics, and this tutorial will give them quick access. It is modeled after my R tutorial, https://github.com/matloff/fasteR, a quick introduction to R.

It is not just a quick introduction, but a REAL one, a practical one. Even those who already know statistics will find that they learn from this tutorial.

I write at the top of the tutorial,

…many people know the mechanics of statistics very well, without truly understanding on an intuitive level what those equations are really doing, and this is our focus… For example, consider estimator bias. Students in a math stat course learn the mathematical definition of bias, after which they learn that the sample mean is unbiased and that the sample variance can be adjusted to be unbiased. But that is the last they hear of the issue… most estimators are in fact biased, and lack ‘fixes’ like that of the sample variance. Does it matter? None of that is discussed in textbooks and courses.

The tutorial begins with sampling, then gives a realistic view of parametric models, estimation, standard errors, statistical inference, the Bias-Variance Tradeoff, and multivariate distributions.

It then moves to a major section on prediction, using both classical parametric and machine learning methods. Emphasis is again on the Bias-Variance Tradeoff, with a view toward overfitting. A fresh view of the latter is presented.

Finally, there is an overview of data privacy methods, of major importance today.

Take a look! Comments welcome.

Just How Good Is ChatGPT in Data Science?

Many of you may have heard of ChatGPT, a dazzling new AI tool. We are hearing lots of gushing praise for the tool. Well, how well does it do in data science contexts? I tried a few queries here, and found interesting results.

I first requested, “Write an R function that returns every other element of a vector x, starting with the third.” I won’t show the code here, but suffice it to say that it worked! It did give me correct code.
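For concreteness, here is one base-R way to do that task (my own sketch, not the code ChatGPT produced):

everyOtherFromThird <- function(x) x[seq(3, length(x), by = 2)]
everyOtherFromThird(1:10)   # 3 5 7 9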

Next, “In R’s built-in dataset CO2, fit a linear regression with ‘uptake’ as the outcome variable, and find the standard error of the intercept term.” Alas, it pleaded ignorance: “…I do not have the ability to perform calculations or access specific datasets…” Not really true; unfortunately, ChatGPT was not able to pick up on the key phrase, “built-in.”

I then gave it what turned out to be an easier question: “In the output of R’s ‘lm’ function, how do I find the standard error of the intercept term?” It returned a correct and useful answer, stating that ‘lm’ returns lots of material, “…including the standard errors of the model coefficients. The standard error of the intercept term can be accessed by calling the summary function on the model object and looking at the output…”
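For reference, a minimal version of what that advice amounts to, using the very built-in CO2 dataset that ChatGPT balked at (my code, not ChatGPT’s):

fit <- lm(uptake ~ conc, data = CO2)
summary(fit)$coefficients["(Intercept)", "Std. Error"]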

Next: “In R, what are the advantages of using the density() function in lieu of hist()?” The reply had some correct ingredients, but also had problems:

“…the resulting plot [from density()] is smoother and more continuous, which can make it easier to see the underlying distribution of the data. This can be especially useful when the data is noisy or when the number of data points is small…The density function allows you to specify the bandwidth, or the smoothness, of the curve, which can be useful for highlighting different features of the data. In contrast, the hist function does not allow you to control the smoothness of the resulting plot.”

Putting aside the fact that the phrase “more continuous” is nonsense, there are other issues. It is not necessarily true that density() will do better on small/noisy data. And the app misses the point that hist() does have an analog of the bandwidth, namely the bin width.
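To make the bandwidth/bin-width point concrete (my own illustration, not part of the ChatGPT exchange):

x <- rnorm(75)
plot(density(x, bw = 0.5))   # kernel density estimate; bw controls smoothness
hist(x, breaks = 20)         # histogram; breaks plays the analogous role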

Finally, I asked a question that every statistician is asked by non-stat people: “How can public opinion polls be accurate if they are based on only a small fraction of the population?” The response (not shown here) notes some standard practices such as stratified sampling, but basically begs the question.

Say for example 1200 people are sampled, as is common, and we stratify by race. That would mean we design the sample to include about 160 Black people. But anyone who wondered about the small sample overall would be even more puzzled as to why 160 African-Americans is “representative.”

So in this case, ChatGPT would give a very misleading answer to an important, common question.

And we see that machines can fail Statistics, just like college students. 🙂

Use of Differential Privacy in the US Census–All for Nothing?

The field of data privacy has long been of broad interest. In a medical database, for instance, how can administrators enable statistical analysis by medical researchers, while at the same time protecting the privacy of individual patients? Over the years, many methods have been proposed and used. I’ve done some work in the area myself.

But in 2006, an approach known as differential privacy (DP) was proposed by a group of prominent cryptography researchers. With its catchy name and theoretical underpinnings, DP immediately attracted lots of attention. Since it is more mathematical than many other statistical disclosure control methods, and thus good fodder for theoretical research, it quickly led to a flurry of papers showing how to apply DP in various settings.
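For readers new to DP, the core idea is to add carefully calibrated random noise to released statistics. A minimal sketch of the classic Laplace mechanism (my own toy illustration, not the Census Bureau’s actual implementation):

rlaplace <- function(n, scale) {   # Laplace noise via inverse-CDF sampling
   u <- runif(n, -0.5, 0.5)
   -scale * sign(u) * log(1 - 2 * abs(u))
}
dpCount <- function(trueCount, epsilon, sensitivity = 1)   # noisy release of a count
   trueCount + rlaplace(1, scale = sensitivity / epsilon)
dpCount(160, epsilon = 0.5)   # a noisy version of a tabulated count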

DP was also adopted by some firms in industry, notably Apple. But what really gave DP a boost was the decision by the US Census Bureau to use DP for their publicly available data, beginning with the most recent census, 2020. On the other hand, that really intensified the opposition to DP. I have my own concerns about the method.

The Bureau, though, had what it considered a compelling reason to abandon their existing privacy methods: Their extensive computer simulations showed that current methods were vulnerable to attack, in such a manner as to exactly reconstruct large portions of the “private” version of the census database. This of course must be avoided at all costs, and DP was implemented.

But now…it turns out that the Bureau’s reconstruction claim was incorrect, according to a recent paper by Krishna Muralidhar, who writes,

“This study shows that there are a practically infinite number of possible reconstructions, and each reconstruction leads to assigning a different identity to the respondents in the reconstructed data. The results reported by the Census Bureau researchers are based on just one of these infinite possible reconstructions and is easily refuted by an alternate reconstruction.”

This is one of the most startling statements I’ve seen in my many years in academia. It would appear that the Bureau committed a “rush to judgment” on a massive scale, just mind-boggling, and in addition–much less momentous but still very concerning–gave its imprimatur to methodology that many believe has serious flaws.

Base-R and Tidyverse Code, Side-by-Side

I have a new short writeup, showing common R design patterns, implemented side-by-side in base-R and Tidy.

As readers of this blog know, I strongly believe that Tidy is a poor tool for teaching R learners who have no coding background. Relative to learning in a base-R environment, learners using Tidy take longer to become proficient, and once proficient, find that they are only equipped to work in a very narrow range of operations. As a result, we see a flurry of online questions from Tidy users asking “How do I do such-and-such,” when a base-R solution would be simple and straightforward.

I believe the examples here illustrate that base-R solutions tend to be simpler, and thus that base-R is a better vehicle for R learners. However, another use of this document would be as a tutorial for base-R users who want to learn Tidy, and vice versa.
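To give the flavor of the writeup (one illustrative pair of my own, not copied from it), here is the same row/column selection in both dialects:

# base-R
mtcars[mtcars$cyl == 6, c("mpg", "hp")]

# Tidy
library(dplyr)
mtcars %>% filter(cyl == 6) %>% select(mpg, hp)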

A New Approach to Fairness in Machine Learning

During the last year or so, I’ve been quite interested in the issue of fairness in machine learning. This area is especially personal for me, as it lies at the confluence of several of my interests:

  • My lifelong activity in probability theory, math stat and stat methodology (in which I include ML).
  • My lifelong activism aimed at achieving social justice.
  • My extensive service as an expert witness in litigation involving discrimination (including a landmark age discrimination case, Reid v. Google).

(Further details in my bio.) I hope I will be able to make valued contributions.

My first of two papers in the Fair ML area is now on arXiv. The second should be ready in a couple of weeks.

The present paper, with my former student Wenxi Zhang, is titled, A Novel Regularization Approach to Fair ML. It’s applicable to linear models, random forests and k-NN, and could be adapted to other ML models.

Wenxi and I have a ready-to-use R package for the method, EDFfair. It uses my qeML machine learning library. Both are on GitHub for now, but will go onto CRAN in the next few weeks.

Please try the package out on your favorite fair ML datasets. Feedback, both on the method and the software, would be greatly appreciated.

Base-R Is Alive and Well

As many readers of this blog know, I strongly believe that R learners should be taught base-R, not the tidyverse. Eventually the students may settle on using a mix of the two paradigms, but at the learning stage they will benefit from the fact that base-R is simpler and more powerful. I’ve written my thoughts in a detailed essay.

One of the most powerful tools in base-R is tapply(), a real workhorse. I give several examples in my essay in which it is much simpler and easier to use that function than the tidyverse alternatives.
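A typical one-liner of the kind I mean (my example here, not one lifted from the essay): mean mpg by cylinder count in mtcars.

tapply(mtcars$mpg, mtcars$cyl, mean)
#        4        6        8
# 26.66364 19.74286 15.10000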

Yet somehow there is a disdain for tapply() among many who use and teach Tidy. To them, the function is the epitome of “what’s wrong with” base-R. The latest example of this attitude arose on Twitter a few days ago, in which two Tidy supporters were mocking tapply(), treating it as a highly niche function with no value in ordinary daily usage of R. They strongly disagreed with my “workhorse” claim, until I showed them that in the code of ggplot2, Hadley has 7 calls to tapply().

So I did a little investigation of well-known R packages by RStudio and others. The results, which I’ve added as a new section in my essay, are excerpted below.

——————————–

All the breathless claims that Tidy is more modern and clearer, while base-R is old-fashioned and unclear, fly in the face of the fact that RStudio developers, and authors of other prominent R packages, tend to write in base-R, not Tidy. And all of them use at least some base-R instead of the corresponding Tidy constructs.

package        *apply() calls   mutate() calls
brms                      333                0
broom                      38               58
datapasta                  31                0
forecast                   82                0
future                     71                0
ggplot2                    78                0
glmnet                     92                0
gt                        112               87
knitr                      73                0
naniar                      3               44
parsnip                    45               33
purrr                      10                0
rmarkdown                   0                0
RSQLite                    14                0
tensorflow                 32                0
tidymodels                  8                0
tidytext                    5                6
tsibble                     8               19
VIM                       117               19

These are striking numbers for those who learned R via a tidyverse course. In particular, mutate() is one of the very first verbs one learns in a Tidy course, yet mutate() is used 0 times in most of the above packages. And even the packages in which this function is called a lot also contain plenty of calls to the base-R *apply() functions, which Tidy is supposed to replace.
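For the curious, here is a rough sketch of how such tallies might be made (my own guess at a procedure, not necessarily the one used for the table above):

countCalls <- function(pkgDir) {   # pkgDir: path to an unpacked package source tree
   files <- list.files(file.path(pkgDir, "R"), pattern = "\\.R$", full.names = TRUE)
   code <- unlist(lapply(files, readLines, warn = FALSE))
   countPat <- function(pat) sum(lengths(regmatches(code, gregexpr(pat, code))))
   c(apply = countPat("[a-z]*apply\\("), mutate = countPat("mutate\\("))
}
# countCalls("ggplot2")   # hypothetical path to the ggplot2 source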

Now, why do these prominent R developers often use base-R, rather than the allegedly “modern and clearer” Tidy? Because base-R is easier.

And if it’s easier for them, it’s easier still for R learners. In fact, an article discussed later in this essay, aggressively promoting Tidy, actually accuses students who use base-R instead of Tidy of taking the easy way out. Easier, indeed!

Comments on the New R OOP System, R7

Object-Oriented Programming (OOP) is more than just a programming style; it’s a philosophy. R has offered various forms of OOP, starting with S3, then (among others) S4, reference classes, and R6, and now R7. The latter has been under development by a team broadly drawn from the R community leadership, not only the “directors” of R development, the R Core Team, but also the prominent R services firm RStudio and so on.

I’ll start this report with a summary, followed by details (definition of OOP, my “safety” concerns etc.). The reader need not have an OOP background for this material; an overview will be given here (though I dare say some readers who have this background may learn something too).

This will not be a tutorial on how to use R7, nor an evaluation of its specific features. Instead, I’ll first discuss the goals of the S3 and S4 OOP systems, which R7 replaces, especially in terms of whether OOP is the best way to meet those goals. These comments then apply to R7 as well.

SUMMARY

Simply put, R7 does a very nice job of implementing something I’ve never liked very much. I do like two of the main OOP tenets, encapsulation and polymorphism, but S3 offers those and is good enough for me. And though I agree in principle with another point of OOP, “safety,” I fear that it often results in a net LOSS of safety. R7 does a good job of combining S3 and S4 (3 + 4 = 7), but my concerns about complexity and a net loss of safety regarding S4 remain in full.

OOP OVERVIEW

The first OOP language in wide use was C++, an extension of C that was originally called C with Classes. The first widely used language designed to be OOP “from the ground up” was Python. R’s OOP offerings have been limited.

Encapsulation:

This simply means organizing several related variables into one convenient package. R’s list structure has always done that. S3 classes then tack on a class name as an attribute.
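For instance (a toy example of my own, anticipating the ‘pets’ class used below):

# an S3 object is just a list carrying a class attribute
norm <- list(name = "Norm", birthdate = as.Date("1990-06-15"))
class(norm) <- "pets"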

Polymorphism:

The term here, meaning “many forms,” simply means that the same function will take different actions when it is applied to different kinds of objects.

For example, consider a sorting operation. We would like this function to do a numeric sort if it is applied to a vector of (real) numbers, but do an alphabetical sort on character vectors. Meanwhile, we would like to use the same function name, say ‘sort’.

S3 accomplishes this via generic functions. Even beginning R users have likely made calls to generic functions without knowing it. For instance, consider the (seemingly) ordinary plot() function. Say we call this function on a vector x; a graph of x will be displayed. But if we call lm() on some data, then call plot() on the output lmout, R will display some graphs depicting that output:

mtc <- mtcars
plot(mtc$mpg) # plots mpg[i] against i
lmout <- lm(mpg ~ ., data = mtc)
plot(lmout)  # plots several graphs, e.g. residuals

The “magic” behind this is dispatch. The R interpreter will route a nominal call to plot() to a class-specific function. In the lm() example, for instance, lm() returns an S3 object of class ‘lm’, so the call plot(lmout) will actually be passed on to another function, plot.lm().

Other well-known generics are print(), summary(), predict() and coef().
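To see dispatch in a nutshell, here is a toy class of my own (the name ‘pets’ is hypothetical, echoing the example in the Inheritance section below):

# defining print.pets() means calls to the generic print() will dispatch to it
print.pets <- function(x, ...) cat("a pet named", x$name, "\n")
p <- structure(list(name = "Norm"), class = "pets")
print(p)   # dispatches to print.pets()
p          # auto-printing at the prompt dispatches too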

Note that the fact that R and Python are not strongly-typed languages made polymorphism easy to implement. C++, on the other hand, is strongly typed, and the programmer will likely need to use templates, which can be very painful.

By the way, I always tell beginning and intermediate R users that a good way to learn about functions written by others (including in R itself) is to run the function through R’s debug() function. In our case here, they may find it instructive to run debug(plot) and then plot(lmout) to see dispatch up close.

Inheritance:

Say the domain is pets. We might have dogs named Norm, Frank and Hadley, cats named JJ, Joe, Yihui and Susan, and more anticipated in the future.

To keep track of them, we might construct a class ‘pets’, with fields for name and birthdate. But we could then also construct subclasses ‘dogs’ and ‘cats’. Each subclass would have all the fields of the top class, plus others specific to dogs or cats. We might then also construct a sub-subclass, ‘gender.’
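In S3, inheritance amounts to a vector of class names, searched left to right at dispatch time. A toy sketch (field names are my own invention):

# 'dogs' inherits from 'pets'; dispatch tries dogs methods first, then pets
print.pets <- function(x, ...) cat("a pet named", x$name, "\n")
frank <- structure(list(name = "Frank", birthdate = as.Date("2019-03-01")),
                   class = c("dogs", "pets"))
print(frank)   # no print.dogs() exists, so the pets method is used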

“Safety”:

Say you have a function intended to take two numeric arguments, returning TRUE if the first is less than the second:

f <- function(x,y) x < y

But you accidentally call the function with two character strings as arguments. This should produce an error, but won’t; R will simply compare the strings alphabetically.

In a mission-critical setting, this could be costly. If the app processes incoming sales orders, say, there would be downtime while restarting the app, possibly lost orders, etc.

If you are worried about this, you could add error-checking code, e.g.

f <- function(x,y) {
   if (!is.numeric(x) || !is.numeric(y))
      stop('non-numeric arguments')
   x < y
}

More sophisticated OOP systems such as S4 can catch such errors for you. There is no free lunch, though: the machinery to even set up your function becomes more complex, and you still have to tell S4 that x and y above must be numeric. But arguably S4 is cleaner-looking than having a stop() call etc.
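A minimal S4 counterpart of the check above (my own sketch): the argument types are declared in the method signature, so a call with character strings fails at dispatch instead of silently succeeding.

setGeneric("lessThan", function(x, y) standardGeneric("lessThan"))
setMethod("lessThan", signature(x = "numeric", y = "numeric"),
          function(x, y) x < y)
lessThan(3, 5)         # TRUE
# lessThan("a", "b")   # error: unable to find an inherited method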

Consider another type of calamity: As noted, S3 objects are R lists. Say one of the list elements has the name partNumber, but later, in assigning a new value to that element, you misspell it as partnumber:

myS3object$partnumber <- 8888   # meant partNumber; R silently creates a new element

Here we would seem to have no function within which to check for misspelling etc. Thus S4 or some other “safe” OOP system would seem to be a must–unless we create functions to read or write elements of our object. And it turns out that that is exactly what OOP people advocate anyway (e.g. even in S4 etc.), in the form of getters and setters.

For instance, say we have a class ‘Orders’, one of whose fields is partNumber. In S3, the getter would just be an ordinary function, say get_partNumber(), and for a particular sales order thisOrder, one would fetch the part number via

get_partNumber(thisOrder)

rather than the more direct way of accessing an R list:

pn <- thisOrder$partNumber

The reader may think it’s silly to write special functions for list read and write, and many would agree. But the OOP philosophy is that we don’t touch objects directly, and instead have functions to act as intermediaries. At any rate, we could place our error-checking code in the getters and setters. (Although there still would be no way under S3 to prevent direct access.)
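In the S3 world, such getters and setters are just ordinary functions, and the error-checking lives inside them. A hypothetical sketch for the ‘Orders’ class (the character-string check is my own assumption about part numbers):

get_partNumber <- function(order) order$partNumber
set_partNumber <- function(order, value) {
   if (!is.character(value) || length(value) != 1)
      stop('part number must be a single character string')
   order$partNumber <- value
   order   # the modified copy must be returned and reassigned by the caller
}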

ANALYSIS

I use OOP rather sparingly in R: S3 in my own code, and S4, reference classes or R6 when needed for a package that I obtain (e.g. EBImage for S4). In Python, I use OOP again for library access, e.g. threading, and to some degree just for fun, as I like Python’s class structure.

But mostly, I have never been a fan of OOP. In particular, I never have been impressed by the “safety” argument. Here’s why:

Safety vs. Complexity

Of course, OOP does not do anything to prevent code logic errors, which are far more prevalent than, say, misspellings. And, most important:

  • There is a direct relation between safety and code complexity.
  • There is a direct relation between code logic errors and code complexity.

One of my favorite R people is John Chambers, “Father of the S Language” and thus the “Grandfather of R.” In his book, Software for Data Analysis, p.335, he warns that “Defining [an S4] class is a more serious piece of programming …than in previous chapters…[even though] the number of lines is not large…” He warns that things are even more difficult for the user of a class than it was for the author in designing it, with “advance contemplation” of what problems users may encounter. And, “You may want to try several different versions [of the class] before committing to one.”

In other words, safety in terms of misspellings etc. comes at possibly major expense in logic errors. There is no avoiding this.

There Are Other Ways to Achieve Safety:

As noted above, we do have alternatives to OOP in this regard, in the form of inserting our own error-checking code. (Note too that error-checking may be important in the middle of your code, using stopifnot().) Indeed, this can be superior to using OOP, as one has much more flexibility, allowing for more sophisticated checks.
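For example, a couple of mid-computation checks (illustrative values only):

x <- c(5.1, 3.3, 7.8); y <- c(2.0, 6.4, 1.5)
stopifnot(is.numeric(x), length(x) == length(y), !anyNA(x))   # silent if all checks pass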

Why the Push for R7 Now?

Very few of the most prominent developers of R packages use S4 as of now. One must conclude that there is no general sense of urgency about safety, and/or that authors find safety more easily and effectively achieved through alternative means, per the above discussion.

As to encapsulation and inheritance, S3 already does a reasonably good job there. Why, then, push for R7?

The impetus seems to be a desire to modernize/professionalize R, moving closer to status as a general-purpose language. Arguably, OOP had been a weak point of R in that sense, and now R can hold its head high in the community of languages.

That’s great, but as usual I am concerned about the impact on the teaching of R to learners without prior programming experience. I’ve been a major critic of the tidyverse in that regard, as Tidy emphasizes “modern” functional programming/loop avoidance to students who barely know what a function is. Will R beginners be taught R7? That would be a big mistake, and I hope those who tend to be enthralled with The New, New Thing resist such a temptation.

Me, well as mentioned, I’m not much of an OOP fan, and don’t anticipate using R7. But the development team has done a bang-up job in creating R7, and for those who feel the need for a strong OOP paradigm, I strongly recommend it.

A Major Contribution to Learning R

Prominent statistician Frank Harrell has come out with a radically new R tutorial, rflow. The name is short for “R workflow,” but I call it “R in a box” –everything one needs for beginning serious usage of R, starting from little or no background.

By serious usage I mean real applications in which the user has a substantial computational need. This could be a grad student researcher, a person who needs to write data reports for her job, or simply a person who is doing personal analysis such as stock picking.

Like other tutorials/books, rflow covers data manipulation, generation of tables and graphics, etc. But UNLIKE many others, rflow empowers the user to handle general issues as they inevitably pop up, as opposed to just teaching a few basic, largely ungeneralizable operations. I’ve criticized the tidyverse in particular for that latter problem, but really no tutorial, including my own, has this key “R in a box” quality.

The tutorial is arranged into 19 short “chapters,” beginning with R Basics, all the way through such advanced topics as Manipulating Longitudinal Data and Parallel Computing. The exciting new Quarto presentation tool by RStudio is featured, as is the data.table package, essential for practical use of large datasets.

Note carefully that this tutorial is the product of Frank’s long experience “in the trenches,” conducting intensive data analysis in biomedical applications. (This specific field of application is irrelevant; rflow is just as useful to, say, marketing analysts, as it is for medicine.) His famous monograph, Regression Modeling Strategies, is a standard reference in the field. Even I, as the author of my own regression book, often find myself checking out what Frank has to say in his book about various topics.

This point about rflow arising from Frank’s long experience dealing with real data is absolutely key, in my view. And his choice of topics, and especially their ordering, reflects that. For instance, he brings in the topic of missing data early in the tutorial.

Anyone who teaches R, or is learning R, should check out rflow.

Greatly Revised Edition of Tidyverse Skeptic

As a longtime R user and someone with a passionate interest in how people learn, I continue to be greatly concerned about the use of the Tidyverse in teaching noncoder learners of R. Accordingly, I have now thoroughly revised my Tidyverse Skeptic essay. It is greatly reorganized, with a focus on teaching R, a number of new examples, and some material on the historical context of the rise of Tidy. On the one hand, I continue to thank RStudio for its overall contribution to the R community; on the other, I believe that using Tidy for teaching beginners is actually an obstacle to learning for that group.

I close the essay by first noting that RStudio is now a Public Interest Corporation, thus with much broader public responsibility. I then renew a request I made to RStudio founder/CEO JJ Allaire when he met with me in 2019: “Please encourage R instructors to use a mixture of Tidy and base-R in their teaching.”

Please read the revised essay at the above link. Its Overview section is reproduced below.

  • Again, my focus here is on teaching R to those with little or no coding background. I am not discussing teaching Computer Science students.
  • Tidy was consciously designed to equip learners with just a small set of R tools. The students learn a few dplyr verbs well, but that prepares them to do much less with R than a standard R beginners course would teach, leaving them less equipped to put R to real use than “graduates” of standard base-R courses.
  • Thus the “testimonials” in which Tidy teachers of R claim great success are misleading. The “success” is due to watering down the material (and false conflation with ggplot2). The students learn to mimic a few example patterns, but are not equipped to go further.
  • The refusal to teach ‘$’, and the de-emphasis of, or even complete lack of coverage of, R vectors, is a major handicap for Tidy “graduates” in making use of most of R’s statistical functions and statistical packages.
  • Tidy is too abstract for beginners, due to the philosophy of functional programming (FP). The latter is popular with many sophisticated computer scientists, but is difficult even for computer science students. Tidy is thus unsuited as the initial basis of instruction for nonprogrammer students of R. FP should be limited and brought in gradually. The same statement applies to base-R’s own FP functions.
  • The FP philosophy replaces straightforward loops with abstract use of functions. Since functions are the most difficult aspect for noncoder R learners, FP is clearly not the right path for such learners. Indeed, even many Tidy advocates concede that it is in various senses often more difficult to write Tidy code than base-R. Hadley says, for instance, “it may take a while to wrap your head around [FP].”
  • A major problem with Tidy for R beginners is cognitive overload: The basic operations contain myriad variants. Though of course one need not learn them all, one needs some variants even for simple operations, e.g. pipes on functions of more than one argument.
  • The obsession among many Tidyers that one must avoid writing loops, the ‘$’ operator, brackets and so on often results in obfuscated code. Once one goes beyond the simple mutate/select/filter/summarize level, Tidy programming can be of low readability.
  • Tidy advocates also concede that debugging Tidy code is difficult, especially in the case of pipes. Yet noncoder learners are the ones who make the most mistakes, so it makes no sense to have them use a coding style that makes it difficult to track down their errors.
  • Note once again, that in discussing teaching, I am taking the target audience here to be nonprogrammers who wish to use R for data analysis. Eventually, they may wish to make use of FP, but at the crucial beginning stage, keep it simple, little or no fancy stuff.