New Statistics Tutorial

I’ve recently completed fastStat, https://github.com/matloff/fastStat,a quick introduction to statistics for those who’ve had a calculus-based probability course. Many such people later need to do statistics, and this will give them quick access. It is modeled after my R tutorial, https://github.com/matloff/fasteR, a quick introduction to R.

It is not just a quick introduction, but a REAL one, a practical one. Even those who do already know statistics will find that they learn from this tutorial.

I write at the top of the tutorial,

..many people know the mechanics of statistics very well, without truly understanding an intuitive levels what those equations are really doing, and this is our focus…For example, consider estimator bias. Students in a math stat course learn the mathematical definition of bias, after which they learn that the sample mean is unbiased and that the sample variance can be adjusted to be unbiased. But that is the last they hear of the issue…..most estimators are in fact biased, and lack ‘fixes’ like that of the sample variance. Does it matter? None of that is discussed in textbooks and courses.

The tutorial begins with sampling, with a realistic view of parametric models, estimation, standard errors, statistical inference, the Bias-Variance Tradeoff, and multivariate distributions.

It then moves to a major section on prediction, using both classical parametric and machine learning methods. Emphasis is again on the Bias-Variance Tradeoff, with a view toward overfitting. A fresh view of the latter is presented.

Finally, there is an overview of data privacy methods, of major importance today.

Take a look! Comments welcome.

Advertisement

Just How Good Is ChatGPT in Data Science?

Many of you may have heard of ChatGPT, a dazzling new AI tool. We are hearing lots of gushing praise for the tool. Well, how well does it do in data science contexts? I tried a few queries here, and found interesting results.

I first requested, “Write an R function that returns every other element of a vector x, starting with the third.” I won’t show the code here, but suffice it to say that it worked! It did give me correct code.

Next, “In R’s built-in dataset CO2, fit a linear regression with ‘uptake’ as the outcome variable, and find the standard error of the intercept term.” Alas, it begged ignorance: “…I do not have the ability to perform calculations or access specific datasets…” Not really true; unfortunately, ChatGPT was not able to pick up the key phrase, “built-in.”

I then gave it what turned out to be an easier question, ” In the output of R’s ‘lm’ function, how do I find the standard error of the intercept term?” It returned a correct and useful answer, stating the ‘lm’ returns lots of material, “…including the standard errors of the model coefficients. The standard error of the intercept term can be accessed by calling the summary function on the model object and looking at the output…”

Next: “In R, what are the advantages of using the density() function in iieu of hist()?” The reply had some correct ingredients, but had problems:

“…the resulting plot [from density()] is smoother and more continuous, which can make it easier to see the underlying distribution of the data. This can be especially useful when the data is noisy or when the number of data points is small…The density function allows you to specify the bandwidth, or the smoothness, of the curve, which can be useful for highlighting different features of the data. In contrast, the hist function does not allow you to control the smoothness of the resulting plot.”

Putting aside the fact that the phrase “more continuous” is nonsense, there are other issues. It is not necessarily true that density() will do better on small/noisy data. And the app misses the point that hist() does have an analog of the bandwidth, namely the bin width.

Finally, I asked a question that every statistician is asked by non-stat people: “How can public opinion polls be accurate if they are based on only a small fraction of the population?” The response (not shown here) notes some standard practices such as stratified sampling, but basically begs the question.

Say for example 1200 people are sampled, as is common, and we stratify by race, That would mean we design the sample to include about 160 Black people. But anyone who wondered about the small sample overall would be even more puzzled as to why 160 African-Americans is “representative.”

So in this case, ChatGPT would give a very misleading answer to an important, common question.

And we see that machines can fail Statistics, just like college students. 🙂