Any Forward Progress on p-Values?

Statisticians have long known that the use of p-values has major problems. Some of us have long called for reform, weaning the profession away from these troubling beasts. At one point, I was pleased to see Frank Harrell suggest that R should stop computing them.

That is not going to happen, but last year the ASA shocked many people by producing a manifesto stating that p-values are often wrongly used, and in any case used too much. Though the ASA statement did not go so far as to recommend not using this methodology at all, to me it came pretty close. Most significant to me (pardon the pun) was the fact that, though the ASA report said p-values are appropriate in some situations, they did not state any.examples. I wrote blog posts on this topic here, here and here. I noted too that the ASA report had even made news in Bloomberg Businessweek.

But not much seems to have changed in the professions since then, as was shown rather dramatically last Saturday. The occasion was iidata 2017, a student-run conference on data science at UC Davis. This was the second year the conference has been held, very professionally managed and fun to attend. There were data analysis competitions, talks by people from industry, and several workshops on R, including one on parallel computing in R, by UCD Stat PhD student Clark Fitzgerald.

Now, here is how this connects to the topic of p-values. I was a judge on one of the data analysis competitions, and my fellow judges and I were pretty shocked by the first team to present. The team members were clearly bright students, and they gave a very polished, professional talk. Indeed, we awarded them First Prize. However…

The team presented the p-values for various tests, not mentioning any problems regarding the large sample size, 20,000. During the judges’ question period, we asked them to comment on the effect of sample size, but they still missed the point, saying that n = 20,000 is enough to ensure that each cell in their chi-squared test would have enough observations! After quite a lot of prodding, one of them finally said there may be an issue of practical significance vs. statistical significance.

Once again, we cannot blame these bright, energetic students. They were simply doing what they had been taught to do — or, I should say, NOT doing what they had NOT been taught to do, which is to view p-values with care, especially with large n. The blame should instead be placed on the statistics faculty who taught them. The dangers of p-values should have been constantly drilled into them in their coursework, to the point at which a dataset with n = 20,000 should have been a red flag to them.

On this point, I’d like to call readers’ attention to the ASA Symposium on Statistical Inference, to be held in Bethesda, MD on October 11-13, 2017. Under the leadership of ASA Executive Director Ron Wasserstein, we are in the process of putting together what promises to be a highly stimulating and relevant program, with prominent speakers, and most important, lots of interactive participation among attendees. Hopefully much of the discussion will address ways to get serious coverage of the issue into statistics curricula, rather than the present situation, in which stat instructors briefly make a one-time comment to students about practical significance vs. statistical significance.

Threading in R?

I was pleased to see today’s post, “(A Very) Experimental Threading in R,” by Lukasz Bartnik, as this is a long-standing interest of mine. My own effort in this direction has been my package Rdsm.

The notion of threading, for those who may not have this background, refers to several instances of a program, in this case, several instances of R, sharing global variables but otherwise running independently. As Bartnik points out, this can make I/O programming easier and clearer; see my Python tutorial, Chapter 4, for a network sockets example, in which the code must deal with situations in which data may come from one of many sockets, but without foreknowledge of which socket will be next. By having a separate process devoted to each socket, but storing the incoming data in a shared variable, the problem is neatly solved (and more conveniently than using nonblocking I/O).

R examples of various sorts are given in the Rdsm package. And as Bartnik also points out, perhaps the most common situation where his or my package might be used is running an R “background job.”

In terms of speedups through parallelization, the results are mixed. See my book on parallel computation for data science. Bartnik’s package conceivably could have lower overhead, making parallelization speedup from threaded R more feasible.

For me, a major piece of unfinished business regarding Rdsm is the use of backing storage, i.e. storing the shared variables on disk. This could help with problems having large memory needs, and may be useful for distributed computation. Rdsm runs on top of bigmemory, which does allow use of backing storage. However, this seems to require a file system that immediately propagates file changes made by one process to visibility by other processes, which didn’t seem to work on ordinary Linux systems, for instance.

As to truly threaded R, my understanding (could need an update) is that the R Core team has vowed “Never!” Too many technical issues.

Using CART: Implementation Matters

In preparing the following example for my forthcoming book, I was startled by how different two implementations of Classification and Regression Trees (CART) performed on a particular data set. Here is what happened:

For my example, I first used the Vertebral Column data set from the UCI Machine Learning Repository. The task is to classify patients into one of three vertebral categories. There are 310 observations, with only 6 predictor variables, so there should be no problem of overfitting. Using a straight logistic model, I achieved about 88% accuracy.

I then tried CART, using the rpart package. (Note, throughout the book, I try to stick to default values of the arguments.) Here is the code:

> vert <- read.table('column_3C.dat',header=FALSE)
> library(rpart)
> rpvert <- rpart(V7 ~ .,data=vert,method='class')
> rpypred <- predict(rpvert,type='class')
> mean(rpypred == vert$V7)
[1] 0.883871

OK, very similar.

Then I tried the well-known Letters Recognition data set from UCI. This too is a classification problem, one class for each of the capital letters ‘A’ through ‘Z’, with 20,000 observations. (The letters are represented about equally frequently in this data, so the priors are ‘wrong’, but that is not the issue here.) There are 16 predictor variables. I got about 84% accuracy, again using logit (All vs. All).

However, rpart did poorly:

> rplr <- rpart(lettr ~ .,data=lr,method='class')
> rpypred <- predict(rplr,type='class')
> mean(rpypred == lr$lettr)
[1] 0.4799

Of course, potential deficiences in CART led the original developers of CART to the notion of random forests, so I gave that a try.

> rflr <- randomForest(lettr ~ .,data=lr)
> rfypred <- predict(rflr)
> mean(rfypred == lr$lettr)
[1] 0.96875

Really? Can there be that vast a difference between CART and random forests? And by the way, I got about 91% accuracy with k-Nearest Neighbors (implemented in the knnest function from my regtools package on CRAN).

I speculated that the cause might be that the response variables here (26 of them) are non-monotonically related to the predictors. I put that theory to a couple of the originators of CART, Richard Olshen and Chuck Stone, but they didn’t seem to think it is an issue. But while it is true that nonmonotonicity should eventually be handled by predictors being split multiple times, I still suspect it could be the culprit, say due to the tree-building process stopping too early.

On that hunch, I tried another implementation of CART, ctree from the partykit package, which uses quite different splitting and stopping rules. This was considerably better:

> library(partykit)
> ctout <- ctree(lettr ~ .,data=lr)
> ctpred <- predict(ctout,lr)
> mean(ctpred == lr$lettr)
[1] 0.8552

Hmm…Not sure what to make of this, but it certainly is thought-provoking.

By the way, partykit includes a random forest implementation as well, but it is slow and can be a memory hog. The authors still consider it experimental.

My regtools Package Is Now on CRAN

In my posts to this blog (less frequent than I would like, hopefully more frequent in the future), I’ve often mentioned my R package regtools, which contains a number of functions useful for regression and classification. None of them duplicate what is available in the excellent packages on CRAN, so I will dare characterize regtools as innovative. 🙂 In spite of the smiley emoticon in that last sentence, I hope you feel the same way.

You can download the package from CRAN, and yes, Joe R., it includes a vignette. 🙂 It begins with an example of one of the main features of the package, the use of nonparametric regression estimation to assess the fit of parametric regression models. I think/hope you’ll find it interesting.

Manny Parzen Used R!

Prof. Manny Parzen, a pioneer of modern statistics, passed away in February, aged 87. I should have commented back then, but it’s still worth saying something today. I happened to be thinking of him this morning.

I did not know Manny personally. This makes it odd that I refer to him by his first name, but I do so in the same spirit that people who are fans of Bernie Sanders but don’t know him personally, including myself, refer to him as Bernie. This right away tells you something about Manny.

It also says something about Manny that he stayed active long after retirement age — and that he was actually using R! That is remarkable for a theoretician of his era, and indeed maybe even for the present era.

I discovered this at the 43rd Symposium on the Interface of
Computing Science and Statistics
in 2012, where there was a special session honoring him. A couple of his former PhD students spoke, including my old professor, Don Ylvisaker, and Manny himself gave a talk.

Manny was pretty frail at the time, using a walker, yet mentally sharp as ever, and as I mentioned, still quite active in research. In fact, the young man sitting next to me turned out to be a current PhD student working under Manny.

At any rate, what jolted me was that in the midst of his talk, on the subtle theoretical properties of histograms and the like, he suddenly started talking about R! Specifically, he brought in R’s quantile() function, and went into a comparison of the 9 different options quantile() offers the user for defining quantiles. This really impressed me, and illustrates how much R has become a central tool.

By the way, after the talk I went up and introduced myself, and told him a joke that he had told many years ago when he gave a talk in the UC Davis Stat Dept. I won’t waste bandwidth here by going into the details of the joke, but it concerned students who guess randomly on exams. He laughed heartily at his own joke, and said, “It’s still true!”

StatET IDE for R

I personally do not use Integrated Development environments (IDEs) for R, or for that matter for any programming language. From my point of view, they take up too much precious real estate on the screen, and most important, they generally do not allow me to use my own text editor and my own abbreviations and macros. Since I want to have a uniform editing environment no matter whether I am programming, typing e-mail or whatever, IDEs are not for me. I just use Vim, with my own abbreviation collection.

So for example, I’ve never used RStudio (gasp!). I know many people who swear by it (and maybe a few who swear at it 🙂 ), but it’s just not for me. ESS (Emacs Speaks Statistics) is great for the hard core R people, and in fact I occasionally use it — I have an Emacs tutorial, by the way — but not too much.

If you do want to use an IDE for R, and want something different, I’ve always admired (from afar) StatET, for which there is a nice introduction in progress. It does take some configuration, and Eclipse requires some learning (I have a tutorial for it too), but StatET has always impressed me as very well-designed.

Disclaimer: I do not know whether the StatET article’s claims about RStudio are correct. My point is that the article is a nice introduction to StatET.

New Release of partools Package

My new release of partools is now on CRAN.

The package is aimed at doing parallel data science in what I call an “un-MapReduce” manner. It takes the point of view that MapReduce-based frameworks such as Hadoop and Spark are fine for the types of applications their designers had in mind, namely rather simple SQL actions, but have fundamental handicaps that prevent them from performing well on many, if not most, of the types of computation that typical users need for large data sets and/or highly compute-bound applications. The distributed file/object nature of those MapReduce systems is retained, but the confining MapReduce computational paradigm is avoided.

The package now contains about 30 functions, ranging from infrastructure support to summary and aggregation to statistical/machine learning applications. See the vignette for a fairly detailed introduction. Two new capabilities that I wish to highlight are:

  • Aggregation and related operations on objects of class “data.table”.
  • Parallel computation for some modern statistical/machine learning algorithms (they are statistics to me, but you may call them machine learning if you prefer).

The core of that second highlighted set of functions makes use of what I call Software Alchemy, which I have explained in previous blog posts. See for instance the example on random forests in the vignette.

Happy Paralleling. 🙂