A Partial Remedy to the Reproducibility Problem

Several years ago, John Ioannidis jolted the scientific establishment with an article titled “Why Most Published Research Findings Are False.” He had concerns about inattention to statistical power, multiple inference issues and so on. Many people were already aware of all this, of course, but that article opened the floodgates, and many more issues were brought up, such as hidden lab-to-lab variability. In addition, there is the occasional revelation of outright fraud.

Many consider the field to be at a crisis point.

At the 2014 JSM, Phil Stark organized a last-minute session on the issue, with speakers including Marcia McNutt, former editor of Science, and Yoav Benjamini, of multiple inference methodology fame. The session attracted a standing-room-only crowd.

In this post, Reed Davis and I are releasing the prototype of an R package that we are writing, revisit, with the goal of partially remedying the statistical and data wrangling aspects of this problem. It is assumed that the authors of a study have supplied (possibly via carrots or sticks) not only the data but also the complete code for their analyses, from data cleaning up through formal statistical analysis.

There are two main aspects:

  • The package allows the user to “replay” the authors’ analysis and, most importantly, to explore alternative analyses that the authors may have overlooked. The various alternative analyses may be saved for sharing. (A hypothetical session is sketched below.)
  • The package warns of statistical errors, such as overreliance on p-values, the need for multiple inference procedures, possible distortion due to outliers, and so on.
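To make that concrete, here is a rough sketch of what a session might look like. The function names below are from my recollection of the prototype’s README and should be treated as illustrative placeholders; the README itself is the authoritative reference.

library(revisit)
# hypothetical session, for illustration only; see the README for
# the actual interface
rvinit()                     # initialize the system
loadb('origanalysis.R')      # load the authors' original code (hypothetical file name)
runb(1,5)                    # replay, say, lines 1 through 5 of that code
# ... substitute an alternate analysis, e.g. a multiple-inference procedure ...
saveb(1,'used Bonferroni')   # save the alternate analysis as a new branch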

The term user here could refer to people in several different situations:

  • The various authors of a study, collaborating and trying different analyses during the course of the study.
  • Reviewers of a paper submitted for publication on the results of the study.
  • Fellow scientists who wish to delve further into the study after it is published.

The package has text and GUI versions. The latter is currently implemented as an RStudio add-in.

The package is on my GitHub site, and has a fairly extensive README file introducing the goals and usage.


Online But In-Class Examinations (with an R Example)

About a year and a half ago, some students and I wrote OMSI, Online Measurement of Student Insight, an online software tool to improve examinations for students and to save instructors much of the time and drudgery currently spent on administering exams. It is written in a mixture of Python and R. (Python because it was easier to find students for the project, R because it built upon an earlier system I had developed entirely in R.)

I will describe it below, but I wish to say at the outset: I NEED TESTERS! I’ve used it successfully in several classes of my own so far, but it would be great to get feedback from others. Code contribution would be great too.

From the project README file:

Students come to the classroom at the regular class time, just as with a traditional pencil-and-paper exam. However, they use their laptop computers to take the exam, using OMSI. The latter downloads the exam questions, and enables the students to upload their answers.

This benefits students. Again from the README:

  • With essay questions, you have a chance to edit your answers, producing more coherent, readable prose. No cross-outs, arrows, words squeezed in above a line, no points off for unreadable handwriting. 🙂
  • With coding questions, you can compile and run your code, giving you a chance to make corrections if your code doesn’t work.

In both of these aspects, OMSI gives you a better opportunity to demonstrate your insight into the course material, compared to the traditional exam format.

It is a great saver of time and effort for instructors, as the README says:

OMSI will make your life easier. 🙂

OMSI facilitates exam administration and grading. It has two components:

  • Exam administration. This manages the actual process of the students taking the exam. You get electronic copies of the students’ exams, eliminating the need to collect and carry around a large number of papers, and making it much easier to share work among multiple graders. As noted in “Benefits for students” above, OMSI enables the student to turn in a better product, and this benefits the instructor as well: better exam performance by students is both more gratifying to the instructor and makes for easier, less frustrating grading.
  • Exam grading. OMSI does NOT take the place of instructor judgment in assigning points to exam problems. But it does make things much easier, by automating much of the drudgery. For instance, OMSI automatically records grades assigned by the instructor, and automatically notifies students of their grades via e-mail. Gone are the days in which the instructor must alphabetize the papers, enter the grades by hand, carry an armload (or boxload) of papers to give back to students in class, retaining the stragglers not picked up by the students, and so on.

Here is an R example showing sample exam questions. At the server, the instructor would place the following file:


     
QUESTION -ext .R -run "Rscript omsi_answer1.R"

Write R code that prints out the mean of R's
built-in Nile dataset, starting with
observation 51 (year 1921).

QUESTION -ext .R -run "Rscript omsi_answer2.R"

Write an R function with call form g(nreps)
that will use simulation to find the approximate
value of E(|X - Y|) for independent N(0,1)
variables X and Y.  Here nreps is the number of
replications.  Make sure to include a call
print(g(10000)) in your answer, which will be
run by OMSI.

QUESTION -ext .R -run "Rscript omsi_answer3.R"

Suppose X ~ U(0,1). Write an R function with
call form g(t) which finds the density of X^2
at t, for t in (0,1).  Make sure to include
a call print(g(0.8)) in your answer, which
will be run by OMSI.

QUESTION

Suppose an article in a scientific journal states,
"The treatment and nontreatment means were 52.15
and 52.09, with a p-value of 0.02.  So there is
only a 2% chance that the treatment has no effect."
Comment on the propriety of that statement.

At the client side, the student would see this:

[screenshot of the OMSI client exam window]

After the student enters an answer and hits Save and Run, the student’s code would be run in a pop-up window, displaying the result. When the student hits Submit, the answer is uploaded to the instructor’s server. There is a separate directory at the server for each student, and the answer files are stored there.
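For concreteness, here is one possible student answer to the second question above, just a sketch of what a submission might look like:

g <- function(nreps) {
   # simulate nreps pairs of independent N(0,1) variables
   x <- rnorm(nreps)
   y <- rnorm(nreps)
   mean(abs(x - y))   # sample mean approximates E(|X - Y|)
}
print(g(10000))   # run by OMSI; result will be near 2/sqrt(pi), about 1.13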

Again, the autograder does NOT evaluate student answers; the instructor does this. But the autograder greatly facilitates the process. The basic idea is that the software will display on the screen, for each student and each exam problem, the student’s answer.  In the case of coding questions, the software will also run the code and display the result.  In each case, the instructor then inputs the number of points he/she wishes to assign.

The package is easy to install and use, from both the student and instructor point of view. See the README for details.

 

A Python-Like walk() Function for R

A really nice function available in Python is walk(), which recursively descends a directory tree, calling a user-supplied function in each directory within the tree. It might be used, say, to count the number of files, or to remove all small files, and so on. I had students in my undergraduate class write such a function in R for homework, and thought it might be interesting to present it here.

Among other things, readers who are not familiar with recursive function calls will learn about those here. I must add that even readers with that background may find this post rather subtle and requiring extra patience, but I believe you will find it worthwhile.

Let’s start with an example, in which we count the number of files with a given name:


# count the number of files named flname in the tree rooted at startdir
countinst <- function(startdir,flname) {
   walk(startdir,checkname,
      list(flname=flname,tot=0))
}

# called by walk() in each directory; increments the running total in
# arg whenever the target file name appears in that directory
checkname <- function(drname,filelist,arg) {
   if (arg$flname %in% filelist) 
      arg$tot <- arg$tot + 1
   arg
}

Say we try all this in a directory containing a subdirectory mydir, which consists of a file x and two subdirectories, d1 and d2. The latter in turn contains another file named x. We then make the call countinst('mydir','x'). As can be seen above, that call will in turn make the call


walk('mydir',checkname,list(flname='x',tot=0))

The walk() function, which I will present below, will start in mydir, and will call the user-specified function checkname() at every directory it encounters in the tree rooted at mydir, in this case mydir, d1 and d2.

At each such directory, walk() will pass to the user-specified function, in this case checkname(), first the current directory name, then a character vector reporting the names of all files (including directories) in that directory. In mydir, for instance, this vector will be c('d1','d2','x'). In addition, walk() will pass to checkname() an argument, formally named arg above, which will serve as a vehicle for accumulating running totals, in this case the total number of files of the given name.

So, let’s look at walk() itself:


walk <- function(currdir,f,arg) {
   # "leave trail of bread crumbs"
   savetop <- getwd()
   setwd(currdir)
   fls <- list.files()
   arg <- f(currdir,fls,arg)
   # subdirectories of this directory
   dirs <- list.dirs(recursive=FALSE)
   for (d in dirs) arg <- walk(d,f,arg)
   setwd(savetop) # go back to calling directory
   arg  
}    

The comments are hopefully self-explanatory, but the key point is that within walk(), there is another call to walk()! This is recursion.

Note how arg accumulates, as needed in this application. If on the other hand we wanted, say, simply to remove all files named ‘x’, we would just put in a dummy variable for arg, as in the sketch below. And though we didn’t need drname here, in some applications it would be useful.
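Here is a minimal sketch of that remove-files variant. The names removefl() and rmname() are introduced just for this illustration:

# remove every file named flname in the tree rooted at startdir;
# nothing needs to be accumulated, so arg is essentially a dummy
removefl <- function(startdir,flname) {
   walk(startdir,rmname,list(flname=flname))
}

rmname <- function(drname,filelist,arg) {
   # walk() has already done setwd() to this directory
   if (arg$flname %in% filelist) file.remove(arg$flname)
   arg
}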

For compute-intensive tasks, recursion is not very efficient in R, but it can be quite handy in certain settings.

If you would like to conveniently try the above example, here is some test code:


test <- function() {
   unlink('mydir',recursive=TRUE)
   dir.create('mydir')
   file.create('mydir/x')
   dir.create('mydir/d1')
   dir.create('mydir/d2')
   file.create('mydir/d2/x')
   print(countinst('mydir','x'))
}

 

Discrete Event Simulation in R (and, Why R Is Different)

I was pleased to see the announcement yesterday of simmer 3.6.1, a discrete-event simulation (DES) package for R. I’ve long had an interest in DES, and as I will explain below, implementing DES in R brings up interesting issues about R that transcend the field of DES. I had been planning to discuss them in the context of my own DES package for R, and the above announcement makes a good springboard for that.

First, what is DES? It is simulation of stochastic processes with discrete state space. Consider for instance the classic M/M/1 queue: exponential interarrival and service times, with a single server. The state space can be defined as the queue length, which is integer-valued and thus “discrete.” This contrasts to, say, simulating a weather system, where state such as temperature is continuous.

The key component of such a simulator, no matter which programming world view the software takes (see below), is the event list. At any simulated time t, the event list records all the events that are scheduled to happen after time t. In the M/M/1 queue example, for instance, at time 168.0 there might be a service completion scheduled for time 169.1, and an arrival scheduled for time 182.2.

The main loop of a DES program then consists of:

  • Remove the earliest event e in the event list; e will be represented as some kind of data structure, including event time, event type and so on.
  • Update the current simulated time to the time in e.
  • Simulate the execution of e.

The looping continues until the user-specified maximum simulation time is exceeded.

The M/M/1 case is simple (not to mention having a closed-form theoretical solution), but in complex systems the event list can be quite large, say thousands of events. In such cases, good performance means executing the first of the above three bulleted items efficiently. Classically, there has been much research on this (including theoretical models using renewal theory and the like). But what about doing this in R?

The simmer package handles this by…NOT doing it in R. 🙂 If one needs the performance, this is the obvious route to take. (This is not quite true in a certain sense, also explained below.)

I developed my DES package entirely in R, mainly because I intended it only as a proof of concept. It’s been around for a few years, first on my own Web page and more recently on GitHub. I wrote it for use by my students, and because periodically there have been questions on r-help along the lines of “Is there a DES package available for R?”

Most algorithms for handling event lists use some kind of priority queue, typically implemented as a binary tree. Since R lacks pointers, it is not easy to develop such a thing, much less do it efficiently. So I chose simply to implement the event list in DES as a straight R matrix, one row per event, maintained in sorted order. But then each insertion of a new event becomes a time-consuming operation, due to the need to keep the event list in ascending time order. Originally, I implemented the insertion search as a binary search.
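The insertion step might have looked something like the following sketch (not the package’s actual code; it assumes event times are in column 1). Note that findInterval() does its binary search at C level, so the search itself is cheap; the expensive part is the rbind(), which copies the entire matrix:

# insert event e (a row) into event list evnts, kept sorted by column 1
insevnt <- function(evnts,e) {
   i <- findInterval(e[1],evnts[,1])  # C-level binary search for position
   rbind(evnts[seq_len(i),,drop=FALSE],
         e,
         evnts[seq_len(nrow(evnts)-i)+i,,drop=FALSE])
}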

Later I realized that this was anti-R. I had known it would be slow, of course, but hadn’t thought much about alternatives. Then it occurred to me:

  • Just add new events at the tail end.
  • Don’t keep the event list in sorted order.
  • In the first bulleted step of the event loop above, simply find the earliest event by calling R’s which.min().

True, which.min() does an inefficient linear search. But it does it at C (not sea) level! Unless the event list is larger than any one I know of in practice, this will be a win.
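In code form, the resulting main loop looks something like this minimal sketch. The layout (an unsorted matrix evnts with event times in column 1) matches the discussion above, but handleEvent() is a hypothetical application-level handler, not part of the package:

# main loop: event list is an unsorted matrix, one row per pending event
while (nrow(evnts) > 0 && simtime < maxsimtime) {
   i <- which.min(evnts[,1])        # linear search, but at C level
   e <- evnts[i,]
   simtime <- e[1]                  # advance the simulated clock
   evnts <- evnts[-i,,drop=FALSE]   # remove the event being executed
   evnts <- handleEvent(e,evnts)    # may append new events at the tail end
}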

Now, what about my pure-R DES package vs. simmer, which does its core operations in C++? The simmer package ought to be faster, right? Yes, but one must first understand that the two packages are not really comparable, as follows.

There are two main programming paradigms (“world views”) in DES. Let’s illustrate that with M/M/1:

  • Event-oriented: Here the code explicitly recognizes how one event triggers others. For a job arrival in M/M/1, the code reacting to that arrival will decide whether to add the job to the server queue, or to start serving it now if the queue is empty, and that same code will schedule the next job arrival. (A sketch appears after this list.)
  • Process-oriented: Here each entity more or less “minds its own business,” with fewer lines of code that explicitly show interactions between entities. In M/M/1 we might have an arrival process function and a server process function. The latter might continually watch the queue to see when it becomes nonempty, and the former might add the newly arriving job to the queue, but NOT start service for the job in the case of an empty queue, as would happen in the event-oriented approach.
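To illustrate the event-oriented world view, here is a sketch of an M/M/1 arrival handler. All the names here (srvBusy, queue, arrrate, srvrate, schedule()) are invented for this illustration:

# event-oriented reaction to a job arrival in M/M/1
arrival <- function(simtime) {
   if (srvBusy) {
      queue <<- c(queue,simtime)   # server busy, so the job must wait
   } else {
      srvBusy <<- TRUE             # server idle, so start service right away
      schedule('svcdone',simtime + rexp(1,srvrate))
   }
   schedule('arrival',simtime + rexp(1,arrrate))  # schedule the next arrival
}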

The pros and cons are: The event-oriented approach is much easier to implement, but arguably less clear. The process-oriented approach requires threading of some kind (not necessarily the “classical” kind), but most people, including me, consider it clearer and more elegant.

The simmer package is process-oriented, and in fact is modeled on SimPy, a popular DES library written in Python. I’m a big fan of SimPy, which is another reason why I like simmer.

HOWEVER, the process-oriented approach, ceteris paribus, tends to be slow. This is due to various reasons related to the threading, but at any rate, the event-oriented approach, for all its inelegance, does tend to excel somewhat in this regard.

My DES package is event-oriented. So I was curious which package would be faster, the pure-R event-oriented code or the R-calls-C++ process-oriented code, and I ran a small experiment. (Disclaimer: not claimed to generalize.) Both packages include M/M/1 examples, so I ran them for various values of the mean interarrival and service rates. I won’t enumerate the results here, but generally the C++/process-oriented runs were about 25-50% faster than the R/event-oriented ones.

There may be many issues here. For instance, DES’ deletion of an event involves code like

 simlist$evnts <- simlist$evnts[-i,,drop=F]

This may involve reallocating memory for the entire matrix.
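One possible remedy, sketched here under the assumption that column 1 of the event matrix holds the event times: instead of physically deleting row i, mark the event as dead by setting its time to Inf, so that which.min() never selects it, and compact the matrix only occasionally:

# lazy deletion: overwrite the time instead of copying the whole matrix
simlist$evnts[i,1] <- Inf
# every so often, purge the dead rows in one pass
if (sum(is.infinite(simlist$evnts[,1])) > 1000)
   simlist$evnts <- simlist$evnts[is.finite(simlist$evnts[,1]),,drop=FALSE]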

I will leave all this as an exercise for the reader. 🙂

 

Any Forward Progress on p-Values?

Statisticians have long known that the use of p-values has major problems. Some of us have long called for reform, weaning the profession away from these troubling beasts. At one point, I was pleased to see Frank Harrell suggest that R should stop computing them.

That is not going to happen, but last year the ASA shocked many people by producing a manifesto stating that p-values are often wrongly used, and in any case used too much. Though the ASA statement did not go so far as to recommend abandoning the methodology entirely, to me it came pretty close. Most significant to me (pardon the pun) was the fact that, though the ASA report said p-values are appropriate in some situations, it did not state any examples. I wrote blog posts on this topic here, here and here. I noted too that the ASA report had even made news in Bloomberg Businessweek.

But not much seems to have changed in the profession since then, as was shown rather dramatically last Saturday. The occasion was iidata 2017, a student-run conference on data science at UC Davis. This was the second year the conference had been held, very professionally managed and fun to attend. There were data analysis competitions, talks by people from industry, and several workshops on R, including one on parallel computing in R by UCD Stat PhD student Clark Fitzgerald.

Now, here is how this connects to the topic of p-values. I was a judge on one of the data analysis competitions, and my fellow judges and I were pretty shocked by the first team to present. The team members were clearly bright students, and they gave a very polished, professional talk. Indeed, we awarded them First Prize. However…

The team presented the p-values for various tests, not mentioning any problems regarding the large sample size, 20,000. During the judges’ question period, we asked them to comment on the effect of sample size, but they still missed the point, saying that n = 20,000 is enough to ensure that each cell in their chi-squared test would have enough observations! After quite a lot of prodding, one of them finally said there may be an issue of practical significance vs. statistical significance.
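A quick illustration of the issue, with invented numbers: with samples this large, even a practically negligible difference in means yields a tiny p-value.

set.seed(9999)
x <- rnorm(100000,mean=52.15,sd=1.8)   # "treatment" group (simulated data)
y <- rnorm(100000,mean=52.09,sd=1.8)   # "nontreatment" group
t.test(x,y)$p.value   # minuscule, though the true difference, 0.06, is negligible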

Once again, we cannot blame these bright, energetic students. They were simply doing what they had been taught to do — or, I should say, NOT doing what they had NOT been taught to do, which is to view p-values with care, especially with large n. The blame should instead be placed on the statistics faculty who taught them. The dangers of p-values should have been constantly drilled into them in their coursework, to the point at which a dataset with n = 20,000 should have been a red flag to them.

On this point, I’d like to call readers’ attention to the ASA Symposium on Statistical Inference, to be held in Bethesda, MD on October 11-13, 2017. Under the leadership of ASA Executive Director Ron Wasserstein, we are in the process of putting together what promises to be a highly stimulating and relevant program, with prominent speakers, and most important, lots of interactive participation among attendees. Hopefully much of the discussion will address ways to get serious coverage of the issue into statistics curricula, rather than the present situation, in which stat instructors briefly make a one-time comment to students about practical significance vs. statistical significance.

Threading in R?

I was pleased to see today’s post, “(A Very) Experimental Threading in R,” by Lukasz Bartnik, as this is a long-standing interest of mine. My own effort in this direction has been my package Rdsm.

The notion of threading, for those who may not have this background, refers to several instances of a program, in this case, several instances of R, sharing global variables but otherwise running independently. As Bartnik points out, this can make I/O programming easier and clearer; see my Python tutorial, Chapter 4, for a network sockets example, in which the code must deal with situations in which data may come from one of many sockets, but without foreknowledge of which socket will be next. By having a separate process devoted to each socket, but storing the incoming data in a shared variable, the problem is neatly solved (and more conveniently than using nonblocking I/O).

R examples of various sorts are given in the Rdsm package. And as Bartnik also points out, perhaps the most common situation where his or my package might be used is running an R “background job.”
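For a taste of what Rdsm code looks like, here is a minimal sketch, based on my recollection of the package’s own examples; mgrinit() and mgrmakevar() set up the shared variable, and myinfo is the per-worker information list the package provides:

library(parallel)
library(Rdsm)

cls <- makeCluster(2)      # two R processes will act as the "threads"
mgrinit(cls)               # initialize Rdsm on the cluster
mgrmakevar(cls,'m',2,2)    # create a 2x2 shared matrix m

# each worker writes its own ID into its row of the shared matrix
clusterEvalQ(cls, {m[myinfo$id,] <- myinfo$id; NULL})
print(m[,])                # the manager sees the workers' writes
stopCluster(cls)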

In terms of speedups through parallelization, the results are mixed. See my book on parallel computation for data science. Bartnik’s package conceivably could have lower overhead, making parallelization speedup from threaded R more feasible.

For me, a major piece of unfinished business regarding Rdsm is the use of backing storage, i.e. storing the shared variables on disk. This could help with problems having large memory needs, and may be useful for distributed computation. Rdsm runs on top of bigmemory, which does allow use of backing storage. However, this seems to require a file system that immediately propagates file changes made by one process to visibility by other processes, which didn’t seem to work on ordinary Linux systems, for instance.

As to truly threaded R, my understanding (could need an update) is that the R Core team has vowed “Never!” Too many technical issues.

Using CART: Implementation Matters

In preparing the following example for my forthcoming book, I was startled by how different two implementations of Classification and Regression Trees (CART) performed on a particular data set. Here is what happened:

For my example, I first used the Vertebral Column data set from the UCI Machine Learning Repository. The task is to classify patients into one of three vertebral categories. There are 310 observations, with only 6 predictor variables, so there should be no problem of overfitting. Using a straight logistic model, I achieved about 88% accuracy.

I then tried CART, using the rpart package. (Note, throughout the book, I try to stick to default values of the arguments.) Here is the code:


> vert <- read.table('column_3C.dat',header=FALSE)
> library(rpart)
> rpvert <- rpart(V7 ~ .,data=vert,method='class')
> rpypred <- predict(rpvert,type='class')
> mean(rpypred == vert$V7)
[1] 0.883871

OK, very similar.

Then I tried the well-known Letters Recognition data set from UCI. This too is a classification problem, one class for each of the capital letters ‘A’ through ‘Z’, with 20,000 observations. (The letters are represented about equally frequently in this data, so the priors are ‘wrong’, but that is not the issue here.) There are 16 predictor variables. I got about 84% accuracy, again using logit (All vs. All).

However, rpart did poorly:

> rplr <- rpart(lettr ~ .,data=lr,method='class')
> rpypred <- predict(rplr,type='class')
> mean(rpypred == lr$lettr)
[1] 0.4799

Of course, potential deficiencies in CART led the original developers of CART to the notion of random forests, so I gave that a try.

> library(randomForest)
> rflr <- randomForest(lettr ~ .,data=lr)
> rfypred <- predict(rflr)
> mean(rfypred == lr$lettr)
[1] 0.96875

Really? Can there be that vast a difference between CART and random forests? And by the way, I got about 91% accuracy with k-Nearest Neighbors (implemented in the knnest function from my regtools package on CRAN).

I speculated that the cause might be that the 26 response classes here are non-monotonically related to the predictors. I put that theory to a couple of the originators of CART, Richard Olshen and Chuck Stone, but they didn’t seem to think it was an issue. But while it is true that nonmonotonicity should eventually be handled by predictors being split multiple times, I still suspect it could be the culprit, say due to the tree-building process stopping too early.

On that hunch, I tried another implementation of CART, ctree from the partykit package, which uses quite different splitting and stopping rules. This was considerably better:

> library(partykit)
> ctout <- ctree(lettr ~ .,data=lr)
> ctpred <- predict(ctout,lr)
> mean(ctpred == lr$lettr)
[1] 0.8552

Hmm…Not sure what to make of this, but it certainly is thought-provoking.

By the way, partykit includes a random forest implementation as well, but it is slow and can be a memory hog. The authors still consider it experimental.