Heteroscedasticity in Regression — It Matters!

R’s main linear and nonlinear regression functions, lm() and nls(), report standard errors for parameter estimates under the assumption of homoscedasticity, a fancy word for a situation that rarely occurs in practice. The assumption is that the (conditional) variance of the response variable is the same at any set of values of the predictor variables.

Take for instance my favorite introductory example, predicting human weight from height. (It is unlikely we’d need to make such a prediction, but it serves as a quick and easy illustration.) The homoscedasticity assumption says that tall people have no more variation in weight than short people, certainly not true.

So, confidence intervals and p-values calculated by lm() and nls() are generally invalid. In this blog post, I’ll show how to remedy this problem — including some new code that I’ll provide here — and give an example showing just how far wrong one can be without a remedy.

In the linear case, the remedy is simple, but probably not widely known. Years ago, researchers such as Eickert and White derived large-sample theory for the heteroscedastic variance case, and it is coded in CRAN’s car and sandwich packages in the functions hccm() and vcovHC(), respectively. I’ll use the latter here, as its name is similar to that of R’s vcov() function.

The latter inputs the result of a call to lm() or nls(), and outputs the estimated covariance matrix of your estimated parameter vector. Thus the standard errors of the estimated parameters are the square roots of the diagonal elements of the matrix returned by vcov(). Let’s check that, using the example from the online help for lm():

> ctl <- c(4.17,5.58,5.18,6.11,4.50,4.61,5.17,4.53,5.33,5.14)
> trt <- c(4.81,4.17,4.41,3.59,5.87,3.83,6.03,4.89,4.32,4.69)
> group <- gl(2, 10, 20, labels = c("Ctl","Trt"))
> weight <- c(ctl, trt)
> lm.D9 <- lm(weight ~ group)
> summary(lm.D9)
 Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.0320 0.2202 22.850 9.55e-15
groupTrt -0.3710 0.3114 -1.191 0.249
> diag(sqrt(vcov(lm.D9)))
(Intercept) groupTrt 
 0.2202177 0.3114349 

Sure enough, the standard errors match. ¬†(Were you doubting it? ūüôā )

So, the linear case, i.e. lm(), is taken care of. ¬†But unfortunately, there doesn’t seem to be anything for the nonlinear case, nls(). The following code solves that problem:


nlsvcovhc <- function(nlslmout) {
 b <- coef(nlslmout)
 m <- nlslmout$m
 resid <- m$resid()
 hmat <- m$gradient()
 fakex <- hmat
 fakey <- resid + hmat %*% b
 lmout <- lm(fakey ~ fakex - 1)

(The names such as fakex, which I should remove, are explained in the last paragraph below.) You can download this code (complete with comments, which I’ve omitted here for brevity) here. The function is applied to the output of nlsLM(), which is a¬†modified version of¬†nls()¬†in the package¬†minpack.lm, which has better convergence behavior.

Here’s an example, using the enzyme data set vmkmki from the CRAN package nlstools. (I removed the last 12 observations, which for reasons I won’t go into here seem to be anomalous.) The nonlinear model here is the one given in the online help for vmkmki.

> regftn <- function(t,u,b1,b2,b3) b1 * t / (t + b2 * (1 + u/b3))
> bstart <- list(b1=1,b2=1, b3=1)
> z <- nls(v ~ regftn(S,I,b1,b2,b3),data=vmkmki,start=list(b1=1,b2=1, b3=1))
> z
Nonlinear regression model
 model: v ~ regftn(S, I, b1, b2, b3)
 data: vmkmki
 b1 b2 b3 
18.06 15.21 22.28 
> vcov(z)
 b1 b2 b3
b1 0.4786776 1.374961 0.8930431
b2 1.3749612 7.568837 11.1332821
b3 0.8930431 11.133282 29.1363366

Compare that to using Eickert-White:

ttt# get new z from nlsLM(), not shown
> nlsvcovhc(z)
 fakex1 fakex2 fakex3
fakex1 0.4708209 1.706591 2.410712
fakex2 1.7065910 10.394496 20.314688
fakex3 2.4107117 20.314688 53.086958

This is rather startling! Except for the estimated variance of the first parameter estimate, the estimated variances and covariances from Eickert-White are much larger than what nls()} found under the assumption of homoscedasticity. In other words, nls() was way off the mark. In fact, this simulation code shows that the standard errors reported by nls() can lead, for instance, to confidence intervals having only 60% coverage probability rather than 90%, due to heteroscedasticity, while the above code fixes the problem.

How does that code work? (Optional reading.) Most nonlinear least-squares procedures use a local linear approximation, such as in the Gauss-Newton algorithm. This results computationally in a fake lm() setting. As such, we are already set up to use the delta method. We can then apply vcovHC().


Macros in R

In programming, sometimes it’s useful to write a macro rather than a function.¬†(Don’t worry if you’ve never heard the term before.)¬†In this post, I’ll give an example of use of macros in R. using the gtools package on CRAN.

I wanted to write some utility code to help me reuse my earlier R commands during an interactive R session. Most (though not all) of what I wanted is already provided in the excellent R user interface systems such as ESS, RStudio and vim-r, but for various reasons I generally use the command line directly, especially for short sessions. (I also have developed my own vim editing mappings for R.) Specifically, I wanted to develop utilities to perform the following tasks:

  • Display all my recent commands, possibly restricting to those matching or excluding certain character strings.
  • Choose one of the recent commands for re-execution.
  • Re-execute a recent command by number or matching string.

One nice thing about R is that one can easily form some command programmatically in a character string, say using paste(), and then execute the string as a command, using eval(). Thus it would be easy to code up the above-listed tasks into functions that I can call when needed, except for one problem: Within the body of a function, one has a different environment than at the caller’s level.

Take for instance (a slightly modified version of) the first example in the online help for defmacro() in gtools, in which to goal is to write a function to recode as NAs all entries in a data frame column with a certain value. The following will NOT work:

setNA <- function(df,var,val) 
   df$var[df$var == val] <- NA

The problem is that within setNA(), df will be a data frame that is only a copy of the one in your call. The recoding to NAs will be made only to the copy. The defmacro() function in gtools would do what you really want:

setNA <- defmacro(df,var,val,
   expr={df$var[df$var == val] <- NA}

With this, df really will be the desired data frame. I urge you to try a little test with both of the above code snippets.

The code I wrote for my command-history utilities is rather short, but too long to place in a blog post; instead, you can access it here. There is a short sample session at the end of the file, which again I urge you to execute by hand to fully understand it. Take note of the tasks I coded as macros instead of functions, and think about why I needed to do so.

How is all this magic accomplished? The defmacro() function still makes use of eval(), substitute() etc., but it already does that work for you, so that you can write your macro while thinking of it as function. Your code is a function — defmacro() builds the function and returns it, and by the way note that that means it is debugable — but again, the point is that you don’t have to deal with all the calls to eval() etc.

If you are a C/C++ programmer, note that this differs from macros in that language, which are straight substitutions made by the compiler preprocessor.


Discovered Two Great Web Sites Today

Today is my lucky day.  I learned of two very interesting Web pages, both of them quite informative and the first of them rather provocative (yay!). I have some comments on both, in some cases consisting of mild disagreement, which I may post later, but in any event, I highly recommend both.  Here they are:

Tip: Read Drew’s page as soon as possible, before he removes the F-bombs. ūüôā