R’s main linear and nonlinear regression functions, **lm()** and **nls(),** report standard errors for parameter estimates under the assumption of *homoscedasticity*, a fancy word for a situation that rarely occurs in practice. The assumption is that the (conditional) variance of the response variable is the same at any set of values of the predictor variables.

Take for instance my favorite introductory example, predicting human weight from height. (It is unlikely we’d need to make such a prediction, but it serves as a quick and easy illustration.) The homoscedasticity assumption says that tall people have no more variation in weight than short people, certainly not true.

So, confidence intervals and p-values calculated by **lm()** and **nls()** are generally invalid. In this blog post, I’ll show how to remedy this problem — including some new code that I’ll provide here — and give an example showing just how far wrong one can be without a remedy.

In the linear case, the remedy is simple, but probably not widely known. Years ago, researchers such as Eickert and White derived large-sample theory for the heteroscedastic variance case, and it is coded in CRAN’s **car** and **sandwich** packages in the functions **hccm()** and **vcovHC()**, respectively. I’ll use the latter here, as its name is similar to that of R’s **vcov()** function.

The latter inputs the result of a call to **lm()** or **nls()**, and outputs the estimated covariance matrix of your estimated parameter vector. Thus the standard errors of the estimated parameters are the square roots of the diagonal elements of the matrix returned by **vcov()**. Let’s check that, using the example from the online help for **lm()**:

> ctl <- c(4.17,5.58,5.18,6.11,4.50,4.61,5.17,4.53,5.33,5.14) > trt <- c(4.81,4.17,4.41,3.59,5.87,3.83,6.03,4.89,4.32,4.69) > group <- gl(2, 10, 20, labels = c("Ctl","Trt")) > weight <- c(ctl, trt) > lm.D9 <- lm(weight ~ group) > summary(lm.D9) ... Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 5.0320 0.2202 22.850 9.55e-15 groupTrt -0.3710 0.3114 -1.191 0.249 > diag(sqrt(vcov(lm.D9))) (Intercept) groupTrt 0.2202177 0.3114349

Sure enough, the standard errors match. (Were you doubting it? 🙂 )

So, the linear case, i.e. **lm(),** is taken care of. But unfortunately, there doesn’t seem to be anything for the nonlinear case, **nls()**. The following code solves that problem:

library(minpack.lm) library(sandwich) nlsvcovhc <- function(nlslmout) { b <- coef(nlslmout) m <- nlslmout$m resid <- m$resid() hmat <- m$gradient() fakex <- hmat fakey <- resid + hmat %*% b lmout <- lm(fakey ~ fakex - 1) vcovHC(lmout) }

(The names such as **fakex**, which I should remove, are explained in the last paragraph below.) You can download this code (complete with comments, which I’ve omitted here for brevity) here. The function is applied to the output of **nlsLM()**, which is a modified version of **nls()** in the package **minpack.lm**, which has better convergence behavior.

Here’s an example, using the enzyme data set **vmkmki** from the CRAN package **nlstools**. (I removed the last 12 observations, which for reasons I won’t go into here seem to be anomalous.) The nonlinear model here is the one given in the online help for **vmkmki**.

> regftn <- function(t,u,b1,b2,b3) b1 * t / (t + b2 * (1 + u/b3)) > bstart <- list(b1=1,b2=1, b3=1) > z <- nls(v ~ regftn(S,I,b1,b2,b3),data=vmkmki,start=list(b1=1,b2=1, b3=1)) > z Nonlinear regression model model: v ~ regftn(S, I, b1, b2, b3) data: vmkmki b1 b2 b3 18.06 15.21 22.28 ... > vcov(z) b1 b2 b3 b1 0.4786776 1.374961 0.8930431 b2 1.3749612 7.568837 11.1332821 b3 0.8930431 11.133282 29.1363366

Compare that to using Eickert-White:

ttt# get new z from nlsLM(), not shown > nlsvcovhc(z) fakex1 fakex2 fakex3 fakex1 0.4708209 1.706591 2.410712 fakex2 1.7065910 10.394496 20.314688 fakex3 2.4107117 20.314688 53.086958

This is rather startling! Except for the estimated variance of the first parameter estimate, the estimated variances and covariances from Eickert-White are much larger than what **nls()**} found under the assumption of homoscedasticity. In other words, **nls()** was way off the mark. In fact, this simulation code shows that the standard errors reported by **nls()** can lead, for instance, to confidence intervals having only 60% coverage probability rather than 90%, due to heteroscedasticity, while the above code fixes the problem.

How does that code work? (Optional reading.) Most nonlinear least-squares procedures use a local linear approximation, such as in the Gauss-Newton algorithm. This results computationally in a fake **lm()** setting. As such, we are already set up to use the *delta method*. We can then apply **vcovHC().**

Just a small question here out of curisoity: I have never fully understood why one would use heteroskedasticity-robust standard errors in the non-linear case. For instance, if I estimate a standard probit model, to get at the coefficient estimates, we need to assume iid error terms to be able to represent the joint likelihood as the product of the likelihood of an individual observation, which we can then handle more easliy. By computing robust standard errors for the probit, you are saying that you should not trust the coefficients (since they have been derived with an iid assumption). What am I missing / misunderstanding here?

In a probit or logit model, the variance structure is a known function of the mean (i.e. of the regression function), and that fact is used in, for instance, the iteratively reweighted least squares algorithm for computation.

In a general nonlinear regression analysis, there is no assumed covariance structure, thus no likelihood function (even if one wants to assume normal errors, which I prefer not to do, and which really isn’t necessary). The standard analyses, such as that of

nls()in R, assume constant variance. This assumption is in most real cases not even approximately true, and the simulation in my original post shows how this can badly distort the reported standard errors, such as those obtained fromvcov()fornls(). This problem occurs regardless of whether one’s model is linear or nonlinear. This was the motivation for the Eickert-White method for the linear case, but as I explained, this can easily be extended to the nonlinear case.By the way, speaking of i.i.d. in the probit model, the observations are non-identically distributed only conditional on X (the vector of predictor variables). In the so-called “random X” case, the vector (Y,X) is assumed to be sampled from a joint population distribution, in which setting the observations actually are identically distributed.

It’s hard to discuss this in depth in this small space, but if you wish to discuss it in more detail, let’s do so by e-mail.

…this doesn’t actually explain anything.