Use of freqparcoord for Regression Diagnostics

This is the third in my series of three posts on my package freqparcoord with Yingkang Xie. (My next post after this will show how to use R to explore one of my favorite examples of “what can go wrong” in statistics.)

Here is a very brief review of my previous posts regarding freqparcoord. A parallel coordinates plot draws one vertical axis for each variable, and then draws each data point as a segmented line connecting the values of the variables for that data point. This typically results in heavy screen clutter, but freqparcoord solves the problem by plotting only the data points with the largest estimated multivariate density values.

The package can also be used for outlier hunting (it plots the points with the smallest density values) and cluster hunting (the points corresponding to local maxima of the density are plotted).

In this posting, I will demonstrate yet another use for the package: regression diagnostics. Note that the term regression here means the conditional mean, so classification problems, e.g. logistic regression, are covered too. Specifically, the package includes a function regdiag() that assesses the fit of a parametric model, making use of the main function of the package, freqparcoord().

The basic idea is this: The function computes nonparametric estimates of the regression function at all data points, and then computes the difference between the parametric and nonparametric fits at each point; we call those differences the divergences. The latter become the first vertical axis, and then each of the predictor variables has its own vertical axis. The resulting plots then give the user a picture of where in the predictor space the parametric model often overpredicts, and where it often underpredicts. (Note the role of the nonparametric estimates, so that the divergences are NOT the residuals, and note that we are only plotting the “most typical” values of the divergences, due to the use of density by freqparcoord.)

In addition, when we make use of the output of R’s lm() linear regression function, regdiag() also makes available the adjusted R² value for the linear model and a corresponding value (the squared correlation of actual Y and predicted Y) for the nonparametric fit. If the nonparametric fit is substantially better here, it is an indication that our parametric model may be “leaving money on the table,” i.e. there is room for improvement, say by adding second-order terms.

We use k-NN for the nonparametric fit, by the way, and the estimate at a point uses only genuine neighbors, i.e. the point itself is excluded. Future versions may use a local-linear approximation, in order to deal with bias effects near the edges of the data. All data is centered and scaled.
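
To make the idea concrete, here is a rough sketch of the computation in plain R. This is NOT the package’s internal code; the function name divsketch is just an illustrative choice, the sign convention is one possibility, and the full distance matrix used here is only practical for modest data sizes.

# Sketch only: compare a parametric fit to a leave-one-out k-NN fit
# on centered and scaled predictors.
divsketch <- function(lmobj, k=50) {
   xs <- scale(model.matrix(lmobj)[,-1])   # predictors, centered and scaled
   y <- lmobj$model[[1]]                   # the response
   d <- as.matrix(dist(xs))                # all pairwise distances
   diag(d) <- Inf                          # exclude the point itself
   knnfit <- apply(d,1,function(dr) mean(y[order(dr)[1:k]]))  # k-NN estimate
   list(divs = fitted(lmobj) - knnfit,     # the divergences
        nonparamr2 = cor(y,knnfit)^2)      # squared cor. of actual Y and k-NN fit
}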

Here is an example, using the prgeng data set from the package. This is 2000 Census data for programmers and engineers in Silicon Valley. The R code is

library(freqparcoord)
data(prgeng)
pg <- prgeng
pg1 <- pg[pg$wageinc >= 40000 & pg$wkswrkd >= 48,]        # screen out the odd cases
pg1$ms <- as.integer(pg1$educ == 14)                      # MS dummy
pg1$phd <- as.integer(pg1$educ == 16)                     # PhD dummy
pg1$csee <- as.integer(pg1$occ == 140 | pg1$occ == 141)   # CS/EE occupation dummy
pg1$msphd <- pg1$ms + pg1$phd                             # MS or PhD
l1 <- lm(wageinc ~ age+msphd+csee+sex,data=pg1)
p <- regdiag(l1,tail=0.40)
p              # "print" (display) the graph
p$paramr2      # adjusted R-squared for the linear model
p$nonparamr2   # analogous value for the nonparametric (k-NN) fit

What does this code do? It loads the package and data; screens out the odd cases; creates dummy variables for MS/PhD degree and CS/EE job; runs lm(); plugs the output of lm() into regdiag(); “prints” the graph on the screen; and prints out the parametric and nonparametric R² values.

The latter values are 0.055 and 0.062. These are both low, but the point here is that there does appear to be room for improvement in the parametric model, possibly by adding second-order terms. Well, then, which second-order terms? The graph gives us some insight:

[Figure: regdiag() output for the wage model, divergences plotted against the predictors]

The regdiag() function plots the typical large negative divergences on top, the typical large positive ones on the bottom, and the middling ones in the middle. We see a sharp difference in divergences between the top and bottom plots for the Age variable, but not for the others. This suggests adding a quadratic age term to the parametric model, but not adding any interaction terms. After doing this (one way to do it is sketched below), the adjusted R² value for the parametric model becomes comparable to the nonparametric one.
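
For concreteness, the re-fit might look something like this; the original post does not show this step, so the object names age2, l2 and p2 are just illustrative:

pg1$age2 <- pg1$age^2                              # quadratic age term
l2 <- lm(wageinc ~ age+age2+msphd+csee+sex,data=pg1)
p2 <- regdiag(l2,tail=0.40)
p2$paramr2      # adjusted R-squared for the expanded model...
p2$nonparamr2   # ...compared again to the nonparametric value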

Let’s do a simulation, so we can see how this all works in a known setting:

n <- 10000
x <- matrix(rnorm(3*n),ncol=3)          # three continuous predictors
x <- cbind(x,sample(0:1,n,replace=T))   # first dummy predictor
x <- cbind(x,sample(0:1,n,replace=T))   # second dummy predictor
proddumms <- x[,4] * x[,5]              # interaction (product) of the two dummies
x <- cbind(x,proddumms)
mx <- x %*% rep(1,6)                    # true regression function: sum of all six columns
y <- mx + rnorm(n,sd=1)
l1 <- lm(y ~ x[,1:5])                   # fit WITHOUT the product term
p <- regdiag(l1,tail=0.40)
p

The last two predictors in the original data are dummies, and we add a product term for interaction. We then run lm() without the product term, pretending we don’t know about it, to see if regdiag() will notice. Indeed it does:

[Figure: regdiag() output for the simulated data]

The dummy variables are V5 and V6, and the plot indeed shows a lot of activity there! Thus the plot indicates a need for adding an interaction term, as it should.
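
To confirm, one can re-fit with the product term included (recall that it is column 6 of x); this follow-up is not in the original post, but it is the natural next step:

l2 <- lm(y ~ x[,1:6])        # now include the product term
p2 <- regdiag(l2,tail=0.40)
p2                           # the sharp pattern at the dummies should largely disappear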

More on freqparcoord

In a previous post, I introduced my new package with Yingkang Xie, freqparcoord. Here I’ll illustrate some of the other uses to which the package can be applied.

The freqparcoord package visualizes multivariate data by plotting the most frequent cases in the data, as defined by multivariate density estimation. The example in the previous post illustrated typical height, weight and age differences by position among baseball players.

But the package also allows plotting the least frequent cases, perfect for outlier hunting. Let’s apply this to the baseball data, say finding 3 outliers for each position:

> library(freqparcoord)
> data(mlb)
> freqparcoord(mlb,-3,4:6,7)

Here columns 4:6 are height, weight and age, while column 7 is position; under default settings, the data will be faceted vertically by position. Here is what is displayed:

[Figure: freqparcoord display of the 3 least frequent cases for each position]

All variables are centered and scaled. To get a better idea as to what the extreme values are, we can set the keepidxs argument, and then show the original data corresponding to the displayed lines:

> p <- freqparcoord(mlb,-3,4:6,7,keepidxs=4)
> p
> p$xdisp[,c(1,4:7)]
               Name Height Weight   Age PosCategory
237  Ivan_Rodriguez     69    218 35.25     Catcher
994      So_Taguchi     70    163 37.66  Outfielder
674    Julio_Franco     73    188 48.52   Infielder
964     Barry_Bonds     74    228 42.60  Outfielder
36        Toby_Hall     75    240 31.36     Catcher
35  A.J._Pierzynski     75    245 30.17     Catcher
891  Mike_Restovich     76    257 28.16  Outfielder
547      Tony_Clark     79    245 34.71   Infielder
155   C.C._Sabathia     79    290 26.61     Pitcher
275   Richie_Sexson     80    237 32.17   Infielder
559   Randy_Johnson     82    231 43.47     Pitcher
910       Jon_Rauch     83    260 28.42     Pitcher

Sexson was flagged due to his height and weight, while Franco emerged because he is 48 years old!

Our package can also be used for cluster hunting. Here we again look for the most frequent cases, but now locally most frequent, meaning that their estimated density values are local maxima.

Let’s try it on simulated data, with known clusters. We’ll generate from a mixture of 3 bivariate normals, with means at (0,0), (1,2) and (3,3). (The package includes a function rmixmvnorm() for such experiments.) Here are the results, both graphical and text:

 
> cv <- 0.5*diag(2)
> x <- rmixmvnorm(10000,2,3,list(c(0,0),c(1,2),c(3,3)),
+    list(cv,cv,cv))
> p <- freqparcoord(x,m=1,method="locmax",
+    keepidxs=1,k=50,klm=600)
> p          # display the plot
> p$xdisp    # the data points that were plotted
           [,1]       [,2]
[1,] -0.3556997  0.2423760
[2,] -0.0993228 -0.2510209
[3,]  0.2507786  0.1883437
[4,]  1.1386850  2.2073467
[5,]  2.7581227  2.8935957

[Figure: freqparcoord local-maxima ("locmax") plot for the simulated mixture data]

Of course, the results depend on the arguments, with default values being used here. The reader may wish to experiment with other values of klm. But the data clearly fall into 3 clusters, the correct number, centered near (0,0), (1,2) and (3,3), the correct locations. Note that we did NOT specify the number of clusters.
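
For instance, one might rerun the local-maxima search with a different smoothing value; the klm below is an arbitrary choice, just to see how sensitive the result is:

> p2 <- freqparcoord(x,m=1,method="locmax",
+    keepidxs=1,k=50,klm=300)
> p2$xdisp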

And there’s more! Watch this space for future posts.

The package is on CRAN but the latest version is here.

Adding Annotation to R Objects

When you take a photograph, you can include the date in the image, so you remember when you created it.  (In fact, under EXIF format, it’s stored in the image file anyway, even if it doesn’t appear in the picture.)  Wouldn’t it be nice to make annotations in the objects you create under R?

For example, here is a random forests analysis I did on some Census data:

library(freqparcoord)
data(prgeng)
pg <- prgeng
pggrd <- pg[pg$educ >= 14,]   # educ >= 14: at least a Master's degree
library(randomForest)
rf1 <- randomForest(wageinc ~ age,data=pggrd)

The R object rf1 here has various components, which you can check via the call names(rf1).

But rf1 is an S3 object, hence really just a list with a class attribute, and one can add components to a list at any time. So, one can type, say,

rf1$daterun <- date()

The point is that the date is now part of the object, and when you save the object to a file, e-mail it to someone else and so on, the date can always be checked.
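
For example (the file name here is hypothetical):

save(rf1,file="rf1analysis.RData")   # the annotation travels with the object
# ...later, perhaps on another machine:
load("rf1analysis.RData")
rf1$daterun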

I can also save my setup code there. I could, for example, make a function from the setup code, say using edit(), and then assign it to rf1:

> f
function() {
   library(freqparcoord)
   data(prgeng)
   pg <- prgeng
   pggrd <- pg[pg$educ >= 14,]
   library(randomForest)
   rf1 <<- randomForest(wageinc ~ age,data=pggrd)
}
> rf1$setupcode <- f
> rf1$setupcode   # check it
function() {
   library(freqparcoord)
   data(prgeng)
   pg <- prgeng
   pggrd <- pg[pg$educ >= 14,]
   library(randomForest)
   rf1 <<- randomForest(wageinc ~ age,data=pggrd)
}

You can do this with other S3 objects returned from R functions, e.g. plot objects of type “gg” returned from ggplot2 operations.
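
For instance, a “gg” object is itself just a list, so the same trick works; the particular plot below is arbitrary, chosen only for illustration:

library(ggplot2)
pl <- ggplot(pggrd,aes(x=age,y=wageinc)) + geom_point()
pl$daterun <- date()   # annotate the plot object just as we did rf1
pl$daterun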

Some “purists” may object to tinkering with objects like this. If you have such an objection, you can create a formal subclass of the given class, e.g. one named “mygg” in the plot example. (If the object is of S4 type, you’ll need to do this anyway.)
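
Continuing the ggplot2 sketch above, a minimal version of that subclass approach might look like this; “mygg” and the helper name are illustrative, not an existing API:

as_mygg <- function(p,note) {
   p$note <- note                   # store the annotation
   class(p) <- c("mygg",class(p))   # prepend the subclass
   p
}
pl <- as_mygg(pl,date())
class(pl)   # "mygg" "gg" "ggplot"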

The freqparcoord Package for Multivariate Visualization

Recently my student Yingkang Xie and I developed freqparcoord, a novel approach to the parallel coordinates method for multivariate data visualization. Our approach:

  • Addresses the screen-clutter problem in parallel coordinates, by only plotting the “most typical” cases, meaning those with the highest estimated multivariate density values. This makes it easier to discern relations between variables.
  • Also allows plotting the “least typical” cases, i.e. those with the lowest density values, in order to find outliers.
  • Allows plotting only cases that are “local maxima” in terms of density, as a means of performing clustering.

The user has the option of specifying that the computation be done in parallel. (See http://heather.cs.ucdavis.edu/paralleldatasci.pdf for a partial draft of my book, Parallel Computing for Data Science: with Examples from R and Beyond, to be published by Chapman & Hall later this year. Comments welcome.)

For a quick intro to freqparcoord, download it from CRAN and load it into R. Type ?freqparcoord and run the examples, making sure to read the comments.

One of the examples, whose plot is shown below, involves baseball player data, courtesy of the UCLA Statistics Dept. Here we’ve plotted the 5 most typical lines for each position. We see that catchers tend to be shorter, heavier and older, while pitchers tend to be taller, lighter and younger.

[Figure: the 5 most typical height/weight/age lines for each position]
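
As a concrete quick start, a session along these lines produces the sort of plot just described (the second argument, 5, asks for the 5 most typical lines per group):

install.packages("freqparcoord")   # one-time install from CRAN
library(freqparcoord)
data(mlb)                          # the baseball data used in the examples
freqparcoord(mlb,5,4:6,7)          # height, weight, age; grouped by position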

New Blog on R, Statistics, Data Science and So On

Hi, Norm Matloff here. I’m a professor of computer science at UC Davis, and was a founding member of the UCD Dept. of Statistics. You may know my book, The Art of R Programming (NSP, 2011).  I have some strong views on statistics–which you are free to call analytics, data science, machine learning or whatever your favorite term is–so I’ve decided to start this blog.

In my next posting, I’ll discuss my CRAN package with Yingkang Xie,  called freqparcoord.  It’s a new approach to the parallel coordinates method for visualizing multivariate data.