Adding Annotation to R Objects

When you take a photograph, you can include the date in the image, so you remember when you took it.  (In fact, under the EXIF format, the date is stored in the image file anyway, even if it doesn’t appear in the picture itself.)  Wouldn’t it be nice to make annotations in the objects you create under R?

For example, here is a random forests analysis I did on some Census data:

library(freqparcoord)
data(prgeng)   # Census data on programmers and engineers
pg <- prgeng
pggrd <- pg[pg$educ >= 14,]   # restrict to the higher education levels
library(randomForest)
rf1 <- randomForest(wageinc ~ age, data = pggrd)   # predict wage income from age

The R object rf1 here has various components, which you can check via the call names(rf1).

But since rf1 is an S3 object, it is just a list, and one can add components to a list at any time. So one can type, say,

rf1$daterun <- date()

The point is that the date is now part of the object, and when you save the object to a file, e-mail it to someone else and so on, the date can always be checked.
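Here is a minimal sketch of that idea, using a plain list as a stand-in for a fitted model such as rf1, showing that the annotation survives a save-and-reload round trip:

```r
# 'obj' stands in for a model object such as rf1
obj <- list(result = 42)
obj$daterun <- date()          # attach the timestamp, as above

f <- tempfile(fileext = ".rds")
saveRDS(obj, f)                # save to file, e-mail, archive, etc.
obj2 <- readRDS(f)             # later, possibly on another machine
obj2$daterun                   # the date is still part of the object
```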

I can also save my setup code there. For example, I could make a function from my setup code, say using edit(), and then assign it to rf1:

> f
function() {
   library(freqparcoord)
   data(prgeng)
   pg <- prgeng
   pggrd <- pg[pg$educ >= 14,]
   library(randomForest)
   rf1 <<- randomForest(wageinc ~ age,data=pggrd)
}
> rf1$setupcode <- f
# check it
> rf1$setupcode
function() {
   library(freqparcoord)
   data(prgeng)
   pg <- prgeng
   pggrd <- pg[pg$educ >= 14,]
   library(randomForest)
   rf1 <<- randomForest(wageinc ~ age,data=pggrd)
}

You can do this with other S3 objects returned from R functions, e.g. plot objects of class “gg” returned from ggplot2 operations.

Some “purists” may object to tinkering with objects like this. If you have such an objection, you can create a formal subclass of the given class, e.g. one named “mygg” in the plot example. (If the object is of S4 type, you’ll need to do this anyway.)
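Here is a sketch of that subclass approach, using a base-R lm object for concreteness (the class name “annotated_lm” is illustrative; for a ggplot2 object you might use “mygg” as suggested above):

```r
fit <- lm(mpg ~ wt, data = mtcars)
fit$daterun <- date()                        # the annotation
class(fit) <- c("annotated_lm", class(fit))  # prepend the formal subclass
inherits(fit, "lm")                          # TRUE; existing lm methods still work
```

Since method dispatch falls through to the original class, print(), summary() and so on behave as before, while your own methods can be written for the subclass.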

The freqparcoord Package for Multivariate Visualization

Recently my student Yingkang Xie and I have developed freqparcoord, a novel approach to the parallel coordinates method for multivariate data visualization.  Our approach:

  • Addresses the screen-clutter problem in parallel coordinates, by only plotting the “most typical” cases, meaning those with the highest estimated multivariate density values. This makes it easier to discern relations between variables.
  • Also allows plotting the “least typical” cases, i.e. those with the lowest density values, in order to find outliers.
  • Allows plotting only cases that are “local maxima” in terms of density, as a means of performing clustering.

The user has the option of specifying that the computation be done in parallel.  (See http://heather.cs.ucdavis.edu/paralleldatasci.pdf for a partial draft of my book, Parallel Computing for Data Science: with Examples from R and Beyond, to be published by Chapman & Hall later this year.  Comments welcome.)

For a quick intro to freqparcoord, download it from CRAN and load it into R.  Type ?freqparcoord and run the examples, making sure to read the comments. One of the examples, whose plot is shown below, involves baseball player data, courtesy of the UCLA Statistics Dept.  Here we’ve plotted the 5 most typical lines for each position.  We see that catchers tend to be shorter, heavier and older, while pitchers tend to be taller, lighter and younger.
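The baseball example reads roughly as follows (a sketch from memory; the column indices and argument order are assumptions, so consult ?freqparcoord for the authoritative version):

```r
# Sketch of the baseball example; column indices and argument
# order here are assumptions -- check ?freqparcoord before relying on them.
library(freqparcoord)
data(mlb)                     # UCLA baseball player data
# 5 most typical lines per group, over the height/weight/age columns,
# grouped by player position
freqparcoord(mlb, 5, 4:6, 7)
```

If I recall correctly, a negative count (e.g. -5) requests the least typical cases instead, for outlier hunting; again, see the package help for the exact usage.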

New Blog on R, Statistics, Data Science and So On

Hi, Norm Matloff here. I’m a professor of computer science at UC Davis, and was a founding member of the UCD Dept. of Statistics. You may know my book, The Art of R Programming (NSP, 2011).  I have some strong views on statistics (which you are free to call analytics, data science, machine learning or whatever your favorite term is), so I’ve decided to start this blog.

In my next post, I’ll discuss freqparcoord, my CRAN package with Yingkang Xie.  It’s a new approach to the parallel coordinates method for visualizing multivariate data.