R > Python: a Concrete Example

I like both Python and R, and teach them both, but for data science R is the clear choice. When asked why, I always note (a) written by statisticians for statisticians, (b) built-in matrix type and matrix manipulations, (c) great graphics, both base and CRAN, (d) excellent parallelization facilities, etc. I also like to say that R is “more CS-ish than Python,” just to provoke my fellow computer scientists. 🙂

But one aspect that I think is huge but probably gets lost when I cite it is R’s row/column/element-name feature. I’ll give an example here.

Today I was dealing with a problem of ID numbers that are nonconsecutive.  My solution was to set up a lookup table. Say we have ID data (5, 12, 13, 9, 12, 5, 6). There are 5 distinct ID values, so we’d like to map these into new IDs 1,2,3,4,5. Here is a simple solution:

> x <- c(5,12,13,9,12,5,6)
> xuc <- as.character(unique(x))
> xuc
[1] "5" "12" "13" "9" "6"
> xLookup <- 1:length(xuc)
> names(xLookup) <- xuc
> xLookup
5 12 13 9 6
1 2 3 4 5

So, from now on, to do the lookup, I just use as subscript the character form of the original ID, e.g.

> xLookup['12']
12
2

Of course, I did all this within program code, where the lookup table was named ul. So to change a column of IDs to the new ones, I wrote

ul[as.character(testSet[,1])]
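
For instance, applied to the whole toy vector x from above, the lookup recodes every ID at once (using xLookup from the interactive example):

> xLookup[as.character(x)]
5 12 13 9 12 5 6
1 2 3 4 2 1 5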

Lots of other ways to do this, of course, but it shows how handy the names can be.

Example of Overfitting

I occasionally see queries on various social media asking what overfitting is, and so on. I’ll post an example here. (I mentioned it at my talk the other night on our novel approach to missing values, but had a bug in the code. Here is the correct account.)

The dataset is prgeng, on wages of programmers and engineers in Silicon Valley as of the 2000 Census. It’s included in our polyreg package, which we developed as an alternative to neural networks. But the package is quite useful in its own right, as it makes it very convenient to fit multivariate polynomial models. (Not as easy as it sounds; e.g. one must avoid cross products of orthogonal dummy variables, powers of those variables, etc.)
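
Incidentally, here is a tiny base-R illustration (toy vectors of my own, not the Census data) of why those cross products and powers must be dropped: squaring a 0/1 dummy just reproduces it, and the product of dummies for two levels of the same factor is identically zero, so such columns only create rank deficiency.

> d1 <- c(1,0,0,1)   # dummy for one level of a factor
> d2 <- c(0,1,0,0)   # dummy for another level of the same factor
> identical(d1^2, d1)
[1] TRUE
> d1 * d2
[1] 0 0 0 0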

First I split the data into training and test sets:

> set.seed(88888)
> getPE()
> pe1 <- pe[,c(1,2,4,6,7,12:16,3)]
> testidxs <- sample(1:nrow(pe1),1000)
> testset <- pe1[testidxs,]
> trainset <- pe1[-testidxs,]

As a baseline, I fit an ordinary degree-1 (linear) model and found the mean absolute prediction error (MAPE):

> lmout <- lm(wageinc ~ .,data=trainset)
> predvals <- predict(lmout,testset[,-11])
> mean(abs(predvals - testset[,11]))
[1] 25456.98

Next, I tried a quadratic model:

> pfout <- polyFit(trainset,deg=2)
> mean(abs(predict(pfout,testset[,-11]) - testset[,11]))
[1] 24249.83

Note that, though originally there were 10 predictors, now with polynomial terms we have 46.

I kept going:

deg   MAPE       # terms
 3    23951.69     118
 4    23974.76     226
 5    24340.85     371
 6    24554.34     551
 7    36463.61     767
 8    74296.09    1019
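
(The MAPE column, for instance, can be generated with a loop along the following lines. This is just a sketch, reusing the polyFit()/predict() calls from the degree-2 example; the higher degrees take quite a while to run.)

> for (d in 3:8) {
+    pfd <- polyFit(trainset,deg=d)
+    cat(d, mean(abs(predict(pfd,testset[,-11]) - testset[,11])), "\n")
+ }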

One must keep in mind the effect of sampling variation, and repeated trials would be useful here, but it seems that the data can stand at least a cubic fit and possibly as much as degree 5 or even 6. To be conservative, it would seem wise to stop at degree 3. That’s also consistent with the old Tukey rule of thumb that we should have p < sqrt(n), n being about 20,000 here.
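
For reference:

> sqrt(20000)
[1] 141.4214

So the 118 terms at degree 3 come in under that threshold, while the 226 at degree 4 already exceed it.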

In any event, the effects of overfitting here are dramatic, starting at degree 7.

It should be noted that I made no attempt to clean the data, nor to truncate predicted values at 0, etc.

It should also be noted that, starting at degree 4, R emitted warnings, “prediction from a rank-deficient fit may be misleading.” It is well known that at high degrees, polynomial terms can have multicollinearity issues.
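
As a quick illustration of the mechanism (a toy example, nothing to do with the Census data): raw powers of the same variable are themselves nearly collinear, and the problem worsens as the degree grows.

> set.seed(1)
> z <- runif(1000, 1, 2)
> cor(z, z^2)    # already very close to 1
> cor(z^7, z^8)  # even closer

Both correlations come out essentially 1, which is exactly the sort of near-collinearity behind R’s rank-deficiency warning.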

Indeed, this is a major point we make in our arXiv paper on this work. There we argue that neural networks are polynomial models in disguise, with the effective degree of the polynomial increasing substantially at each successive layer, and thus multicollinearity increasing from layer to layer. We’ve confirmed this empirically. We surmise that this is a major cause of convergence problems in NNs.

Finally, whenever I talk about polynomials and NNs, I hear the standard (and correct) concern that polynomials grow rapidly at the edges of the data. True, but I would point out that if you accept NNs = polynomials, then the same is true for NNs.

We’re still working on the polynomials/NNs project. More developments to be announced soon. But for those who are wondering about overfitting, the data here should make the point.