One aspect of R that I think is hugely valuable, but that probably gets lost when I cite it, is its row/column/element-name feature. I’ll give an example here.

Today I was dealing with a problem of ID numbers that are nonconsecutive. My solution was to set up a lookup table. Say we have ID data (5, 12, 13, 9, 12, 5, 6). There are 5 distinct ID values, so we’d like to map these into new IDs 1,2,3,4,5. Here is a simple solution:


> x <- c(5,12,13,9,12,5,6)
> xuc <- as.character(unique(x))
> xuc
[1] "5" "12" "13" "9" "6"
> xLookup <- 1:length(xuc)
> names(xLookup) <- xuc
> xLookup
 5 12 13  9  6
 1  2  3  4  5

So, from now on, to do the lookup, I just use as subscript the character form of the original ID, e.g.

> xLookup['12']
12
 2

Of course, I did all this within program code. So to change a column of IDs to the new ones, I wrote

ul[as.character(testSet[,1])]

Lots of other ways to do this, of course, but it shows how handy the names can be.
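Putting the pieces together, here is a self-contained sketch of the whole remapping, using the example vector from above (the name `ids` stands in for whatever ID column one actually has):

```r
# build a lookup table mapping old, nonconsecutive IDs to consecutive new ones
ids <- c(5, 12, 13, 9, 12, 5, 6)
xuc <- as.character(unique(ids))   # the distinct IDs, in character form
xLookup <- 1:length(xuc)           # new IDs 1,2,3,...
names(xLookup) <- xuc              # names are the old IDs
# remap the entire vector in one vectorized operation,
# via character subscripting on the names
newIds <- xLookup[as.character(ids)]
newIds   # 1 2 3 4 2 1 5
```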

The dataset is **prgeng**, on wages of programmers and engineers in Silicon Valley as of the 2000 Census. It’s included in our **polyreg** package, which we developed as an alternative to neural networks. But it is quite useful in its own right, as it makes it very convenient to fit multivariate polynomial models. (Not as easy as it sounds; e.g. one must avoid cross products of orthogonal dummy variables, powers of those variables, etc.)

First I split the data into training and test sets:

> set.seed(88888)
> getPE()
> pe1 <- pe[,c(1,2,4,6,7,12:16,3)]
> testidxs <- sample(1:nrow(pe1),1000)
> testset <- pe1[testidxs,]
> trainset <- pe1[-testidxs,]

As a base, I fit an ordinary degree-1 model and found the mean absolute prediction error:

> lmout <- lm(wageinc ~ .,data=trainset)
> predvals <- predict(lmout,testset[,-11])
> mean(abs(predvals - testset[,11]))
[1] 25456.98

Next, I tried a quadratic model:

> pfout <- polyFit(trainset,deg=2)
> mean(abs(predict(pfout,testset[,-11]) - testset[,11]))
[1] 24249.83

Note that, though originally there were 10 predictors, now with polynomial terms we have 46.

I kept going:

| deg | MAPE | # terms |
|-----|----------|------|
| 3 | 23951.69 | 118 |
| 4 | 23974.76 | 226 |
| 5 | 24340.85 | 371 |
| 6 | 24554.34 | 551 |
| 7 | 36463.61 | 767 |
| 8 | 74296.09 | 1019 |

One must keep in mind the effect of sampling variation, and repeated trials would be useful here, but it seems that the data can stand at least a cubic fit, and possibly as much as degree 5 or even 6. To be conservative, it would seem wise to stop at degree 3. That’s also consistent with the old Tukey rule of thumb that we should have p ≤ sqrt(n), n being about 20,000 here.
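As a quick arithmetic check of that rule of thumb, using the term counts from the table above:

```r
n <- 20000
sqrt(n)   # about 141
# degree-3 model: 118 terms, under the sqrt(n) bound
# degree-4 model: 226 terms, already well over it
```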

In any event, the effects of overfitting here are dramatic, starting at degree 7.

It should be noted that I made no attempt to clean the data, nor to truncate predicted values at 0, etc.

It should also be noted that, starting at degree 4, R emitted warnings, “prediction from a rank-deficient fit may be misleading.” It is well known that at high degrees, polynomial terms can have multicollinearity issues.

Indeed, this is a major point we make in our arXiv paper cited above. There we argue that neural networks are polynomial models in disguise, with the effective degree of the polynomial increasing a lot at each successive layer, and thus multicollinearity increasing from layer to layer. We’ve confirmed this empirically. We surmise that this is a major cause of convergence problems in NNs.

Finally, whenever I talk about polynomials and NNs, I hear the standard (and correct) concern that polynomials grow rapidly at the edges of the data. True, but I would point out that if you accept NNs = polynomials, then the same is true for NNs.

We’re still working on the polynomials/NNs project. More developments to be announced soon. But for those who are wondering about overfitting, the data here should make the point.

The dataset is prgeng, included in the package. It consists of wage income, age, gender, and so on, of Silicon Valley programmers and engineers, from the 2000 Census. We first load the data and then choose some of the variables (age, gender, education and occupation):

getPE()
pe1 <- pe[,c(1,2,6:7,12:16)]

So, let’s plot the graph:

The graph consists of streaks, about a dozen of them. What do they represent? To investigate that question, we call another **polyreg** function:

addRowNums(16,z)

This will write the row numbers of 16 random points from the dataset onto the graph that I just plotted, which now looks like this:

Due to overplotting, the numbers are difficult to read, but are also output to the R console:

[1] "highlighted rows:"
[1] 2847
[1] 5016
[1] 5569
[1] 6568
[1] 6915
[1] 8604
[1] 9967
[1] 10113
[1] 10666
[1] 10744
[1] 11383
[1] 11404
[1] 11725
[1] 13335
[1] 14521
[1] 15462

Rows 2847 and 10666 seem to be on the same streak, so they must have something in common. Let’s take a look.

> pe1[2847,]
         age sex ms phd occ1 occ2 occ3 occ4 occ5
2847 32.3253   1  1   0    0    0    0    0    0
> pe1[10666,]
          age sex ms phd occ1 occ2 occ3 occ4 occ5
10666 45.36755  1  1   0    0    0    0    0    0

Aha! Except for age, these two workers are identical in terms of gender (male), education (Master’s) and occupation (occ. category 6). Now those streaks make sense; each one represents a certain combination of the categorical variables.

Well, then, let’s see what UMAP does:

plot(umap(pe1))

The result is

The pattern here, if any, is not clear.

So in both examples, both last night’s and tonight’s, **prVis()** was not only simpler but also much more visually interpretable than UMAP.

In fairness, I must point out:

- I just used the default values of **umap()** in these examples. It would be interesting to explore other values. On the other hand, it may be that UMAP simply is not suitable for partially categorical data, as we have in this second example.
- For most other datasets I’ve tried, **prVis()** and UMAP give similar results.

Even so, these two points show the virtues of using **prVis()**. We are getting equal or better quality while not having to worry about settings for various hyperparameters.

But the purpose of this blog post is to focus on one particular new feature, a visualization tool. Over the years a number of “nonlinear” methods generalizing Principal Components Analysis (PCA) have been proposed, such as ICA and KPCA some time ago, and more recently t-SNE and UMAP.

I’ve long felt that applying PCA to “polynomial-ized” versions of one’s data should do well too, and of course much more simply, a major virtue. So, now that we have machinery, the **polyreg** package, to conveniently build multivariate polynomial models — including for categorical variables, an important case — I developed a new function in **polyreg**, named **prVis()**. I’ll illustrate it in this blog post, and compare it to UMAP.

A popular example in the “manifold visualization” (MV) business is the Swiss Roll model, which works as follows: A 4-component mixture of bivariate normals is generated, yielding a 2-column data frame whose column names are ‘x’ and ‘y’. From that we derive a 3-column data frame, consisting of triples of the form (x cos(x), y, x sin(x)).
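A minimal sketch of that construction follows. Note that the component means, spreads and sample sizes here are my own arbitrary choices, not the ones used for the data included in the package:

```r
set.seed(9999)
# 4-component mixture of bivariate normals, 250 points per component
means <- matrix(c(5,5, 10,5, 5,10, 10,10), ncol=2, byrow=TRUE)
xy <- do.call(rbind, lapply(1:4, function(i)
   cbind(rnorm(250, means[i,1]), rnorm(250, means[i,2]))))
xy <- data.frame(x = xy[,1], y = xy[,2])
# derive the 3-column "Swiss Roll" version: (x cos(x), y, x sin(x))
sw3 <- with(xy, data.frame(x1 = x*cos(x), x2 = y, x3 = x*sin(x)))
```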

The original (i.e. 2-column) data looks like this:

Here is the goal:

Using one of the MV methods on the 3-column data, and not knowing that the 2-column data was a 4-component mixture, could a person discern that latter fact?

We’ll try UMAP (said to be faster than t-SNE) and **prVis()**. But what UMAP implementation to use? Several excellent ones are available for R, such as umapr, umap and uwot. I use the last one, as it was the easiest for me to install, and especially because it includes a **predict()** method.

Here is the code. We read in the data from a file included in the **polyreg** package into **sw**, and change the mode of the last column to an R factor. (Code not shown.) The data frame sw actually has 4 columns, the last being the mixture component ID, 1-4.

We first try PCA:
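The PCA step (the resulting figure is not reproduced here) can be done with base R. This is my reconstruction, not necessarily the exact call used:

```r
# PCA on the 3 numeric columns of sw, ignoring the component-ID column
pcout <- prcomp(sw[,-4])
# plot the data projected onto the first two principal components
plot(pcout$x[,1:2])
```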

That swirl is where the Swiss Roll model gets its name, and clearly, the picture gives no hint at all as to how many components were in the original data. The model was constructed in such a way that PCA would fail.

So let’s try UMAP.

plot(umap(sw[,-4]))

So, how many components are there? (Remember, we’re pretending we don’t know it’s 4.) On the left, for instance, does that loop consist of just 1 component? 2? 3? We might go for 1. Well, let’s un-pretend now, and use the component labels, color-coded:

plot(umap(sw[,-4]),col=sw[,4])

Wow, that “loop” actually contained 2 of the original components, not just 1. We were led astray.

Let’s try **prVis()**.

Still rather swirly, but it definitely suggests 4 components. Let’s check, revealing the color-coded actual components:

prVis(sw,labels=TRUE)

Very nice! Each of the 4 apparent components really did correspond to an actual component.

Of course, usually the difference is not this dramatic, but in applying **prVis()** to a number of examples, I’ve found that it does indeed do as well as, or better than, UMAP and t-SNE, in a simpler, more easily explained manner, which as noted is a major virtue.

I’ll post another example, showing an interesting interpretation of a real dataset, in the next day or so.

All this involves a trick one can employ while working at R’s top level in interactive mode, the familiar > prompt.

I’ll use a toy example here. Let’s say I have some counting variable **x** that I need to increment occasionally as I work. Of course, the straightforward way to do this is

x <- x + 1

But to save typing and reduce distraction from my main thought processes, it would be nice if I were able to simply type

ix

The obvious approach would be to define a function **ix()** (“increment x”),

ix <- function() x <<- x + 1

and then call it each time by typing

ix()

But again, I want to save typing, and don’t want to type the parentheses. How can I arrange this?

My approach here will be to exploit the fact that in R’s interactive mode, typing an expression will print the value of that expression. If I type, say

y

R will first determine the class of **y**, and invoke the print method for that class. If I write that method myself, I can put any R code in there that I wish, such as code to increment **x** above! So here goes:

> w <- list(y=3)
> class(w) <- 'ix'
> print.ix <- function(ixObject) x <<- x + 1
> x <- 88
> w
> x
[1] 89
> w
> x
[1] 90

I set up an S3 class **‘ix’**, including a print method **print.ix()**. I created **w**, an instance of that class, and as you can see, each time I typed ‘w’, **x** did get incremented by 1.

What just happened? When I type ‘w’, the R interpreter will know that I want to print that variable. The interpreter finds that **w** is of class ‘ix’, so it calls the print method for that class, **print.ix()**, which actually doesn’t do any printing; it merely increments **x**, as desired.

So I don’t need to type the 4 characters ‘ix()’, or even the 2 characters ‘ix’; typing the single character ‘w’ suffices. A small thing, maybe, but very useful to me in the frenzy of code development and especially debugging, when I want to keep distractions from my train of thought to a minimum.

By the way, we have been doing further development on our **polyreg** package, with some interesting new features. More news on that soon.

- Though originally we had made the disclaimer that we had not yet done any experiments with image classification, there were comments along the lines of “If the authors had included even one example of image classification, even the MNIST data, I would have been more receptive.” So our revision does exactly that, with the result that polynomial regression does well on MNIST even with only very primitive preprocessing (plain PCA).
- We’ve elaborated on some of the theory (still quite informal, but could be made rigorous).
- We’ve added elaboration on other aspects, e.g. overfitting.
- We’ve added a section titled, “What This Paper Is NOT.” Hopefully those who wish to comment without reading the paper (!) this time will at least read this section.
- We’ve updated and expanded the results of our data experiments, including more details on how they were conducted.

We are continuing to add features to our associated R package, **polyreg**. More news on that to come.

Thanks for the interest. Comments welcome!

A summary of the paper is:

- We present a very simple, informal mathematical argument that neural networks (NNs) are in essence polynomial regression (PR). We refer to this as NNAEPR.
- NNAEPR implies that we can use our knowledge of the “old-fashioned” method of PR to gain insight into how NNs — widely viewed somewhat warily as a “black box” — work inside.
- One such insight is that the outputs of an NN layer will be prone to multicollinearity, with the problem becoming worse with each successive layer. This in turn may explain why convergence issues often develop in NNs. It also suggests that NN users tend to use overly large networks.
- NNAEPR suggests that one may abandon using NNs altogether, and simply use PR instead.
- We investigated this on a wide variety of datasets, and found that **in every case PR did as well as, and often better than, NNs**.
- We have developed a feature-rich R package, **polyreg**, to facilitate using PR in multivariate settings.
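The multicollinearity point in the summary above is easy to see in a toy computation (my own illustration, not taken from the paper): polynomial terms built from the same predictor are themselves highly correlated.

```r
set.seed(1)
x <- runif(100)   # predictor confined to (0,1)
cor(x, x^2)       # x and its square are highly correlated
cor(x^2, x^3)     # adjacent higher powers, even more so
```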

Much work remains to be done (see paper), but our results so far are very encouraging. By using PR, one can avoid the headaches of NN, such as selecting good combinations of tuning parameters, dealing with convergence problems, and so on.

Also available are the slides for our presentation at GRAIL on this project.

On the first day of the conference, one of the session chairs announced that a complaint had been made by the group R-Ladies, concerning the fact that all of the talks were given by men. The chair apologized for that, and promised efforts to remedy the situation in the future. Then on the second day, room was made in the schedule for two young women from R-Ladies to make a presentation. There also was a research paper presented by a woman, added at the last minute; she had presented work at the conference in the past.

I have been interested in status-of-women issues for a long time, and I spoke briefly with one of the R-Ladies women after the session. I suggested that she read a blog post I had written that raised some troubling related issues.

But I didn’t give the matter much further thought until Tuesday of this week, when a friend asked me about the “highly eventful” conference. That comment initially baffled me, but it turned out that he was referring to the R-Ladies controversy, which he had been following in the “tweetstorm” on the issue in #rfinance2018. Not being a regular Twitter user, I had been unaware of this.

Again, issues of gender inequity (however defined) have been a serious, proactive concern of mine over the years. I have been quite active in championing the cases of talented female applicants for faculty positions at my university, for instance. Of my five current research students, four are women. In fact, one of them, Robin Yancey, is a coauthor with me of the **partools** package that played a prominent role in my talk at this conference.

That said, I must also say that those tweets criticizing the conference organizers were harsh and unfair. As that member of the program committee pointed out, other than keynote speakers, the program consists of papers submitted for consideration by potential authors, and it turned out that no papers had been submitted by women. Many readers of those tweets will think that the program committee is prejudiced against women, which I really doubt is the case.

The women who complained also cited lack of a Code of Conduct for the conference. This too turned out to be a misleading claim, as there had been a Code of Conduct posted by the University of Illinois at Chicago, the host of the conference.

So, apparently there was no error of **co**mmission here, but some may feel an error of **o**mission did occur. Arguably any conference should make more proactive efforts to encourage female potential authors to submit papers for consideration in the program. Many conferences have invited talks, for instance, and R/Finance may wish to consider this.

However, there is, as is often the case, an issue of breadth of the pool. Granted, things like applicant pools are often used as excuses by, for example, employers for significant gender imbalances in their workforces. But as far as I know, the current state of affairs is:

- The vast majority of creators (i.e. ‘cre’ status) of R packages in CRAN etc. are men.
- The authors of the vast majority of books involving R are men.
- The authors of the vast majority of research papers related to R are men.

It is these activities that lead to giving conference talks, and groups like R-Ladies should promote more female participation in them. We all know some outstanding women in those activities, but to truly solve the problem, many more women need to get involved.

(Some material here was updated on July 21, 2018.)

I can relate to his comments personally, and indeed he has written the essay that I never had the courage to write about myself. But the big message in Yihui’s posting is that, really, that MP degree of his is far more useful than his PhD. If Yihui had been the Tiger Cub type (child of a Tiger Mom), we wouldn’t have **knitr**, and a lot more.

I was a strong opponent of Tiger Mom-ism long before Amy Chua coined the term. To me, it is highly counterproductive, destroying precious creativity and often causing much misery. I’m not endorsing laziness, mind you, but as Yihui shows, creative procrastination can produce wonderful results. As I write at the above link,

I submit that innovative people tend to be dreamers. I’m certainly not advocating that parents raise lazy kids, but all that intense regimentation in Tiger Mom-land clearly gives kids no chance to breathe, let alone dream.

Yihui is a dreamer, and the R community is much better for it.

I could tell Yihui is exceptionally creative the first day I met him. Who else would have the chutzpah to name his Web site The Capital of Statistics?

As mentioned, it was quite courageous on Yihui’s part to write his essay, but he is doing a public good in doing so; many, I’m sure, will find it inspirational.

Good for him, and good for R.


Hadley also gave an interesting talk, “An introduction to tidy evaluation,” involving some library functions that are aimed at writing clearer, more readable R. The talk came complete with audience participation, very engaging and informative.

The venue was GRAIL, a highly-impressive startup. We will be hearing a lot more about this company, I am sure.
