Our arXiv paper and the associated R package polyreg caused a bit of a stir, both pro and con, when we first announced them here in June. The discussion even spread as far as Twitter, Reddit and Hacker News. We’ll be announcing a revised paper, and various new features to the package, very soon.
But the purpose of this blog post is to focus on one particular new feature, a visualization tool. Over the years a number of “nonlinear” methods generalizing Principal Components Analysis (PCA) have been proposed, such as ICA and KPCA some time ago, and more recently t-SNE and UMAP.
I’ve long felt that applying PCA to “polynomial-ized” versions of one’s data should do well too, and of course much more simply, a major virtue. So, now that we have machinery, the polyreg package, to conveniently build multivariate polynomial models — including for categorical variables, an important case — I developed a new function in polyreg, named prVis(). I’ll illustrate it in this blog post, and compare it to UMAP.
A popular example in the “manifold visualization” (MV) business is the Swiss Roll model, which works as follows: A 4-component mixture of bivariate normals is generated, yielding a 2-column data frame whose column names are ‘x’ and ‘y’. Now derive from that a 3-column data frame, consisting triples of the form (x cos (x), y, x sin(x)).
The original (i.e. 2-column) data looks like this:
Here is the goal:
Using one of the MV methods on the 3-column data, and not knowing that the 2-column data was a 4-component mixture, could a person discern that latter fact?
We’ll try UMAP (said to be faster than t-SNE) and prVis(). But what UMAP implementation to use? Several excellent ones are available for R, such as umapr, umap and uwot. I use the last one, as it was the easiest for me to install, and especially because it includes a predict() method.
Here is the code. We read in the data from a file included in the polyreg package into sw, and change the mode of the last column to an R factor. (Code not shown.) The data frame sw actually has 4 columns, the last being the mixture component ID, 1-4.
We first try PCA:
That swirl is where the Swiss Roll model gets its name, and clearly, the picture gives no hint at all as to how many components were in the original data. The model was constructed in such a way that PCA would fail.
So let’s try UMAP.
So, how many components are there? (Remember, we’re pretending we don’t know it’s 4.) On the left, for instance, does that loop consists of just1 component? 2? 3? We might go for 1. Well, let’s un-pretend now, and use the component labels, color-coded:
Wow, that “loop” actually contained 2 of the original components, not just 1. We were led astray.
Let’s try prVis().
Still rather swirly, but it definitely suggests 4 components. Let’s check, revealing the color-coded actual components:
Very nice! Each of the 4 apparent components really did correspond to an actual component.
Of course, usually the difference is not this dramatic, but in applying prVis() to a number of examples, I’ve found that it does indeed do as well as, or better than, UMAP and t-SNE, but in a simpler, more easily explained manner, as noted a major virtue.
I’ll post another example, showing an interesting interpretation of a real dataset, in the next day or so.