Manifold Visualization: Second Example

In last night’s post, I introduced prVis(), a new visualization tool we have developed, available in our polyreg package. Recall that prVis() is intended as a simpler alternative to recent visualization tools such as t-SNE and UMAP. Here is a second example.

The dataset is prgeng, included in the package. It consists of wage income, age, gender, and so on, of Silicon Valley programmers and engineers, from the 2000 Census. We first load the data and then choose some of the variables (age, gender, education and occupation):

getPE()
pe1 <- pe[,c(1,2,6:7,12:16)]

So, let’s plot the graph:

The graph consists of streaks, about a dozen of them. What do they represent? To investigate, we call another polyreg function, passing it z, the saved output of the prVis() call:

addRowNums(16,z)

This will write the row numbers of 16 random points from the dataset onto the graph that I just plotted, which now looks like this:

Due to overplotting, the numbers are difficult to read on the graph, but they are also printed to the R console:

[1] "highlighted rows:"
[1] 2847
[1] 5016
[1] 5569
[1] 6568
[1] 6915
[1] 8604
[1] 9967
[1] 10113
[1] 10666
[1] 10744
[1] 11383
[1] 11404
[1] 11725
[1] 13335
[1] 14521
[1] 15462

Rows 2847 and 10666 seem to be on the same streak, so they must have something in common. Let’s take a look.

> pe1[2847,]
         age sex ms phd occ1 occ2 occ3 occ4 occ5
2847 32.3253   1  1   0    0    0    0    0    0
> pe1[10666,]
          age sex ms phd occ1 occ2 occ3 occ4 occ5
10666 45.36755  1  1   0    0    0    0    0    0
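Rather than comparing the two printouts by eye, the check can be done in code. Here is a minimal sketch; the two rows are typed in by hand from the output above so that the snippet runs on its own, and the names two, catCols and sameCombo are just my own illustrative choices.

```r
# The two rows, copied by hand from the console output above
two <- data.frame(
  age  = c(32.3253, 45.36755),
  sex  = c(1, 1), ms = c(1, 1), phd = c(0, 0),
  occ1 = c(0, 0), occ2 = c(0, 0), occ3 = c(0, 0),
  occ4 = c(0, 0), occ5 = c(0, 0),
  row.names = c("2847", "10666")
)

# Every column except the continuous one
catCols <- setdiff(names(two), "age")

# TRUE if the two workers agree on all of the categorical columns
sameCombo <- all(two["2847", catCols] == two["10666", catCols])
print(sameCombo)

# With the real data loaded, the same test would read
#   all(pe1[2847, -1] == pe1[10666, -1])
```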

Aha! Except for age, these two workers are identical in terms of gender (male), education (Master’s) and occupation (category 6, the one indicated by all five occ dummies being 0). Now those streaks make sense: each one represents a certain combination of the categorical variables, with age, the only continuous variable here, varying along the streak.
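To see why one continuous variable plus a few dummies should give streaks at all, here is a small sketch on simulated data. Note that it uses plain PCA via prcomp(), not prVis() itself, and that the data frame toy is made up for illustration; it is only meant to show the geometry, not to reproduce the graph above.

```r
set.seed(1)
n <- 300

# Simulated stand-in data: one continuous variable, two 0/1 dummies
toy <- data.frame(
  age = runif(n, 25, 60),
  sex = sample(0:1, n, replace = TRUE),
  ms  = sample(0:1, n, replace = TRUE)
)

# Each observed combination of the dummies forms one group
grp <- interaction(toy$sex, toy$ms, drop = TRUE)

# Plain PCA on the standardized columns; keep the first two components
pc <- prcomp(toy, scale. = TRUE)$x[, 1:2]

# Within a group only age varies, so PC1 and PC2 are both linear in age
# and the group's points fall on a straight line (|correlation| of 1,
# up to rounding)
withinCor <- sapply(split(as.data.frame(pc), grp),
                    function(d) abs(cor(d$PC1, d$PC2)))
print(round(withinCor, 6))

# One color per combination; the groups show up as parallel streaks
plot(pc, col = as.integer(grp), pch = 20)
```

Since every group moves in the same direction as age changes, the lines are parallel, one per combination of the dummies.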

Well, then, let’s see what UMAP does:

plot(umap(pe1))

The result is

The pattern here, if any, is not clear.

So in both last night’s example and tonight’s, prVis() was not only simpler but also much more visually interpretable than UMAP.

In fairness, I must point out:

  • I just used the default values of umap() in these examples. It would be interesting to explore other values. On the other hand, it may be that UMAP simply is not suitable for partially categorical data, as we have in this second example.
  • For most other datasets I’ve tried, prVis() and UMAP give similar results.
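For anyone who wants to try the first point, here is a rough sketch of a small grid over two UMAP settings. It rests on assumptions: I use the umap() function from the uwot package, with its n_neighbors and min_dist arguments, which may not be the umap() used above, and the data frame toy is simulated rather than the real pe1.

```r
# Assumption: the uwot package; the umap() call earlier may come from elsewhere
set.seed(1)

# Simulated stand-in for partially categorical data
toy <- data.frame(
  age = runif(200, 25, 60),
  sex = sample(0:1, 200, replace = TRUE),
  ms  = sample(0:1, 200, replace = TRUE)
)

# Illustrative grid of settings, not recommended values
grid <- expand.grid(n_neighbors = c(5, 15, 50),
                    min_dist    = c(0.01, 0.1, 0.5))

if (requireNamespace("uwot", quietly = TRUE)) {
  op <- par(mfrow = c(3, 3), mar = c(2, 2, 2, 1))
  for (i in seq_len(nrow(grid))) {
    emb <- uwot::umap(toy,
                      n_neighbors = grid$n_neighbors[i],
                      min_dist    = grid$min_dist[i])
    plot(emb, pch = 20,
         main = sprintf("nn=%d, md=%.2f",
                        grid$n_neighbors[i], grid$min_dist[i]))
  }
  par(op)
}
```

Replacing toy with pe1 would give nine UMAP plots to set beside the prVis() graph.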

Even so, these two points show the virtues of using prVis(). We are getting equal or better quality while not having to worry about settings for various hyperparameters.