It’s been a while since I’ve posted here, not for lack of material but just having too long a TO DO list. I’ve added some important features to my regtools package, for instance, and hope to discuss them here soon.
For the present post, though, I will discuss FOCI, a highly promising new approach to feature selection, developed by Mona Azadkia and Sourav Chatterjee of the Stanford Stat Dept. There is now an R package of the same name on CRAN.
I am one of the authors of the package, but was not involved in developing the method or the associated theory. My connection here was accidental. A few months ago, Sourav gave a talk on FOCI at Stanford. I must admit that I had been skeptical when I read the abstract. I had seen so many proposals on this topic, and had considered them impractical, full of unverifiable assumptions and rather opaque indications regarding asymptotic behavior. So, I almost didn’t attend, even though I was on the Stanford campus that day.
But I did attend, and was won over. The method has only minimal assumptions, explicit asymptotic behavior, and lo and behold, no tuning parameters.
I tried FOCI on some favorite datasets that evening, and was pleased with the results, but the algorithm was slow even on moderate-sized datasets. So I wrote to Sourav, praising the method but suggesting that they parallelize the software. He then asked if I might get involved in that, which I did.
You can find Mona and Sourav’s arXiv paper at https://arxiv.org/pdf/1910.12327.pdf, which extends earlier work by Sourav. The theory is highly technical, I’d say very intricate even by math stat standards, thus not for the faint of heart. Actually, it was amusing during Sourav’s talk when one of the faculty said, “Wait, don’t leave that slide yet!”, as the main estimator (T_n, p.3) shown on that slide, is a challenge to interpret.
The population-level quantity, Eqn. (2.1), is more intuitive. Say we are predicting Y from a vector-valued X, and are considering extending the feature set to (X,Z). The expression is still a bit complex — and please direct your queries to Mona and Sourav for details, not to me 🙂 — but at its core it is comparing Var(Y | X,Z) to Var(Y | X). How much smaller is the former than the latter? Of course, there are expectations applied etc., but that is the intuitive heart of the matter.
So here is a concrete example, the African Soil remote sensor dataset from Kaggle. After removing ID, changing a character column to R factor then to dummies, I had a data frame with n = 1157 and p = 3578. So p > n, with p being almost triple the size of n, a nice test for FOCI. For Y, there are various outcome variables; I chose pH.
The actual call is simple:
fout <- foci(afrsoil[,3597],afrsoil[,1:3578])
This specifies the outcome variable and features. There are some other options available. The main part of the output is the list of selected variables, in order of importance. After that, the cumulative FOCI correlations are shown:
> fout$selectedVar index names 1: 3024 m1668.14 2: 3078 m1564 3: 2413 m2846.45 4: 3173 m1380.8 5: 3267 m1199.52 6: 1542 m4526.16 7: 2546 m2589.96 8: 3537 m678.828 9: 3469 m809.965 10: 2624 m2439.54 11: 1874 m3885.9 12: 2770 m2157.98 > fout$stepT[1:12]  0.1592940 0.3273176 0.4201255 0.5271957 0.5642336 0.5917753 0.6040584  0.6132360 0.6139810 0.6238901 0.6300294 0.6303839
Of course, those feature names, e.g. ‘m1564’, are meaningful only to a soil scientist, but you get the idea. Out of 3578 candidate features, FOCI chose 12, attaining very strong dimension reduction. Actually, the first 8 or so form the bulk of the selected feature set, with the remainder yielding only marginal improvements.
But is it a GOOD feature set? We don’t know in this case, but Mona and Sourav have a very impressive example in the paper, in which FOCI outperforms random forests, LASSO, SCAD and so on. That, combined with the intuitive simplicity of the estimator etc., makes FOCI a very promising addition to anyone’s statistical toolkit. I’m sure more empirical comparisons are forthcoming.
And what about the computation speed? Two types of parallelism are available, with best results generally coming from using both in conjunction: 1. At any step, the testing of the remaining candidate features can be done simultaneously, on multiple threads on the host machine. 2. On a cluster or in the cloud, one can ask that the data be partitioned in chunks of rows, with each cluster node applying FOCI to its chunk. The union of the results is then formed, and fed through FOCI one more time to adjust the discrepancies. The idea is that that last step will not be too lengthy, as the number of candidate variables has already been reduced. A cluster size of r may actually produce a speedup factor of more than r (Matloff 2016).
Approach (2) also has the advantage that it gives the
user an idea of the degree of sampling variability in the FOCI
So there you have it. Give FOCI a try!
6 thoughts on “FOCI: a new method for feature selection”
Great post, thank you! Sounds highly interesting.
I replicated the results from Table 1 of the underlying paper and played a bit using different seeds and bootstrapped samples. For example, setting a seed of 9991:
n <- 2000
p <- 1000
x <- matrix(rnorm(n * p), nrow = n)
colnames(x) <- paste0(rep("x", p), seq(1, p))
y <- x[, 1] * x[, 2] + sin(x[, 1] * x[, 3])
The foci output to foci(y, x) is
1: 742 x742
2: 1 x1
3: 2 x2
4: 3 x3
 0.07392077 0.10628928 0.46329687 0.77538844
There are several other examples where foci is not able to detect 1, 2, 3 as the only relevant predictors (try different seeds, eg 999, 9992). The paper states "If computational time is not an issue, one can try to add m ≥ 2 variables at each step instead of just one". Using this strategy, I believe, foci would detect the variables 1, 2, 3 as the only ones. The computational run time remains a possible issue. But due to the implemented parallel processing this idea could be feasible and may indeed be something for a future update of the package.
Thanks for the comments. Don’t lose sight of the fact that this is statistics, working with samples. FOCI is subject to sampling variability. In the given example, FOCI will not come up with the correct set of variables in each and every possible sample, and I’ve encountered samples myself in which it didn’t. Note the comment re sampling variability in the foci() man page.
The idea of bringing in a group of variables at each step is good, but presents some real challenges, since there are O(n^2) pairs, O(n^3) triples and so on. It would appear to be feasible only if one has predefined groups of variables in mind, as in Group LASSO. Even the FOCI theory paper implicitly assumes that, I believe, and its application with predefined groups would seem to be limited.
Thanks for the quick reply and thoughts. I actually had only 2 variables for the first step in mind. I agree, though, that this can still be a real challenge (and it might be a somewhat arbitrary choice).
not able to install the package FOCI for trying out
package ‘foci’ is not available (for R version 3.5.2) .. Is the error message I got .. How to get this fixed..
The package requires R >= 3.6. I believe it would be installable on 3.5.2 if you download the source and build it yourself. Otherwise, you’ll need to upgrade.