Back to the BLAS Issue

A few days ago, I wrote here about how some researchers, such as Art Owen and Katelyn Gao at Stanford and Patrick Perry at NYU, have been using an old, old statistical technique — random effects models — for a new, new application — recommender systems. In addition to describing their approach to that problem, I also used this setting as an example of my partools package, in particular the parallel computation method I call Software Alchemy (SA). I illustrated SA on some movie ratings data, and got a nice speedup of the lmer() function in the lme4 package. The run time with SA was a little over 3 minutes, compared to nearly 20 minutes without it.

However, yesterday Doug Bates, one of the developers of lme4, wrote a comment on my blog, reporting that the run time for the same problem on his desktop machine was about 3 minutes, compared to my 20. This was quite intriguing, and I replied that I would investigate. I also mentioned that I would make a new post out of the matter, as I suspected it would be of general interest. Hence the post you are now reading.

I re-ran my code on the same machine I had used earlier, a PC running Linux in our student lab. I’ll skip the specs here, but suffice it to say it is an ordinary machine, about a year old. At any rate, I got basically the same run times as before.

It then occurred to me that Doug would probably have installed a fancy BLAS — the Basic Linear Algebra Subroutines that run R’s matrix ops — on his machine, whereas our student machine would just be running stock R, and therein lay the source of the big time discrepancy. You may recall my posting here on OpenBLAS, one of the fast alternatives to the BLAS version that comes with R.

So, I switched to a machine on which I had installed OpenBLAS. This machine actually is slower than the other one in terms of clock speed, but it does have 16 cores, and a hyperthreading degree of 2, so that 32 threads might profitably run simultaneously. (The first machine was quad-core with hyperthreading degree 2, but as noted, I didn't have OpenBLAS installed there.) Thus this second machine would make a fine testbed for assessing the performance of lme4 under OpenBLAS.

Sure enough, on this nominally slower machine, the problem ran in only about 8 minutes, not 20. And when I ran the top command from another shell window, I saw via the “% CPU” column that indeed many cores were at work, with the number fluctuating in the range 10-32. Remember, running k cores doesn’t necessarily mean a speedup of k, and often we get much less than that, but you can see that running a good BLAS can work wonders for the speed.
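By the way, if you want a quick check of whether your own R installation is using a fast, multithreaded BLAS, a crude but effective test is simply to time a large matrix multiply and watch top in another window while it runs; the exact numbers will of course vary by machine. A minimal sketch:

n <- 4000
a <- matrix(rnorm(n^2), nrow = n)
system.time(a %*% a)   # a reference BLAS runs this on one core; OpenBLAS and the like use many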

Note that, as mentioned briefly in my last post, SA can achieve superlinear speedup in some instances — MORE than speedup k for k cores. As explained in my book, this occurs when the time complexity for n data points is more than O(n). In my lme4 example, there was about a 6X speedup for just 4 cores.




Partools, Recommender Systems and More

Recently I attended a talk by Stanford’s Art Owen, presenting work done with his student, Katelyn Gao. This talk touched on a number of my interests, both mathematical and computational. What particularly struck me was that Art and Katelyn are applying a very old — many would say very boring — method to a very modern, trendy application: recommender systems.  (See also a recent paper by P.O. Perry.)

The typical example of a recommender system involves movie review data, such as that in Netflix and MovieLens, with various users rating various movies. We have a matrix, with rows representing users and columns for movies, but it is a very sparse matrix, because most users don’t rate most movies. So, we’d like to guess the missing elements, a process sometimes termed the matrix completion problem.

There are many methods for this, such as matrix factorization, as exemplified in the R package NMF. I much prefer the approach taken by Art and Katelyn, as it is more statistical and thus potentially can provide more insight into the underlying processes.

I was inspired by their research to start a project that takes a new approach to random effects models in general. I'll explain that later, but let's first address the major computational problems that can occur with random effects models when applied to big data. Note carefully that the problem is not only long run times but also memory limitations. For example, Pennell and Dunson wrote, in Fitting Semiparametric Random Effects Models to Large Data Sets, Biostatistics, 2007, “SAS PROC MIXED ran out of memory when we attempted to fit a model with random smoking effects.”

One of the features of my partools package is what I call Software Alchemy, with the term alluding to the fact that the procedure turns a statistical algorithm into a statistically equivalent version that is easily parallelized. The idea is simple: Break i.i.d. data into chunks, apply the original algorithm to each chunk, then average the results. It’s easy to prove that the statistical accuracy of this method matches that of the original one, and one not only gets a speedup but also may solve the memory problem, since each chunk is smaller. See my paper on this, to appear soon in the Journal of Statistical Software, but also outlined in my book on parallel computation in data science.  (Chunking has been tried in random effects contexts before; see references in the Perry paper above.)
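To make the recipe concrete before turning to the movie data, here is a minimal, self-contained sketch of the chunk/fit/average pattern, using only base R's parallel package and an ordinary lm() fit rather than the partools API:

library(parallel)
cls <- makeCluster(4)
# chunk the (assumed i.i.d.) rows, fit on each chunk in parallel, then average the estimates
chunks <- split(mtcars, rep(1:4, length.out = nrow(mtcars)))
fits <- parLapply(cls, chunks, function(ch) coef(lm(mpg ~ wt + hp, data = ch)))
Reduce(`+`, fits) / length(fits)   # the chunk-averaged estimate
stopCluster(cls)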

Here is an example of all this, applied to the MovieLens data, 1-million records version. I’ll use Art and Katelyn’s model here, which writes the rating by user i of movie j as

Yij = μ + ai + bj + εij

where μ is an unknown constant and ai, bj and εij are independent, mean-0 random variables (not necessarily normal) with variances σa², σb² and σε². One can also add covariates, but we'll keep things simple here. In addition to estimating the variances, one can estimate the ai and bj, and thus predict the missing Yij. (See the b component in the output of lmer(), to be discussed below. It lists the estimated ai and bj, in that order.)
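For concreteness, here is a hedged sketch of how one might form such a prediction from the lmer() fit shown later in this post, pulling the estimates out with lme4's fixef() and ranef() accessors (the names V1, V2 and lmerout refer to the ml1m example below):

library(lme4)
predRating <- function(lmerout, userID, movieID) {
   ahat <- ranef(lmerout)$V1[as.character(userID), "(Intercept)"]    # estimated ai
   bhat <- ranef(lmerout)$V2[as.character(movieID), "(Intercept)"]   # estimated bj
   unname(fixef(lmerout)["(Intercept)"]) + ahat + bhat               # estimated mu + ai + bj
}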

Art and Katelyn (as well as Perry) use the Method of Moments rather than the standard methods for random effects models, so that (a) they reduce computation, by deriving explicit solutions and (b) are not tied to assumptions of normality. I use MM in my project too, but here I’ll use the famous R package for random effects models, lme4.  I ran the experiments here on a quad-core machine, using an R parallel cluster of that size.

I first ran without chunking:
> system.time(lmerout <- lmer(V3 ~ (1|V1) +  (1|V2),ml1m))
user system elapsed
1105.129 3.692 1108.946
> getME(lmerout,'theta')
V1.(Intercept) V2.(Intercept)
0.3994240 0.6732281

Not good — almost 20 minutes! So, I turned to partools. The main Software Alchemy function is cabase(). For a given statistical algorithm, one needs to specify the function that implements the algorithm and a function to extract the estimated parameters from the output of the algorithm. There is already a wrapper for lm() in the package, so I just modified it for lmer():


calmer <- function (cls, lmerargs) {
   ovf <- function(u) {   # run lmer() on this node's chunk of the data
      tmp <- paste("lmer(", lmerargs, ")", collapse = "")
      eval(parse(text = tmp))   # reconstructed step; the package wrapper may use its own helper here
   }
   estf <- function(lmerout)    # extract the variance-component estimates from a chunk's fit
      getME(lmerout, 'theta')
   cabase(cls, ovf, estf)
}

I had earlier used the partools function distribsplit() to distribute my data frame ml1m to the cluster nodes, in chunks of the same name. So, the call to cabase() does the estimation on each chunk, collects the results, and averages them. Here is what happened:

> system.time(print(calmer(cls,'V3 ~ (1|V1) + (1|V2),ml1m')))
 V2.(Intercept) V1.(Intercept)
[1,] 0.6935344 0.3921150
[2,] 0.6623718 0.3979066
[3,] 0.6505459 0.3967739
[4,] 0.6825985 0.4103681

V2.(Intercept) V1.(Intercept)
 0.6722626 0.3992909

 user system elapsed
 0.812 0.080 189.744

We got an almost 6-fold improvement! And note that we used only 4 cores; this is the superlinear effect that I discuss in my Software Alchemy paper. Just as important, we got essentially the same estimates as from the non-chunked computation. (Please note: The chunked version can be shown to have the same asymptotic variance as the original one; neither is an “approximation” to the other.)

But wait a minute. Didn't I say that Software Alchemy is intended for i.i.d. data? I did. In the paper I said that the theory could easily be extended to the case of independent but not identically distributed data, but that it would be difficult to state general conditions for this.

Random effects models generally do NOT assume the data are identically distributed. In Art and Katelyn’s paper, for instance, they define indicator variables zij representing that user i rated movie j, and treat these as constants. So, for example, the row counts — numbers of movies rated by each user — are treated as constants.

It’s my contention that such quantities should be treated as random, just like other user behavior. (We haven’t brought in variables such as movie genre here, but if we did, a similar statement would hold.) And once we do that, then voila!, we now have i.i.d. data, and Software Alchemy applies.

In my paper on arXiv, I also show why it doesn't matter: the MM estimators turn out to be essentially identical, regardless of whether these quantities are assumed fixed or random.

Moreover, the advantage of doing things this way goes beyond simply being able to use Software Alchemy.  Those quantities may be of statistical interest in their own right. For instance, consider a study of children within families. In the classical approach, the number of kids in a family in a random effects model would be treated as fixed. But research has shown that family size is negatively correlated with income, and the number of kids should be an important variable — a random variable — in the analysis.

Now, back to MM. The analysis will compute a lot of quantities involving within-user variances and so on. To do that, it will help to group together rows or columns of our input data frame, e.g. group by user and group by movie. How might we do this using partools?

The R function split() comes to mind, but there is an issue of combining the outputs of split(),  which are R lists, from the various cluster nodes. Note in particular that some user, for example, may have data only at one of the nodes. But this is exactly the situation that the partools function addlists() is designed to handle!
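Here is a toy sketch of the merging problem that addlists() handles, without the partools internals: two nodes each split() their rows by user, and any user appearing at both nodes must have her groups concatenated.

combineSplits <- function(lst1, lst2) {   # illustration only, not the partools code
   for (k in names(lst2))
      lst1[[k]] <- c(lst1[[k]], lst2[[k]])
   lst1
}
chunk1 <- split(1:6,  c('u1','u2','u1','u3','u2','u1'))   # node 1's groups
chunk2 <- split(7:10, c('u2','u4','u4','u1'))             # node 2's groups
combineSplits(chunk1, chunk2)                             # 'u1' and 'u2' merged across nodes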

That helps, but we still have to worry about the overhead incurred by shipping so much data back and forth between the worker nodes and the manager node in the cluster. Also, what about memory? Again, we should try, and may need, to avoid holding the entire data set in memory at one time. The partools function filesplit() helps in that regard, but this continues to be a computational challenge. It may be that the only good solutions involve C.

A New Method for Statistical Disclosure Limitation, I

The Statistical Disclosure Limitation (SDL) problem involves modifying a data set in such a manner that statistical analysis on the modified data is reasonably close to that performed on the original data, while preserving the privacy of individuals in the data set. For instance, we might have a medical data set on which we want to allow researchers to do their statistical analyses but not violate the privacy of the patients in the study.

In this posting, I’ll briefly explain what SDL is, and then describe a new method that Pat Tendick and I are proposing. Our paper is available as arxiv:1510.04406 and R code to implement the method is available on GitHub. See the paper for details.

This is a very difficult problem, one that arguably has not been fully solved, in spite of decades of work by some really sharp people. Some of the common methods are: adding mean-0 noise to each variable; finding pairs of similar records and then swapping their values of the sensitive variables; and (in the case in which all variables are categorical), suppressing cells that contain just 1 or a few cases.

As an example of the noise addition method, consider a patient data set that includes the variables Age and Income. Suppose a nefarious user of the data happens to have external knowledge that James is the oldest patient in the study. The Bad Guy can then issue a query asking for the income of the oldest patient (not mentioning James), thus revealing James's salary. But if the public version of the data has had noise added, James's listed income will not be his real one, and he may well not be the oldest listed patient anymore anyway.

Given the  importance of this topic — JSM 2015 had 3 separate sessions devoted to it — it is surprising that rather little public-domain software is available. The only R package I know of is sdcMicro on CRAN (which by the way includes an excellent vignette from which you can learn a lot about SDL).  NISS has the Java-based WebSwap (from whose docs you can also learn about SDL).

Aside from the availability of software, one big concern with many SDL methods is that the multivariate structure of the data may be distorted in the modification process. This is crucial, since most statistical analyses are multivariate in nature, e.g. regression, PCA etc., and thus a major distortion in the multivariate structure can result in seriously misleading estimates.

In the noise addition method, this can be achieved by setting the noise covariance matrix to that of the original data, but for the other methods maintaining the proper multivariate structure is a challenge.
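As a hedged illustration of that covariance-matching idea (this is the classical noise-addition approach, not the method Pat and I propose), one might do something like the following for all-continuous data, with eps a tuning parameter trading accuracy for privacy:

library(MASS)
addNoise <- function(dat, eps = 0.2) {   # dat assumed all-numeric
   x <- as.matrix(dat)
   noise <- mvrnorm(nrow(x), mu = rep(0, ncol(x)), Sigma = eps * cov(x))
   as.data.frame(x + noise)   # noise covariance is a scaled copy of the data's
}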

While arguably noise addition works well for data consisting only of continuous variables, and data swapping and cell suppression are often acceptable for the purely-categorical case, the mixed continuous-categorical setting is tough.

Our new method achieves both of the above goals. It (a) is applicable to any kind of data, including the mixed continuous-categorical case, and (b) maintains the correct multivariate structure. Rather counterintuitively, our method achieves (b) while actually treating the variables as (conditionally) independent.

The method has several tuning parameters. In some modern statistical methods, tuning parameters are a real pain, but in SDL, the more tuning parameters the better! The database administrator needs to have as many ways as possible to develop a public form of the database that has both good statistical accuracy and good privacy protection.

As an example, I took some Census data for the year 2000 (5% PUMS), involving programmers in Silicon Valley. In order to simulate an employee database, I sampled 5000 records, keeping the variables WageIncome, Age, Gender, WeeksWorked, MSDegree and PhD. See our paper for details, but here is a quick overview.

First, to see that goal (b) above has been maintained reasonably well, I ran a linear regression analysis, predicting WageIncome from the other variables. I did this twice, once for the original data and once for the modified set, for a given combination of values of the tuning parameters. Here are the estimated regression coefficients:

data     Age      Gender     WeeksWorked    MS         PhD
orig.    447.2    -9591.7    1286.4         17333.0    21291.3
modif.   466.1    -8423.2    1270.7         18593.9    22161.4


This is not bad. Each pair of coefficients is within one original standard error of the other (not shown). The database administrator could try lots of other combinations of the tuning parameters, and likely get even closer. But what about privacy?

In the original data set, there was exactly one female worker with a PhD and age under 31:

> p1[p1$sex==2 & p1$phd==1 & p1$age < 31,]
          age sex wkswrkd ms phd wageinc
7997 30.79517 2 52 0 1 100000

Which such workers, if any, are listed in the modified data?

> p1pc[p1pc$sex==2 & p1pc$phd==1 & p1pc$age < 31,]
           age sex wkswrkd ms phd wageinc
12522 30.5725   2      52  0   1   50000

There is only one person listed in the released data of the given description (female, PhD, age under 31). But she is listed as having an income of $50,000 rather than $100,000. In fact, it is a different person, worker number 12522, not 7997. (Of course, ID numbers would be suppressed.)

So what happened to worker number 7997?
> which(rownames(p1p) == 7997)
[1] 3236
> p1p[3236,]
age sex wkswrkd ms phd wageinc
7997 31.9746 1 52 0 1 100000

Ah, she became a man! That certainly hides her. Under another luck of the draw, her record may have become all NA values.

In this way, the database administrator can set up a number of statistical analysis test cases, and a number of records at high risk of identification, and then try various combinations of the tuning parameters in order to obtain a modified data set that achieves a desired balance between statistical fidelity and privacy.

Unbalanced Data Is a Problem? No, BALANCED Data Is Worse

Say we are doing classification analysis with classes labeled 0 through m-1. Let Ni be the number of observations in class i. There is much handwringing in the machine learning literature over situations in which there is a wide variation among the Ni. I will argue here, though, that the problem is much worse in the case in which there is — artificially — little or no variation among those sample sizes.

To simplify matters, in what follows I will take m = 2, with the population class probabilities denoted by p and 1-p. Let Y be 1 or 0, according to membership in Class 1 or 0, and let X be the vector of v predictor variables.

First, what about this problem of lack of balance? If your data are a random sample from the target population, then the lack of balance is natural if p is near 0 or 1, and there really isn’t much you can do about it, short of manufacturing data. (Some have actually proposed that, in various forms.) And with a parametric model, say a logit, you may do fairly well if the model is pretty accurate over the range of X. To be sure, the lack of balance may result in substantial within-class misclassification rates even if the overall rate is low. One can try different weightings and the like, but one is pretty much stuck with it.

But at least in this unbalanced situation, you will get consistent estimators of the regression function P(Y = 1 | X = t) as the sample size grows. That's not true for what I will call the artificially balanced case. Here the Ni are typically the same or nearly so, and arise from our doing separate samplings of each of the classes. Clearly we cannot estimate p in this case, and it matters. Here's why.

By an elementary derivation, we have that (at the population level)

P(Y = 1 | X = t) = 1 / (1 + [(1-p)/p] [f(t)/g(t)])    Eqn. (1)

where f and g are the densities of X within Classes 0 and 1.  Consider the logistic model. Equation (1) implies that

β0 + β1 t1 + … + βv tv = -ln[(1-p)/p] – ln[f(t)/g(t)]    Eqn. (2)

From this you can see that

β0 = -ln[(1-p)/p],   Eqn. (3)

which in turn implies that if the sample sizes are chosen artificially, then our estimate of β0 in the output of R's glm() function (or any other code for logit) will be wrong. If our goal is Prediction, this will cause a definite bias. And worse, it will be a permanent bias, in the sense that we will not have consistent estimates as the sample size grows.

So, arguably the problem of (artificially) balanced data is worse than the unbalanced case.

The remedy is easy, though. Equation (2) shows that even with the artificially balanced sampling scheme, our estimates of βi WILL be consistent for i > 0 (since the within-class densities of X won’t change due to the sampling scheme). So, if we have an external estimate of p, we can just substitute it in Equation (3) to get the right intercept term, and then happily do our future classifications.
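In code, the fix is essentially one line. Here is a hedged sketch, with glmout a logit fit on the artificially balanced sample, p the externally obtained class-1 probability, and psamp the class-1 proportion used in the sampling design (0.5 if fully balanced):

fixIntercept <- function(glmout, p, psamp = 0.5) {
   b <- coef(glmout)
   # Eqn. (3): replace the intercept implied by psamp with the one implied by the true p
   b[1] <- b[1] + log((1 - psamp)/psamp) - log((1 - p)/p)
   b   # the slope estimates are left alone; they are consistent either way
}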

As an example, consider the UCI Letters data set. There, the various English letters have approximately equal sample sizes, quite counter to what we know about English. But there are good published sources for the true frequencies.

Now, what if we take a nonparametric regression approach? We can still use Equation (1) to make the proper adjustment (a short code sketch of the adjustment appears after the list). For each t at which we wish to predict class membership, we do the following:

  • Estimate the left-hand side (LHS) of (1) nonparametrically, using any of the many methods on CRAN, or the version of kNN in my regtools package.
  • Solve for the estimated ratio f(t)/g(t).
  • Plug back into (1), this time with the correct value of (1-p)/p from the external source, now providing the correct value of the LHS.
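Steps 2 and 3 are pure algebra. Here is a hedged sketch, with muhat the nonparametric estimate of the LHS of (1) at t from the balanced sample, p the true class-1 probability from the external source, and psamp the artificial class-1 proportion:

adjustProb <- function(muhat, p, psamp = 0.5) {
   fg <- (1/muhat - 1) * psamp/(1 - psamp)   # step 2: solve (1) for f(t)/g(t)
   1 / (1 + (1 - p)/p * fg)                  # step 3: plug back in with the true p
}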

More on the Heteroscedasticity Issue

In my last post, I discussed R software, including mine, that handles heteroscedastic settings for linear and nonlinear regression models. Several readers had interesting comments and questions, which I will address here.

To review: Though most books and software assume homoscedasticity, i.e. constancy of the variance of the response variable at all levels of the predictor variables, this condition seldom if ever holds in real life. This can make a mockery of the reported standard errors, and thus render confidence intervals and tests rather meaningless. But fortunately, for the linear case there is asymptotic theory, due to Eicker, White and others, that gives us valid standard errors under heteroscedasticity. In R, this is available via the functions vcovHC() and hccm() in the CRAN packages sandwich and car, respectively.

I mentioned that I had adapted the E-W idea to the nonlinear realm, leveraging the linear-approximation core of the Gauss-Newton method for nonlinear least-squares computation, by calling vcovHC() on the linear approximation. The result is the function nlshc() in my package regtools. (Unfortunately, I erroneously used its old name, nlsvcovhc() in my post.) One simply calls nlshc() on an object of class ‘nls’ (output of R’s nls() function) to obtain standard errors that are valid under heteroscedasticity.
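Typical usage is then just a call or two. Here is a hedged sketch; the data frame and model are placeholders, not from the original posts:

library(regtools)
nlsout <- nls(y ~ b1 * x/(b2 + x), data = mydata, start = list(b1 = 1, b2 = 1))   # mydata is hypothetical
vchc <- nlshc(nlsout)   # heteroscedasticity-robust covariance matrix of the coefficient estimates
sqrt(diag(vchc))        # the robust standard errors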

Achim Zeileis, one of the authors of sandwich, then made some comments on my blog post. He pointed out that one can actually call sandwich() directly on an ‘nls’ object (again, due to the linear approximation). After reading my post, he tried that, and found that the resulting standard errors were more realistic than what one gets directly from nls(), but they still were not as good as mine, where “good” means correct confidence levels for confidence intervals and so on. He speculated that this was due to the particular variant of E-W that I use. Well, here is an update.

First, Achim’s speculation was right. I tried some of the other variants of E-W, and they didn’t do as well. Of course, there may be other factors at work as well.

Upon further investigation, I found another intriguing phenomenon. I tried nlshc() on one of the NIST datasets, as follows:

regftn <- function(t,u,b1,b2,b3) b1 * t / (t + b2 * (1 + u/b3))
bstart <- list(b1=1,b2=1,b3=1)
z <- nls(v ~ regftn(S,I,b1,b2,b3), data=vmkmki, start=bstart)
nlshc(z)   # with vcovHC() inside, this returned an all-NAs matrix

To my surprise, the output was an all-NAs matrix. This did not happen with hccm(). This is strange, as vcovHC() works fine otherwise, e.g. in the simulation I posted last time.

So, for the time being, in the new nlshc() I have replaced the call to vcovHC() by a call to hccm(). I’ve also added an option for the user to choose a different E-W variant, though given what we know so far, I don’t recommend it.

A couple of commenters asked about using alternatives to nls(), which in fact I had done with nlsLM(). I had done that because I had had some convergence problems with nls(), but actually have reverted to nls() in the new version of the package. (I will also make my function an S3 method later on, which I had originally decided not to do, but have changed my mind with Achim’s encouragement :-) )

Can You Say “Heteroscedasticity” 3 Times Fast?

Most books on regression analysis assume homoscedasticity, the situation in which Var(Y | X = t), for a response variable Y and vector of predictor variables X, is the same for all t. Yet, needless to say, almost all data in real life is heteroscedastic. For Y = human weight and X = height, say, we know that the assumption of homoscedasticity can’t be true, even approximately.

Typical books discuss assessment of that assumption using residual plots and the like, but then leave it at that. Rather few books mention Eicker-White theory, which develops valid asymptotic inference for heteroscedastic data. E-W is really nice, and guess what! — it’s long been available in R, in the sandwich and car packages on CRAN.  (Note that the latter package is intended for use with a book that does cover this topic, J. Fox and S. Weisberg, An R Companion to Applied Regression, Second Edition, Sage, 2011.) Then instead of using R’s standard vcov() function to obtain estimated variances and covariances of the estimates of the βi, we use vcovHC() and hccm(), respectively.
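In code, the swap is a single call. A hedged sketch, with mydf a hypothetical data frame:

library(sandwich)
library(car)
lmout <- lm(weight ~ height + age, data = mydf)   # mydf is hypothetical
vcov(lmout)     # classical covariance matrix, assumes homoscedasticity
vcovHC(lmout)   # sandwich's heteroscedasticity-consistent version
hccm(lmout)     # car's version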

One can make a similar derivation for nonlinear regression, which is available as the function nlshc() in my regtools package. The package will be associated with my own book on regression, currently in progress. (The package is currently in progress as well, but already contains several useful functions.) The rest of this post is adapted from an example in the book.

The model I chose for this simple example was E(Y | X = t) = 1 / t’β, where the distributions of the quantities can be seen in the following simulation code:

sim <- function(n,nreps) {
   b <- 1:2
   res <- replicate(nreps,{
      x <- matrix(rexp(2*n),ncol=2)
      meany <- 1 / (x %*% b)
      y <- meany + (runif(n) - 0.5) * meany   # heteroscedastic: variability grows with the mean
      xy <- data.frame(cbind(x,y))
      nlout <- nls(X3 ~ 1 / (b1*X1+b2*X2),
         data=xy,start=list(b1=1,b2=1))
      bhat <- coef(nlout)            # renamed from b so the true coefficients are not clobbered
      vc <- vcov(nlout)              # classical covariance matrix
      vchc <- nlsvcovhc(nlout)       # Eicker-White version (now nlshc() in regtools)
      z1 <- (bhat[1] - 1) / sqrt(vc[1,1])
      z2 <- (bhat[1] - 1) / sqrt(vchc[1,1])
      c(z1,z2)                       # return both standardized errors
   })
   print(mean(res[1,] < 1.28))
   print(mean(res[2,] < 1.28))
}

The results were striking:

> sim(250,2500)
[1] 0.6188
[1] 0.9096

Since the true value should be 0.90 (1.28 is the 0.90 quantile of N(0,1)), the Eicker-White method is doing an outstanding job here — and R's built-in nonlinear regression function, nls(), is not. The latter function's reported standard errors are way, way off the mark.

New R Software/Methodology for Handling Missing Data

I’ve added some missing-data software to my regtools package on GitHub. In this post, I’ll give an overview of missing-data methodology, and explain what the software does. For details, see my JSM paper, jointly authored with my student Xiao (Max) Gu.

There is a long history of development of techniques for handling missing data. See the famous book by Little and Rubin (currently second edition, third due out in December). The main methods in use today fall into two classes:

  • Complete-cases (CC): (Also known as listwise deletion.) This approach is simple — just delete any record for which at least one of the variables has a missing (NA, in R) value.
  • Multiple imputation (MI): These methods involve estimating the conditional distribution of a missing variable given the others, and then sampling from that distribution via simulation. Multiple alternate versions of the data matrix are generated, with the NA values replaced by values that plausibly could have been the missing ones. (A short code sketch follows this list.)
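As a quick, hedged illustration of the MI workflow (using the mice package mentioned below and its bundled nhanes data set, not our own software):

library(mice)
imp <- mice(nhanes, m = 5, printFlag = FALSE)   # generate 5 completed data sets
fits <- with(imp, lm(bmi ~ age + chl))          # fit the model on each one
summary(pool(fits))                             # pool the estimates and standard errors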

In our work, we revisited, and broadened the scope of, another class of methods, which had been considered in the early years of missing-data research but pretty much abandoned for reasons to be explained shortly:

  • Available cases (AC): (Also known as pairwise deletion.) If the statistical method involves computation on, say, various pairs of variables, include in such a calculation any observation for which this pair is intact, regardless of whether the other variables are intact. The same holds for triples of variables and so on.

The early work on AC involved linear regression analysis. Say we are predicting a scalar Y from a vector X. The classic OLS estimator is (U'U)^(-1) U'V, where U is the matrix of X values and V is the vector of Y values in our data. But if we center our data, that expression consists of the inverse of the sample covariance matrix of X, times the sample covariance of X and Y.

The key point is then that covariances only involve products of pairs of variables. As a simple example, say we are predicting human weight from height and age. Under AC, estimation of the covariance between weight and height can be done by using all records in the data for which the weight and height values are intact, even if age is missing. AC thus makes more thorough use of the data than does CC, and thus AC should be statistically more accurate than CC.
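Base R already expresses the AC principle for covariances. Here is a minimal sketch of the principle (not of the lmac() internals), with made-up data:

set.seed(1)
d <- data.frame(height = rnorm(100, 70, 3),
                age    = rnorm(100, 40, 10),
                weight = rnorm(100, 150, 20))
d$age[sample(100, 20)] <- NA                # age missing for some records
cov(d, use = "pairwise.complete.obs")       # the weight-height entry still uses all 100 rows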

However, CC and AC make more stringent assumptions (concerning the mechanism underlying missingness) than does MI. Hence the popularity of MI. In R, for instance, the packages mi, mice, Amelia and others handle missing data in general.

We used Amelia as our representative MI method. Unfortunately, it is very long-running. In a PCA computation that we ran, for example, CC and AC took 0.0111 and 1.967 seconds, respectively, while MI had a run time of  92.928 seconds. And statistically, it was not performing any better than CC and AC, so we did not include it in our empirical investigations, though we did analyze it otherwise.

Our experiments involved taking real data sets, then randomly inserting NA values, thus generating many versions of the original data. One of the data sets, for instance, was from the 2008 census, consisting of all programmers and engineers in Silicon Valley. (There were about 20,000 in the PUMS sample. This data set is included in the regtools package.) The following table shows the variances of CC and AC estimates of the first regression coefficient (the means were almost identical, essentially no bias):

NA rate    CC var.      AC var.
0.01       0.4694873    0.1387395
0.05       2.998764     0.7655222
0.10       8.821311     1.530692

As you can see, AC had much better accuracy than CC on this real data set, and in fact was better than CC on the other 3 real data sets we tried as well.

But what about the famous MCAR assumption underlying CC and AC, which is stricter than the MAR assumption of MI methods? We argue in the paper (too involved to even summarize here) that this may be much less of an issue than has been supposed.

One contribution of our work is to extend AC to non-covariance settings, namely log-linear models.

Please try the software (in the functions lmac(), pcac() and loglinac() in the package), and let me know your thoughts.

