How About a “Snowdoop” Package?

Along with all the hoopla on Big Data in recent years came a lot of hype on Hadoop.  This eventually spread to the R world, with sophisticated packages being developed such as rmr to run on top of Hadoop.

Hadoop made it convenient to process data in very large distributed databases, and also convenient to create them, using the Hadoop Distributed File System.  But eventually word got out that Hadoop is slow, and very limited in available data operations.

Both of those shortcomings are addressed to a large extent by the new kid on the block, Spark, which has an R interface package, sparkr.  Spark is much faster than Hadoop, sometimes dramatically so, due to strong caching ability and a wider variety of available operations.  Recently distributedR has also been released, again with the goal of using R on voluminous data sets, and there is also the more established pbdR.

However, I’d like to raise a question here:  Do we really need all that complicated machinery?  Below I’ll propose a much simpler alternative, and am very curious to see what people think.  (Disclaimer:  I have only limited experience with Hadoop, and only a bit with SparkR.)

These packages ARE complicated.  There is a considerable amount of configuration to do, worsened by dependence on infrastructure software such as Java or MPI, and in some cases by interface software such as rJava.  Some of this requires systems knowledge that many R users may lack.  And once they do get these systems set up, they may be required to design algorithms with world views quite different from R, even though they are coding in R.

Here is a possible alternative:  Simply use the familiar cluster-oriented portion of R’s parallel package, an adaptation of snow; I’ll refer to that portion of parallel as Snow, and just for fun, call the proposed package Snowdoop.  I’ll illustrate it with the “Hello world” of Hadoop, word count in a text file (slightly different from the usual example, as I’m just counting total words here, rather than the number of times each distinct word appears.)

(It’s assumed here that the reader is familiar with the basics of Snow.  If not, see the first chapter of the partial rough draft of my forthcoming book.)

Say we have a data set that we have partitioned into two files, words.1 and words.2.  In my example here, they will contain the R sign-on message, with words.1 consisting of

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

 Natural language support but running in an English locale

and words.2 containing

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

Here is our code:

 
# give each node in the cluster cls an ID number 
assignids <- function(cls) {    
   clusterApply(cls,1:length(cls), 
      function(i) myid <<- i) 
} 

# each node executes this function 
getwords <- function(basename) { 
   fname <- paste(basename,".",myid,sep="")
   words <- scan(fname,what="") 
   length(words) 
} 

# manager 
wordcount <- function(cls,basename) { 
   assignids(cls) 
   clusterExport(cls,"getwords") 
   counts <- clusterCall(cls,getwords,basename)
   sum(unlist(counts)) 
}

# call example:
> library(parallel)
> c2 <- makeCluster(2)
> wordcount(c2,"words")
[1] 83


 

This couldn’t be simpler.  Yet it does what we want:

  • parallel computation on chunks of a distributed file, on independently-running nodes
  • automated “caching” (use the R <<- operator with the output of scan() above; see the sketch following this list)
  • no configuration or platform worries
  • ordinary R programming, no “foreign” concepts
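
Here, for instance, is one way the caching idea might look.  This is just a sketch; getwordsc and mywords are names I’m introducing for illustration.  Each worker reads its file chunk only once, keeps it in its global workspace via <<-, and reuses it in any later operations on the same data.

# variant of getwords() that caches the chunk on the worker
getwordsc <- function(basename) {
   if (!exists("mywords")) {
      fname <- paste(basename,".",myid,sep="")
      mywords <<- scan(fname,what="")  # cached for subsequent calls
   }
   length(mywords)
}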

Indeed, it’s so simple that Snowdoop would hardly be worthy of being called a package.  It could include some routines for creating a chunked file, general file read/write routines, parallel load/save and so on, but it would still be a very small package in the end.
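
For example, a chunked-file creation routine might look something like the following.  This is only a sketch; filesplit is a hypothetical name, and splitIndices() comes from the parallel package.

# split infile into nch roughly equal chunks, named basename.1, basename.2, ...
filesplit <- function(infile,basename,nch) {
   lns <- readLines(infile)
   idxs <- splitIndices(length(lns),nch)
   for (i in 1:nch)
      writeLines(lns[idxs[[i]]],paste(basename,".",i,sep=""))
}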

Granted, there is no data redundancy built in here, and we possibly lose pipelining effects, but otherwise, it seems fine.  What do you think?

Count Your BLAS-ings

One nice thing about open-source software is that users often have a lot of choices.  Such is the case with R, for instance the thousands of contributed packages available on CRAN.  My focus here is on BLAS, the core of matrix operations in R, where again there are interesting choices available to users who wish to take advantage of them.

The Basic Linear Algebra Subprograms (BLAS) have been around for many decades, used throughout science and engineering, in various implementations.  A fairly efficient BLAS implementation is included with R, but for those with heavy linear algebra needs, several open-source alternatives are available, such as ATLAS, ACML and OpenBLAS, as well as the commercial Intel MKL.  (Recently Revolution Analytics announced an open version of their R platform that includes the MKL.)

Here I will discuss  OpenBLAS, a library currently attractive to many, due to its open-source nature and ability to make use of the multicore machines that are so common today.   I’ll focus on numerical accuracy.

This should not be considered a detailed tutorial on OpenBLAS, but here is a “hand-waving” overview of its usage and installation. Usage couldn’t be simpler, actually; you just continue business as usual,  with OpenBLAS transparently doing what base-R BLAS has always done for you.   Under more advanced usage, you might try to tweak things by setting the number of cores.

Installation is only a bit more elaborate, if you are comfortable building R from source.  At the configure stage, I ran

configure --prefix=/home/matloff/MyR311 --enable-BLAS-shlib

After running the usual make and make install, I then needed to symbolically link libRblas.so to the OpenBLAS library.  Since I’ve been doing timing comparisons for my book, I’ve made shell aliases to run either the stock BLAS or OpenBLAS; a more sophisticated approach would have been to use update-alternatives.

No doubt about it, OpenBLAS is fast, and many timing comparisons for R are to be found on the Web, including the Revo link above.  But what about numerical accuracy?  After seeing Mike Hannon’s recent post on R-help, along with Brian Ripley’s reply, I became curious, so I searched the Web for information on this aspect and came up empty-handed.  I will present here the results of some simple experiments I’ve done as a result.

First, though, a disclaimer:  Although I know the basics of numerical analysis, I am not an expert in any sense, including the sense of being an expert on the various BLASes.  If anyone out there has more to add, that would be highly appreciated.

OpenBLAS derives its speed not just from making use of multiple cores, but also from various tweaks of the code, yielding a very fine degree of optimization.  One can thus envision a development team (which, by the way, took over the old Goto BLAS project) so obsessed with speed that they might cut some corners regarding numerical accuracy.  Thus the latter is a subject of legitimate concern.

For my little test here, I chose to compute eigenvalues, using R’s eigen() function.  I generated p x p unit covariance matrices (1s on the diagonal, ρ everywhere off the diagonal) for my test:

# p x p matrix with 1s on the diagonal and rho everywhere off the diagonal
covrho <- function(p,rho) {
   m <- diag(p)
   m[row(m) != col(m)] <- rho
   m
}
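
A sketch of how such a comparison can be run (my own reconstruction, not the exact code used); the same lines were run once under the stock BLAS and once under OpenBLAS:

m <- covrho(2500,0.95)
print(system.time(eout <- eigen(m)))  # elapsed time depends on the BLAS
eout$values[1]                        # principal eigenvalue, for comparison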

I tried this with various values of p and ρ; here I’ll show the results for 2500 and 0.95, respectively.  The machine used has 16 cores, plus a hyperthreading degree of 2; OpenBLAS likely used 32 threads.

With the standard R BLAS, the elapsed time was 57.407 seconds.  Under OpenBLAS, that time was reduced to 12.101.

But interestingly, the first eigenvalue was found to be 2375.05 in both cases.  (This was the exact value in eout$values[1], where eout was the return value from eigen().)

Changing ρ to 0.995, I got reported principal eigenvalues of 2487.505 in both cases.  (Timings were roughly as before.)

As another example, I also tried finding the matrix inverse for this last matrix, using solve().   Both versions of BLAS gave 199.92 as the [1,1] element of the inverse.  Interestingly, though, there was a wider time discrepancy, 53.955 seconds versus 0.933.

It is a little odd that the numbers come out with so few decimal places.  I wonder whether R is deliberately doing some rounding, based on estimates of accuracy.  In any event, I would generally caution against looking at too many decimal places, no matter how good the accuracy is, since typically the input data itself is not so accurate.

So, it seems, at first glance, that OpenBLAS is doing fine.  But Brian has an excellent point about the value of sticking with the tried-and-true, in this case meaning, R’s default BLAS implementation.

I invite you to try your own accuracy comparisons, and post them here.

Why Are We Still Teaching t-Tests?

My posting about the statistics profession losing ground to computer science drew many comments, not only here in Mad (Data) Scientist, but also in the co-posting at Revolution Analytics, and in Slashdot.  One of the themes in those comments was that Statistics Departments are out of touch and have failed to modernize their curricula.  Though I may disagree with the commenters’ definitions of “modern,” I have in fact long felt that there are indeed serious problems in statistics curricula.

I must clarify before continuing that I do NOT advocate that, to paraphrase Shakespeare, “First thing we do, we kill all the theoreticians.”   A precise mathematical understanding of the concepts is crucial to good applications.  But stat curricula are not realistic.

I’ll use Student t-tests to illustrate.  (This is material from my open-source book on probability and statistics.)  The t-test is an exemplar of the curricular ills in three separate senses:

  • Significance testing has long been known to be under-informative at best, and highly misleading at worst.  Yet it is the core of almost any applied stat course.  Why are we still teaching — actually highlighting — a method that is recognized to be harmful?
  • We prescribe the use of the t-test in situations in which  the sampled population has an exact normal distribution — when we know full well that there is no such animal.  All real-life random variables are bounded (as opposed to the infinite-support normal distributions) and discrete (unlike the continuous normal family).  [Clarification, added 9/17:  I advocate skipping the t-distribution,  and going directly to inference based on the Central Limit Theorem.  Same for regression.  See my book.]
  • Going hand-in-hand with the t-test is the sample variance. The classic quantity s² is an unbiased estimate of the population variance σ², with s² defined as 1/(n-1) times the sum of squares of our data relative to the sample mean.  The concept of unbiasedness does have a place, yes, but in this case there really is no point to dividing by n-1 rather than n.  Indeed, even if we do divide by n-1, it is easily shown that the quantity that we actually need, s rather than s², is a BIASED (downward) estimate of σ.  So that n-1 factor is much ado about nothing.

Right from the beginning, then, in the very first course a student takes in statistics, the star of the show, the t-test, has three major problems.

Sadly, the R language largely caters to this old-fashioned, unwarranted thinking.  The var() and sd() functions use that 1/(n-1) factor, for example — a bit of a shock to unwary students who wish to find the variance of a random variable uniformly distributed on, say, 1,2,…,10.
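
To make the point concrete, here is roughly the computation such a student might try, using only base R:

x <- 1:10               # X uniformly distributed on 1,2,...,10
var(x)                  # 9.166667, due to the 1/(n-1) factor
mean((x - mean(x))^2)   # 8.25, the actual variance of this distribution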

Much more importantly, R’s statistical procedures are centered far too much on significance testing.  Take ks.test(), for instance; all one can do is a significance test, when it would be nice to be able to obtain a confidence band for the true cdf.  Or consider log-linear models:  The loglin() function is so centered on testing that the user must proactively request parameter estimates, never mind standard errors.  (One can get the latter by using glm() as a workaround, but one shouldn’t have to do this.)
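
As an illustration of what is missing, here is a minimal sketch of how one might form a confidence band for the true cdf, via the Dvoretzky-Kiefer-Wolfowitz inequality; the function name and interface here are my own invention, not part of any package:

# approximate (1-alpha) confidence band for the true cdf F:
# sup_t |Fn(t) - F(t)| <= eps, with probability at least 1-alpha
cdfband <- function(x,alpha=0.05) {
   n <- length(x)
   eps <- sqrt(log(2/alpha) / (2*n))
   Fn <- ecdf(x)
   ts <- sort(x)
   data.frame(t=ts, lower=pmax(Fn(ts)-eps,0), upper=pmin(Fn(ts)+eps,1))
}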

I loved the suggestion by Frank Harrell in r-devel to at least remove the “star system” (asterisks of varying numbers for different p-values) from R output.  A Quixotic action on Frank’s part (so of course I chimed in, in support of his point); sadly, no way would such a change be made.  To be sure, R in fact is modern in many ways, but there are some problems nevertheless.

In my blog posting cited above, I was especially worried that the stat field is not attracting enough of the “best and brightest” students.  Well, any thoughtful student can see the folly of claiming the t-test to be “exact.”  And if a sharp student looks closely, he/she will notice the hypocrisy of using the 1/(n-1) factor in estimating variance for comparing two general means, but NOT doing so when comparing two proportions.  If unbiasedness is so vital, why not use 1/(n-1) in the proportions case, a skeptical student might ask?

Some years ago, an Israeli statistician, upon hearing me kvetch like this, said I would enjoy a book written by one of his countrymen, titled What’s Not What in Statistics.  Unfortunately, I’ve never been able to find it.  But a good cleanup along those lines of the way statistics is taught is long overdue.

Good for TI, Good for Schools, Bad for Kids, Bad for Stat

In my last post, I agreed with Prof. Xiao-Li Meng that Advanced Placement (AP) Statistics courses turn off many students to the statistics field, by being structured in a manner that makes for a boring class.  I cited as one of the problems the fact that the course officially requires TI calculators.  This is a sad waste of resources, as the machines are expensive while R is free, and R is capable of doing things that are much more engaging for kids.

Interestingly, this week the Washington Post ran an article on the monopoly that TI calculators have in the schools.  This was picked up by a Slashdot poster, who connected it to my blog post on AP Stat.  The Post article has some interesting implications.

As the article notes, it’s not just an issue of calculators vs. R.  It’s an issue of calculators in general vs. the TI calculator.  Whether by shrewd business strategy or just luck, TI has attained a structural monopoly.  The textbooks and standardized exams make use of TI calculators, which forces all the teachers to use that particular brand.

Further reinforcing that monopoly are the kickbacks, er, donations to the schools.  When my daughter was in junior high school and was told by the school to buy a TI calculator, I noticed at the store that Casio calculators were both cheaper and had more capabilities.  I asked the teacher about this, and she explained that TI makes donations to the schools.

All this shows why Ms. Chow, the Casio rep quoted in the article, is facing an uphill battle in trying to get schools to use her brand. But there is also something very troubling about Chow’s comment, “That is one thing we do struggle with, teachers worried about how long it is going to take them to learn [Casio products].”  Math teachers would have trouble learning to use a calculator?  MATH teachers?!  I am usually NOT one to bash the U.S. school system, but if many math teachers are this technically challenged, one must question whether they should be teaching math in the first place.  This also goes to the point in my last blog post that kids generally are not getting college-level instruction in the nominally college-level AP Stat courses.

Chow’s comment also relates to my speculation that, if there were a serious proposal to switch from TI to R, the biggest source of resistance would be the AP Stat teachers themselves.  Yet I contend that even they would find that it is easy to learn R to the level needed, meaning being able to do what they currently do on TIs—and to go further, such as analyzing large data sets that engage kids, producing nice color graphics.  This is not hard at all; the teachers don’t need to become programmers.

The Post article also brings up the issue of logistics.  How would teachers give in-class tests in an R-based AP Stat curriculum?  How would the national AP Stat exam handle this?

Those who dismiss using R for AP Stat on such logistical grounds may be shocked to know that the AP Computer Science exam is not conducted with a live programmable computer at hand either. It’s all on paper, with the form of the questions being designed so that a computer is not needed.  (See the sample test here.)  My point is that, if even a test that is specifically about programming can be given without a live computer present, certainly the AP Stat course doesn’t need one either.  For that matter, most questions on the AP Stat exam  concentrate on concepts, not computation, anyway, which is the way it should be.

The teachers should demand a stop to this calculator scam, and demand that the textbooks, AP Stat exam etc. be based on R (or some other free software) rather than on expensive calculators. The kids would benefit, and so would the field of statistics.

Statistics: Losing Ground to CS, Losing Image Among Students

The American Statistical Association (ASA) leadership, and many in Statistics academia, have been undergoing a period of angst the last few years.  They worry that the field of Statistics is headed for a future of reduced national influence and importance, with the feeling that:

  • The field is to a large extent being usurped by other disciplines, notably Computer Science (CS).
  • Efforts to make the field attractive to students have largely been unsuccessful.

I had been aware of these issues for quite a while, and thus was pleasantly surprised last year to see then-ASA president Marie Davidian write a plaintive editorial titled, “Aren’t We Data Science?”

Good, the ASA is taking action, I thought.  But even then I was startled to learn during JSM 2014 (a conference tellingly titled “Statistics:  Global Impact, Past, Present and Future”) that the ASA leadership is so concerned about these problems that it has now retained a PR firm.

This is probably a wise move–most large institutions engage in extensive PR in one way or another–but it is a sad statement about how complacent the profession has become.  Indeed, it can be argued that the action is long overdue; as a friend of mine put it, “They [the statistical profession] lost the PR war because they never fought it.”

In this post, I’ll tell you the rest of the story, as I see it, viewing events as a statistician, computer scientist and R enthusiast.

CS vs. Statistics

Let’s consider the CS issue first.  Recently a number of new terms have arisen, such as data science, Big Data, and analytics, and the popularity of the term machine learning has grown rapidly.  To many of us, though, this is just  “old wine in new bottles,” with the “wine” being Statistics.  But the new “bottles” are disciplines outside of Statistics–especially CS.

I have a foot in both the Statistics and CS camps.  I’ve spent most of my career in the Computer Science Dept. at the University of California, Davis, but I began my career in Statistics at that institution.  My mathematics doctoral thesis at UCLA was in probability theory, and my first years on the faculty at Davis focused on statistical methodology.  I was one of the seven charter members of the Department of Statistics.   Though my departmental affiliation later changed to CS, I never left Statistics as a field, and most of my research in Computer Science has been statistical in nature.  With such “dual loyalties,” I’ll refer to people in both professions via third-person pronouns, not first, and I will be critical of both groups.  (A friend who read a draft of this post joked it should be titled “J’accuse”  but of course this is not my intention.)   However, in keeping with the theme of the ASA’s recent actions, my essay will be Stat-centric:  What is poor Statistics to do?

Well then, how did CS come to annex the Stat field?  The primary cause, I believe, came from the CS subfield of Artificial Intelligence (AI).  Though there always had been some probabilistic analysis in AI, in recent years the interest has been almost exclusively in predictive analysis–a core area of Statistics.

That switch in AI was due largely to the emergence of Big Data.  No one really knows what the term means, but people “know it when they see it,” and they see it quite often these days.  Typical data sets range from large to huge to astronomical (sometimes literally the latter, as cosmology is one of the application fields), necessitating that one pay key attention to the computational aspects.  Hence the term data science, combining quantitative methods with speedy computation, and hence another reason for CS to become involved.

Involvement is one thing, but usurpation is another.  Though not a deliberate action by any means, CS is eclipsing Stat in many of Stat’s central areas.  This is dramatically demonstrated by statements that are made like,  “With machine learning methods, you don’t need statistics”–a punch in the gut for statisticians who realize that machine learning really IS statistics.  ML goes into great detail in certain aspects, e.g. text mining, but in essence it consists of parametric and nonparametric curve estimation methods from Statistics, such as logistic regression, LASSO, nearest-neighbor classification, random forests, the EM algorithm and so on.

Though the Stat leaders seem to regard all this as something of an existential threat to the well-being of their profession, I view it as much worse than that.  The problem is not that CS people are doing Statistics, but rather that they are doing it poorly:  Generally the quality of CS work in Stat is weak.  It is not a problem of quality of the researchers themselves; indeed, many of them are very highly talented.  Instead, there are a number of systemic reasons for this, structural problems with the CS research “business model”:

  • CS, having grown out of research on fast-changing software and hardware systems, became accustomed to the “24-hour news cycle”–very rapid publication rates, with the venue of choice being (refereed) frequent conferences rather than slow journals.  This leads to research work being less thoroughly conducted, and less thoroughly reviewed, resulting in poorer quality work.  The fact that some prestigious conferences have acceptance rates in the teens or even lower doesn’t negate these realities.
  • Because CS Depts. at research universities tend to be housed in Colleges of Engineering, there is heavy pressure to bring in lots of research funding, and produce lots of PhD students.  Large amounts of time are spent on trips to schmooze funding agencies and industrial sponsors, writing grants, meeting conference deadlines and managing a small army of doctoral students–instead of time spent in careful, deep, long-term contemplation about the problems at hand.  This is made even worse by the rapid change in the fashionable research topic du jour, making it difficult to go into a topic in any real depth.  Offloading the actual research onto a large team of grad students can result in faculty not fully applying the talents they were hired for; I’ve seen too many cases in which the thesis adviser is not sufficiently aware of what his/her students are doing.
  • There is rampant “reinventing the wheel.”  The above-mentioned  lack of “adult supervision” and lack of long-term commitment to research topics results in weak knowledge of the literature.  This is especially true for knowledge of the Stat literature, which even the “adults” tend to have very little awareness of.  For instance, consider a paper on the use of mixed labeled and unlabeled training data in classification.  (I’ll omit names.)   One of the two authors is one of the most prominent names in the machine learning field, and the paper has been cited over 3,000 times, yet the paper cites nothing in the extensive Stat literature on this topic, consisting of a long stream of papers from 1981 to the present.
  • Again for historical reasons, CS research is largely empirical/experimental in nature.  This causes what in my view is one of the most serious problems plaguing CS research in Stat–lack of rigor.  Mind you, I am not saying that every paper should consist of theorems and proofs or be overly abstract; data- and/or simulation-based studies are fine.  But there is no substitute for precise thinking, and in my experience, many (nominally) successful CS researchers in Stat do not have a solid understanding of the fundamentals underlying the problems they work on.  For example, a recent paper in a top CS conference incorrectly stated that the logistic classification model cannot handle non-monotonic relations between the predictors and response variable; the paper really stressed this point, yet actually, one can add quadratic terms and so on to model this (see the sketch following this list).
  • This “engineering-style” research model causes a cavalier attitude towards underlying models and assumptions.  Most empirical work in CS doesn’t have any models to worry about.  That’s entirely  appropriate, but in my observation it creates a mentality that inappropriately carries over when CS researchers do Stat work.  A few years ago, for instance, I attended a talk by a machine learning specialist who had just earned her PhD at one of the very top CS Departments  in the world.  She had taken a Bayesian approach to the problem she worked on, and I asked her why she had chosen that specific prior distribution.  She couldn’t answer–she had just blindly used what her thesis adviser had given her–and moreover, she was baffled as to why anyone would want to know why that prior was chosen.
  • Again due to the history of the field, CS people tend to have grand, starry-eyed ambitions–laudable, but a double-edged sword.   On the one hand, this is a  huge plus, leading to highly impressive feats such as recognizing faces in a crowd.  But this mentality leads to  an oversimplified view of things,  with everything being viewed as a paradigm shift.  Neural networks epitomize this problem.  Enticing phrasing such as “Neural networks work like the human brain” blinds many researchers to the fact that neural nets are not fundamentally different from other parametric and nonparametric methods for regression and classification.   (Recently I was pleased to discover–“learn,” if you must–that the famous book by Hastie, Tibshirani and Friedman complains about what they call “hype” over neural networks; sadly, theirs is a rare voice on this matter.)  Among CS folks, there is often a failure to understand that the celebrated accomplishments of “machine learning” have been mainly the result of applying a lot of money, a lot of people time, a lot of computational power and prodigious amounts of tweaking to the given problem–not because fundamentally new technology has been invented.
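
Regarding the logistic regression point above, here is a small simulated illustration (toy data of my own construction) of how a quadratic term lets the model capture a non-monotonic relation:

# P(Y=1) peaks at x = 0 and falls off on both sides -- non-monotonic
set.seed(1)
x <- runif(1000,-2,2)
y <- rbinom(1000,1,plogis(1 - x^2))
fit <- glm(y ~ x + I(x^2), family=binomial)
summary(fit)  # the I(x^2) coefficient comes out strongly negative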

All this matters–a LOT.  In my opinion, the above factors result in highly lamentable opportunity costs.   Clearly, I’m not saying that people in CS should stay out of Stat research.  But the sad truth is that the usurpation process is causing precious resources–research funding, faculty slots, the best potential grad students, attention from government policymakers, even attention from the press–to go quite disproportionately to CS, even though Statistics is arguably better equipped to make use of them.   This is not a CS vs. Stat issue; Statistics is important to the nation and to the world, and if scarce resources aren’t being used well, it’s everyone’s loss.

Making Statistics Attractive to Students

This of course is an age-old problem in Stat.  Let’s face it–the very word statistics sounds hopelessly dull.  But I would argue that a more modern development is making the problem a lot worse–the Advanced Placement (AP) Statistics courses in high schools.

Professor Xiao-Li Meng has written extensively about the destructive nature of AP Stat.  He observed, “Among Harvard undergraduates I asked, the most frequent reason for not considering a statistical major was a ‘turn-off’ experience in an AP statistics course.”  That says it all, doesn’t it?  And though Meng’s views predictably sparked defensive replies in some quarters, I’ve had exactly the same experiences as Meng in my own interactions with students.  No wonder students would rather major in a field like CS and study machine learning–without realizing it is Statistics.  It is especially troubling that Statistics may be losing the “best and brightest” students.

One of the major problems is that AP Stat is usually taught by people who lack depth in the subject matter.  A typical example:  a student complained to me that even though he had attended a top-quality high school in the heart of Silicon Valley, his AP Stat teacher could not answer his question as to why it is customary to use n-1 rather than n in the denominator of s².  But even that lapse is really minor, compared to the lack among the AP teachers of the broad overview typically possessed by Stat professors teaching university courses, in terms of what can be done with Stat, what the philosophy is, what the concepts really mean and so on.  AP courses are ostensibly college level, but the students are not getting college-level instruction.  The “teach to the test” syndrome that pervades AP courses in general exacerbates this problem.

The most exasperating part of all this is that AP Stat officially relies on TI-83 pocket calculators as its computational vehicle.  The machines are expensive, and after all we are living in an age in which R is free!  Moreover, the calculators lack the dazzling graphics and the ability to analyze nontrivial data sets that R provides–exactly the kinds of things that motivate young people.

So, unlike the “CS usurpation problem,” whose solution is unclear, here is something that actually  can be fixed reasonably simply.  If I had my druthers, I would simply ban AP Stat, and actually, I am one of those people who would do away with the entire AP program.   Obviously, there are too many deeply entrenched interests for this to happen, but one thing that can be done for AP Stat is to switch its computational vehicle to R.

As noted, R is free and multiplatform, with outstanding graphical capabilities.  There is no end to the number of data sets teenagers would find attractive for R use, say the Million Song Data Set.

As to a textbook, there are many introductions to Statistics that use R, such as Michael Crawley’s Statistics: An Introduction Using R, and Peter Dalgaard’s Introductory Statistics with R.  But to really do it right, I would suggest that a group of Stat professors collaboratively write an open-source text, as has been done for instance for Chemistry.  Examples of interest to high schoolers should be used, say this engaging analysis on OkCupid.

This is not a complete solution by any means.  There still is the issue of AP Stat being taught by people who lack depth in the field, and so on.  And even switching to R would meet with resistance from various interests, such as the College Board and especially the AP Stat teachers themselves.

But given all these weighty problems, it certainly would be nice to do something, right?  Switching to R would be doable–and should be done.

A Matrix Powers Package, and Some General Edifying Material on R

Here I will introduce matpow, a package to flexibly and conveniently compute matrix powers.  But even if you are not interested in matrices, I think many of you will find that this post contains much general material on R that you’ll find useful.  Indeed, most of this post will be about general R issues, not so much about matrices per se.  So, bear with me.

Why matrix powers?  

Sadly, most university beginning linear algebra courses say very little about why the material is important.  Worse, if some motivation is presented, it tends to be physics-oriented.  Indeed, the current trend in U.S. universities is to fold the introductory linear algebra curriculum into the differential equations course.

Most readers of this blog know better, as they are (to various extents) aware of the major roles that matrices play in statistics, e.g. in linear models and principal components analysis.  But did you know that matrices play a major role in the analysis of social networks?  We’re not in Physicsland anymore, Toto.

Matrix powers are useful here.  For instance, suppose we wish to determine whether a network (or graph) is connected, meaning that every node leads to every other node in one or more steps.  (The famous Six Degrees of Separation notion comes from this context.) It turns out that this can be determined by taking powers of the adjacency matrix A of the network, whose (i,j) element is 1 if there is a one-step link from i to j, 0 otherwise.  (For technical reasons, we will set the diagonal of A to all 1s.)  If some power of A has all its elements nonzero, then the graph is connected; it’s disconnected if this situation is never reached.  (One need calculate only up to the power n-1, where n is the number of rows and columns of the matrix.)  Moreover, eigenanalysis of A yields other important information about the network, and one way to compute eigenvectors of a matrix involves finding its powers too.
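
To make the connectivity check concrete, here is a bare-bones version in plain R (a sketch only; matpow generalizes this to other matrix classes and adds callbacks):

# is the graph with adjacency matrix adj connected?
connected <- function(adj) {
   diag(adj) <- 1              # the convention mentioned above
   n <- nrow(adj)
   p <- adj
   for (r in seq_len(n-2)) {   # examine powers 1 through n-1
      if (all(p > 0)) return(TRUE)
      p <- p %*% adj
   }
   all(p > 0)
}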

But why have a package to compute the powers?

Isn’t this already straightforward in R?  For instance, the third power of a matrix m is computed via the code

m %*% m %*% m

and higher powers can be coded with a simple loop.  But we needed to be much more flexible and versatile than this in developing matpow.

1.   Generality:   Our matpow package accommodates various classes of matrices.

There are many of these in the R world.  In addition to the basic “matrix” class, there is also the “Matrix” class (from the Matrix package), the “big.matrix” class (from bigmemory) and so on.  Syntax can vary; with bigmemory, for instance, one must use brackets in assignment, e.g.

m1[,]

And most important, not all of these use %*% as their multiplication operator.  For example, in gputools one uses the function gpuMatMult() for this purpose.

2.  Accommodation of parallel computation:  Today’s matrices can be huge, so parallel computation would be nice.  Our matpow package does not itself include facilities for parallel computation, but it is designed to integrate well with external parallel methods, as mentioned above for gputools, an R package that interfaces to GPUs.

3.  Capability to include callback functions, to perform application-specific operations after each new power is calculated.

For instance, consider the social network example again.  As we compute higher and higher powers of A, we want to check for an “all non zeroes” condition after each iteration; if the condition is found to hold, there is no point in doing any more iterations.  We can have our callback function check for this.

And there is more.  The (i,j) element in the rth power of A can be shown to be the number of ways to go from node i to node j in r steps. If that element is now positive but was 0 in power r-1 of A, we now know that the shortest path from i to j consists of r steps.  So, we can have our callback function watch for changes from zero to nonzero, and record them, and thus have the overall internode distance matrix in the end.
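
Stripped of matpow’s generality, that distance computation amounts to something like this sketch in plain R (graphdists is my own name for it):

# record the power r at which each (i,j) entry first becomes nonzero;
# that r is the shortest internode distance from i to j
graphdists <- function(adj) {
   diag(adj) <- 1
   n <- nrow(adj)
   dst <- matrix(NA,n,n)
   dst[adj > 0] <- 1
   p <- adj
   for (r in 1 + seq_len(n-2)) {   # powers 2 through n-1
      p <- p %*% adj
      dst[p > 0 & is.na(dst)] <- r
   }
   dst
}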

A few quick examples:

Here is a quick tour, including the squaring option.  There we square the input matrix, then square the result and so on, thus requiring only a logarithmic number of steps to reach our desired power.  This is much faster if we don’t need the intermediate powers.

> m <- rbind(1:2,3:4)
> ev <- matpow(m,16)  # find m to 16th power
> ev$prod1  # here it is
[,1] [,2]
[1,] 115007491351 1.67615e+11
[2,] 251422553235 3.66430e+11
> ev$i # last iteration was 15th
[1] 15
> ev <- matpow(m,16,squaring=TRUE)
> ev$prod1  # same result
[,1] [,2]
[1,] 115007491351 1.67615e+11
[2,] 251422553235 3.66430e+11
> ev$i  # but with only 4 iterations
[1] 4

# test network connectivity
> m <-    rbind(c(1,0,0,1),c(1,0,1,1),c(0,1,0,0),c(0,0,1,1))
> ev <- matpow(m,callback=cgraph,mindist=T)
> ev$connected
[1] TRUE
> ev$dists
[,1] [,2] [,3] [,4]
[1,] 1 3 2 1
[2,] 1 1 1 1
[3,] 2 1 1 2
[4,] 3 2 1 1

 

How can we deal with different matrix classes/multiplication syntaxes?

As noted, one problem we needed to deal with in developing matpow was how to accommodate diverse matrix classes and multiplication syntaxes.  We solved that problem by using R’s eval() and parse() functions.  Consider the following toy example:

> x <- 28
> s <- "x <- 16"
> eval(parse(text=s))
> x
[1] 16

Of course, here it is just a very roundabout way to set x to 16.   But for matpow, it’s just what we need.  We want our multiplication command in string form to be something like prod <- m %*% m in the “matrix” case,  prod[,] <- m[,] %*% m[,] in the “big.matrix” (bigmemory) case, and so on.

The (built-in or user-supplied) argument genmulcmd (“generate multiplication command”) to matpow() handles this.  For instance, here is the built-in one for gputools:

 genmulcmd.gputools <- function (a, b, c) 
   paste(c, " <- gpuMatMult(", a, ",", b, ")")

We then can plug the string into eval() and parse().  In this manner, we can switch matrix types by changing not even a full line of code, just an argument to matpow().
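
For ordinary “matrix” objects, the corresponding generator would be something along these lines (my own sketch; the built-in version in matpow may differ in name and detail):

genmulcmd.plain <- function(a, b, c)
   paste(c, "<-", a, "%*%", b)

m <- rbind(1:2,3:4)
cmd <- genmulcmd.plain("m","m","prd")
cmd                    # the multiplication command, as a text string
eval(parse(text=cmd))  # creates prd, the square of m
prd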

How can we implement callbacks?

Recall our example matpow() call:

matpow(m,callback=cgraph,mindist=T)

Here we are specifying that there is a callback function (the default is NULL), and it is to be cgraph().  This happens to be built-in for the package, but it could be user-written.

The key issue is how data will be communicated between matpow() and the callback–in BOTH directions, i.e. we need the callback to write information back to matpow().   This is not easy in R, which as a functional language tries to avoid side effects.

Consider for example this:

> x <- sample(1:20,8)
> x
[1] 5 12 9 3 2 6 16 18
> sort(x)
[1] 2 3 5 6 9 12 16 18
> x
[1] 5 12 9 3 2 6 16 18

The point is that x didn’t change, i.e. we had no side effect to the argument x.  If we do want x to change, we must write

> x <- sort(x)

But environments are different.  The matpow() function maintains an environment ev, which it passes to the callback as an argument. Though ev is like an R list, and its components are accessed via the familiar $ operator, the difference is that the callback can change those components (a side effect), thus communicate information back to matpow().
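
Here is a tiny standalone illustration of that mechanism (base R, not matpow code):

f <- function(ev) {ev$x <- ev$x + 1}  # modifies a component of the environment
ev <- new.env()
ev$x <- 3
f(ev)
ev$x  # now 4; the change persists, unlike with an ordinary list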

For instance, look at this code from cgraph():

if (all(prd > 0)) { 
   ev$connected <- TRUE 
   ev$stop <- TRUE 
}

We’ve discovered that the network is connected, so we set the proper two components of ev.    Back in matpow(), the code will sense that ev$stop is TRUE, and cease the iteration process.  Since matpow() uses ev as its return value, the user then can see that the network is connected.

Other issues:

Again, keep in mind that matpow() will use whatever form of matrix multiplication you give it.  If you give it a parallel form of multiplication, then matpow()’s computation will be parallel.  Or if your R is configured with a multicore-capable BLAS, such as OpenBLAS, you again will be computing in parallel.

Though we have gotten good performance in this context with gputools, it should be noted that gpuMatMult() returns the product to the R caller.  This causes unnecessary copying, which on GPUs can sap speed.  Code that maintains the product on the GPU from iteration to iteration would be best.

Where to get matpow():

We plan submission to CRAN, but for now download it from here. Unzip the package, then run R CMD INSTALL on the directory that is created as a result of unzipping.

New freqparcoord Example

In my JSM talk this morning, I spoke about work done by Yingkang Xie and myself, on a novel approach to the parallel coordinates method of visualization.  I’ve made several posts to this blog in the past on freqparcoord, our implementation of our method.

My talk this morning used some recently-available NYC taxi data.  You may find the discoveries made on this data by freqparcoord of interest.  See my slides from the talk.
