Snowdoop/partools Update

I’ve put together an updated version of my partools package, which includes Snowdoop, an alternative to MapReduce.  You can download version 1.0.1 here.

To review:  The idea of Snowdoop is to do your own file chunking, rather than having something like Hadoop do it for you, and then to use ordinary R code to perform parallel operations on the chunks.  This avoids the need to deal with new constructs and with the complicated configuration issues of Hadoop and the R interfaces to it.
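
To make the idea concrete, here is a minimal sketch using only base R and the parallel package. It is not the partools API: the helper split_file() and the chunk naming scheme (the file name plus ".1", ".2", and so on) are assumptions made up for this illustration. The file is split once, each worker reads only its own chunk, and the partial results are combined with ordinary R code.

```r
library(parallel)

# Hypothetical helper (NOT a partools function): split a text file into
# nch chunk files named <path>.1, <path>.2, ...
split_file <- function(path, nch) {
  lines <- readLines(path)
  # splitIndices() divides 1..n into nch roughly equal contiguous blocks
  pieces <- splitIndices(length(lines), nch)
  chunk_files <- paste0(path, ".", seq_len(nch))
  for (i in seq_len(nch)) {
    writeLines(lines[pieces[[i]]], chunk_files[i])
  }
  chunk_files
}

# Toy data file: the integers 1..1000, one per line
data_file <- tempfile("snowdoop_demo")
writeLines(as.character(1:1000), data_file)

# Do the chunking ourselves, one chunk per worker
chunk_files <- split_file(data_file, 2)

# Ordinary R parallel code: worker i reads chunk i and returns a partial sum
cl <- makeCluster(2)
partial <- clusterApply(cl, chunk_files, function(f) {
  sum(as.numeric(readLines(f)))
})
stopCluster(cl)

# Combine the partial results on the manager
total <- Reduce(`+`, partial)
print(total)
```

The only "reduce" step here is a plain Reduce() call on the manager; for larger jobs the combining step would be whatever ordinary R code fits the problem.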

Major changes are as follows:

  • There is a k-means clustering example of Snowdoop in the examples/ directory.  Among other things, it illustrates the fact that with the Snowdoop approach, one gets a “caching” effect that Hadoop lacks, by default and with no extra effort.
  • There is a filesort() function, to sort a distributed file, keeping the result in memory in distributed form.  I don’t know yet how efficient it will be relative to Hadoop.
  • There are various new short utility functions, such as filesplit().
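
The “caching” effect mentioned in the k-means item can be sketched as follows. This is not the code in the examples/ directory, just a toy one-dimensional k-means written with base R and the parallel package; the variable name mychunk, the chunk file names, and the starting centroids are all assumptions for illustration. Each worker reads its chunk from disk once, keeps it in its own global environment, and reuses it on every iteration, so nothing is reread from disk between iterations.

```r
library(parallel)

set.seed(1)
# Toy data: two well-separated 1-D clusters, shuffled, written as two chunks
x <- sample(c(rnorm(200, mean = 0), rnorm(200, mean = 10)))
base <- tempfile("km_demo")
pieces <- splitIndices(length(x), 2)
chunk_files <- paste0(base, ".", 1:2)
for (i in 1:2) writeLines(as.character(x[pieces[[i]]]), chunk_files[i])

cl <- makeCluster(2)

# Each worker reads its chunk ONCE and stores it in its global environment
invisible(clusterApply(cl, chunk_files, function(f) {
  assign("mychunk", as.numeric(readLines(f)), envir = .GlobalEnv)
  NULL
}))

centers <- c(1, 5)  # arbitrary starting centroids (an assumption)
for (iter in 1:20) {
  # Workers use the cached chunk; only the current centroids are sent over
  stats <- clusterCall(cl, function(ctrs) {
    chunk <- get("mychunk", envir = .GlobalEnv)
    d <- abs(outer(chunk, ctrs, "-"))     # distance to each centroid
    lab <- apply(d, 1, which.min)         # index of the nearest centroid
    k <- length(ctrs)
    list(
      sums = vapply(seq_len(k), function(j) sum(chunk[lab == j]), numeric(1)),
      counts = tabulate(lab, nbins = k)
    )
  }, centers)

  # Manager combines per-chunk sums and counts into new centroids
  sums <- Reduce(`+`, lapply(stats, `[[`, "sums"))
  counts <- Reduce(`+`, lapply(stats, `[[`, "counts"))
  new_centers <- sums / counts
  converged <- max(abs(new_centers - centers)) < 1e-8
  centers <- new_centers
  if (converged) break
}
stopCluster(cl)

print(centers)
```

The sketch assumes no cluster ever becomes empty, which holds for this toy data but would need handling in real use.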

The package is not on GitHub yet, but Yihui should be happy that I converted the Snowdoop vignette to use knitr. 🙂

All of this is still preliminary, of course.  It remains to be seen to what scale this approach will work well.

4 thoughts on “Snowdoop/partools Update”

  1. Thanks! Okay, I downloaded the tarball and opened it (which I normally would not do). The vignette was still using Sweave. If the package is on GitHub, I can send you a pull request in two minutes to fix this issue, and you can accept it in 10 seconds. If it is a tarball, there will be more steps back and forth through emails, and my brain will start to hurt just by thinking of that 🙂

    1. Yeah, when I said knitr, I meant the general category of “R weaving.” 🙂 I do recommend knitr to my students, but I wound up stopping short of using it here.

      I agree with your analysis of the virtues of GitHub. If partools/Snowdoop grows to a larger, more complex state, it would definitely be worthwhile.
