July | 2017 | Mad (Data) Scientist

In my talk at useR! earlier this month, I emphasized the fact that a major impediment to obtaining good speed from parallelizing an algorithm is systems overhead of various kinds, including:

Contention for memory/network.
Bandwidth limits — CPU/memory, CPU/network, CPU/GPU.
Cache coherency problems.
Contention for I/O ports.
OS and/or R limits on number of sockets (network connections).
Serialization.

During the Q&A at the end, one person in the audience asked how R programmers without a computer science background might acquire this information. A similar question was posed here today by a reader on this blog, to which I replied,

That question was asked in my talk. I answered by saying that I have an introduction to such things in my book, but that this is not enough. One builds this knowledge in haphazard ways, e.g. by search terms like “cache miss” and “network latency” on the Web, and above all, by giving it careful thought and reasoning things out. (When Nobel laureate Richard Feynman was a kid, someone said in awe, “He fixes radios by thinking!”)

Join an R Users Group, if there is one in your area. (And if not, then start one!) Talk about these things with them (though if you follow my above advice, you may find you soon know more than they do).

The book I was referring to was Parallel Computing for Data Science: With Examples in R, C++ and CUDA (Chapman & Hall/CRC, The R Series, Jun 4, 2015.

I have decided that the topic of system overhead issues in parallel computation is important enough for me to place Chapter 2 on the Web, which I have now done. Enjoy. I’d be happy to answer your questions (of a general nature, not on your specific code).

We are continuing to add more features to our R parallel computation package, partools. Watch this space for news!

By the way, the useR! 2017 videos are now on the Web, including my talk on parallel computing.

I gave a talk titled, “Parallel Computation in R: What We Want, and How We (Might) Get It,” at last week’s useR! 2017 conference in Brussels. You can view my slides here, and I think the conference organizers said the videos would be placed online, not sure of that though.

The goal of the talk was to propose general design patterns for parallel computation in R, meaning general approaches that should be useful in many applications. I emphasized that this was just one person’s opinion, and expected the Spark fans to disagree with my view that Spark is not a very useful tool for useRs. Actually, several speakers in other talks were negative about Spark as well. One gentleman did try to defend Spark during the Q&A, but he talked to me afterward, and turned out not to be a huge Spark fan after all, largely just playing the devil’s advocate.

My examples of course involved partools, the package I’ve been developing for parallel computation in R. (Duncan Temple Lang’s PhD student Clark Fitzgerald is now involved in developing the package as well.) However, I noted that the same general principles could be applied with some other packages, such as ddR and multidplyr.

There were of course a number of excellent talks, many more than I could attend. Among the ones I did attend, I would mention a few in particular:

A talk by Nick Ulle, another student of Duncan’s, about his project to bring the LLVM compiler world to R. This is a tough challenge, but Nick is making impressive progress.
A talk by Kylie Bemis, a post doc at Northeastern University, and her matter file system R package, which does distributed file allocation in a clever, general manner.
I did not get to see Jim Harner’s talk about his R IDE, rc²but he demonstrated it for me on his laptop, very interesting.
Microsoft’s David Smith, one of the pioneers of the S/R world, gave an interesting “then and now” talk, listing questions that non-useRs would ask a few years ago when he suggested their switching to R — but which they no longer ask, demonstrating the huge increase in R usage in recent years, and its increase in power and usability.

My wife and I had fun exploring Brussels — one wrong decision in a subway station resulted in our ending up in front of the EU headquarters, an interesting error to make. And by an amazing stroke of good luck, the other summer conference at which I’ll be giving a talk, Small Area Estimation 2017, is to be held in Paris the very next week.

	Anonymous on Just How Good Is ChatGPT in Da…
	Quantile Regression… on Quantile Regression with Rando…
	Anonymous on Quantile Regression with Rando…
	Sina Özdemir on qeML Example: Nonparametric Qu…
	Anonymous on qeML Example: Nonparametric Qu…

Mad (Data) Scientist

Monthly Archives: July 2017

Understanding Overhead Issues in Parallel Computation

My Presentation at useR! 2017, Etc.

Musings, useful code etc. on R and data science