I just came back HotPar 2011 workshop in Berkeley. It was a very intense conference: presentations ran all day long and there were discussion assignments (complete with written reports) at lunch. Not much time for blogging or even tweeting.
What struck me most was the dominance of hybrid architectures– all the way from hardware to programming models. Questions ranged from, “How to best take advantage of various architectures?” to “How to unburden the programmer from making low level optimizations?” These are hard problems: there were many interesting proposals but no real breakthroughs. In the ideal world the programmer would just write a sequential algorithm (possibly with some parallelization hints) and the system would take care of parallelizing it appropriately for a particular architecture. As it is now, if the programmer doesn’t hand tune the algorithm to the architecture, the program won’t take full advantage of the computing capabilities of hardware. The details of the architecture leak into the programming model making it both hard to write and maintain as well as hard to port between different systems.
One area where there was some convergence of ideas was the way fragmented memory spaces were presented to the programmer. GPUs and hybrid multicores tend to have separate address spaces for each core, as in NUMA (Non-Uniform Memory Access) architectures. A component GPU performs its calculations using its own local memory. It may exchange data with other cores, or access global memory, but these are separate actions. In a way, a hybrid multicore resembles a cluster of computers connected through a network. Writing programs for such systems requires a lot of effort that is related more to the mechanics of communicating and moving data between cores and less to the problem domain. Hence the popularity of the idea of exposing to the programmer a uniform address space that combines main memory and local on-chip memories. That doesn’t mean that all areas of that global address space must be treated equally–that could nullify many of the performance advantages of hybrid chips. So, at some level, the programmer should be able to specify the distribution of data structures between cores. These are not new ideas. In distributed computing programmers have been using PGAS (Partitioned Global Address Space) for some time, and languages like Cray’s Chapel provide clean ways of separating algorithms from data partitioning (see my blog about HPCS). I talked to several speakers at HotPar, and surprisingly few of them were aware of the Chapel effort. And, as far as I know, there was nobody from Cray at the conference.
Of particular interest to me was the presentation by Hans Boehm, “How to Miscompile Programs with ‘Benign’ Data Races.” I should confess now that after publishing my blog post about benign data races I got in contact with Hans and he gave me a peek at this paper. Essentially, things are as bad or even worse than what Andy and I described in this blog. In particular Hans gave an example that showed how a redundant racy write (a store of the same value twice, from two different threads) may wreak havoc on your program. Read the paper for gory details.
All papers presented at HotPar are now available online at the conference web site.