An interesting preview of things to come in parallel programming: Today was day one of the AMD Fusion Summit in Bellevue. The future, at least according to AMD (and also to Jem Davies from ARM, who was invited for one of the keynote speeches–a partnership, perhaps, between the two big players?), is in heterogeneous chips.

Interestingly, the main argument for heterogeneous computing is not performance, but energy efficiency. And it’s not only about draining battery power in mobile computers. It’s the heat barrier. The more transistors you put on a square inch of silicon, the more heat they produce. Heat dissipation is now the hardest problem facing new generations of computers, from tiny mobiles all the way to server farms and supercomputers.

General-purpose CPUs are great at running some types of programs; GPUs (or their successors, GPGPUs: graphics processors used for general-purpose computing) are more efficient at others. Being able to move highly parallelizable and vectorizable computations off the CPU and onto a GPU results in more efficient use of energy. So the new series of AMD Fusion processors will combine x86 cores with high-throughput GPU cores.

Those are practical considerations, but there is also a vision of the future, which is hard to ignore. We are all used to multiprocessors by now. Even if your computer or cellphone doesn’t run highly parallelizable scientific applications, the operating system can easily keep two or four cores busy. It might even be able to take advantage of eight cores. But what about a hundred cores? If you can parallelize your program to such an extent that it saturates a hundred cores, you will most likely be using very specialized algorithms. It’s rather unlikely that you will be creating a hundred tasks, each running a different piece of code. Most of them will run the same code on different data sets, which is the very definition of SIMD (Single Instruction, Multiple Data). That’s where GPUs rule. Since you rarely have programs that are entirely SIMD or entirely MIMD (Multiple Instruction, Multiple Data), you want to be able to switch quickly between those modes. That requires some kind of tight integration, hence hybrid multiprocessors.
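To make the SIMD/MIMD distinction concrete, here is a minimal C++ sketch. The names `saxpy` and `run_tasks` are mine, purely for illustration, and nothing here is specific to AMD hardware: the first function applies the same operation to every element (the shape of work a GPU likes), the second runs unrelated pieces of code side by side (the shape of work CPU cores like).

```cpp
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// SIMD-style (data-parallel): the same operation is applied to every element.
// Each iteration is independent, so a vectorizing compiler or a GPU could
// process many elements in lockstep. Here it is just a plain loop.
void saxpy(float a, const std::vector<float>& x, std::vector<float>& y) {
    for (std::size_t i = 0; i < x.size() && i < y.size(); ++i) {
        y[i] = a * x[i] + y[i];
    }
}

// MIMD-style (task-parallel): each thread runs a different piece of code.
// Hypothetical helper: starts one thread per task and waits for all of them.
void run_tasks(const std::vector<std::function<void()>>& tasks) {
    std::vector<std::thread> workers;
    workers.reserve(tasks.size());
    for (const auto& task : tasks) {
        workers.emplace_back(task);  // the thread gets its own copy of the task
    }
    for (auto& worker : workers) {
        worker.join();
    }
}
```

A real program mixes both: a few heterogeneous tasks, some of which contain a big data-parallel loop like `saxpy` that you would want to hand to the GPU cores.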

However, in current hybrids, offloading a task from the CPU to a GPU often involves copying large amounts of data and code. So AMD decided to make another bet: shared memory. (I can’t help but point to one of my earlier blog posts about the future of concurrent programming, in which I made a similar bet.) The idea is that all cores, both the CPUs and the GPUs, will share the same address space. Of course, they will have their own caches, both private and shared.
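The difference between the two models can be caricatured in plain C++. This is only a toy simulation under my own assumptions: there is no real GPU involved, `scale_in_place` stands in for a kernel, and the function names are hypothetical. What it shows is where the copies go away once both sides can dereference the same pointer.

```cpp
#include <cstddef>
#include <vector>

// Stand-in for a GPU kernel: multiplies every element by `factor`.
void scale_in_place(float* data, std::size_t n, float factor) {
    for (std::size_t i = 0; i < n; ++i) {
        data[i] *= factor;
    }
}

// Separate memories (simulated): the input is copied into a "device" buffer,
// processed there, and copied back. Returns the number of bytes moved.
std::size_t offload_with_copies(std::vector<float>& host, float factor) {
    std::vector<float> device(host);  // host -> device copy
    scale_in_place(device.data(), device.size(), factor);
    host = device;                    // device -> host copy
    return 2 * host.size() * sizeof(float);
}

// Shared address space (simulated): the "kernel" receives the very same
// pointer the CPU code uses, so no bytes are moved.
std::size_t offload_shared(std::vector<float>& host, float factor) {
    scale_in_place(host.data(), host.size(), factor);
    return 0;
}
```

Both functions leave the same result in `host`; the only difference is the traffic, which in a real system is also paid for in energy.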

So what about coherency, you might ask? Say hello to weak consistency memory models, complete with atomic instructions (including floating-point compare and swap).
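For a taste of what programming against such a model looks like, here is a sketch in standard C++ rather than anything AMD-specific. It builds an atomic floating-point add out of a compare-and-swap loop with relaxed ordering. The helpers `atomic_add` and `sum_in_parallel` are hypothetical names of mine, and I'm assuming a C++11 compiler where `std::atomic<float>` is available.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Atomically adds `delta` to `target` with a compare-and-swap loop and
// returns the value that was there before the add. Relaxed ordering makes
// the update itself atomic but orders nothing else around it; publishing
// other data through this variable would need acquire/release ordering.
float atomic_add(std::atomic<float>& target, float delta) {
    float expected = target.load(std::memory_order_relaxed);
    // On failure, compare_exchange_weak writes the current value into
    // `expected`, so the next attempt recomputes expected + delta.
    while (!target.compare_exchange_weak(expected, expected + delta,
                                         std::memory_order_relaxed,
                                         std::memory_order_relaxed)) {
    }
    return expected;
}

// Hypothetical driver: `threads` threads each add 1.0f `per_thread` times.
float sum_in_parallel(int threads, int per_thread) {
    std::atomic<float> total{0.0f};
    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t) {
        workers.emplace_back([&total, per_thread] {
            for (int i = 0; i < per_thread; ++i) {
                atomic_add(total, 1.0f);
            }
        });
    }
    for (auto& worker : workers) {
        worker.join();
    }
    return total.load();
}
```

Note that the compare-and-swap compares bit patterns, not floating-point values, so it is the representation that has to match. That detail alone gives a hint of how low-level this kind of code gets.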

To be totally frank, this is the worst nightmare from a programmer’s point of view. And I don’t blame AMD for that. There just doesn’t seem to be a good programming model for those monsters. AMD adopted OpenCL, the open standard for programming heterogeneous processors, but it’s a pretty low-level framework. I’m afraid the life of a programmer might get much harder before it gets any better. For now, it’s back to the drawing board.