Thursday, June 16, 2011

Microsoft's C++ AMP (Accelerated Massive Parallelism)

Microsoft has just announced a new parallel programming technology called C++ AMP (which stands for Accelerated Massive Parallelism). It was unveiled in a keynote by Herb Sutter at AMD's Fusion Developer Summit 11. Video and slides from the keynote are available on MSDN Channel 9 here (Herb begins talking about C++ AMP around a half hour into the keynote).

The purpose of C++ AMP is to tackle the problem of heterogeneous computing. Herb argues for a single programming platform that can account for the differences in processing ability and memory models among CPUs, GPUs, and Infrastructure-as-a-Service (IaaS) cloud platforms. By basing such a platform on C++0x, it can provide the abstractions necessary for productivity while still allowing top performance and hand-tuning. Let's dive straight into the code with an example given during Herb's keynote:

void MatrixMult( float* C, const vector<float>& A, const vector<float>& B,
                 int M, int N, int W )
{
	// Wrap the flat buffers in 2D views: a is M x W, b is W x N,
	// and c is the M x N result, marked write-only.
	array_view<const float,2> a(M,W,A), b(W,N,B);
	array_view<writeonly<float>,2> c(M,N,C);

	// Run the kernel lambda once per element of c's 2D domain.
	parallel_for_each( c.grid, [=](index<2> idx) restrict(direct3d) {
		// Dot product of row idx.y of a with column idx.x of b.
		float sum = 0;
		for(int i = 0; i < a.x; i++)
			sum += a(idx.y, i) * b(i, idx.x);
		c[idx] = sum;
	} );
}
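
To make the calling convention concrete, here is a hypothetical call site for MatrixMult (my own sketch, not from the keynote), multiplying a 2x3 matrix by a 3x2 matrix with everything stored flat in row-major order:

#include <vector>
using std::vector;

int main()
{
	const int M = 2, N = 2, W = 3;
	vector<float> A = { 1, 2, 3,
	                    4, 5, 6 };        // M x W input
	vector<float> B = { 7, 8,
	                    9, 10,
	                    11, 12 };         // W x N input
	vector<float> C(M * N);               // M x N output, filled by the kernel
	MatrixMult(&C[0], A, B, M, N, W);     // MatrixMult is the function above
	return 0;
}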


This is a function that performs floating-point matrix multiplication. I'll try a bottom-up approach and go line by line to see what's new with C++ AMP. There is certainly nothing different from regular C++ in the function argument list (Disclaimer: my knowledge of C++ is minimal; school has caused me to stick with C). The next few lines, though, introduce a class called array_view. Herb described it in the keynote as an iterable array abstraction. We need this abstraction because we have no idea about the underlying memory model of the system our code is executing on. For example, if we are developing for an x86-64 CPU, then we have one coherent 64-bit address space. But if we are using a discrete GPU, then that GPU may have its own completely separate address space(s). With IaaS platforms, we may be dealing with incoherent memory as well. The array_view performs any memory copies or synchronization actions for us, so that our code is cleaner and can run on multiple platforms.
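
To illustrate the "view over existing memory" idea in isolation, here is a toy 2D view written in plain C++. This is only my sketch of the indexing concept, not the actual array_view API; the real class would add the cross-device copying and synchronization machinery on top of it:

#include <vector>

// A toy 2D "view" over a flat, row-major std::vector. It owns no memory; it
// only maps (row, col) coordinates onto the underlying storage, the same
// indexing idea the keynote's array_view applies to A, B, and C.
struct view2d {
	const std::vector<float>& data;
	int rows, cols;
	view2d(int r, int c, const std::vector<float>& v) : data(v), rows(r), cols(c) {}
	float operator()(int row, int col) const { return data[row * cols + col]; }
};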

Next up is the parallel_for_each loop. Surprisingly, this is not a language extension by Microsoft, but just a function. Microsoft's engineers determined that by using lambda functions (a new feature of C++0x) as objects that define compute kernels, they could avoid extending C++ with all sorts of data-parallel for loops. In this case, the lambda calculates the dot product of a row of a and a column of b for each point in the grid defined by the output array_view c. It appears that the lambda takes a 2D index as its argument, identifying which output element to compute and serving to traverse the input arrays.
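
As a plain C++0x illustration of that design decision (nothing AMP-specific here), a lambda is just an object you can hand to an ordinary function, which then invokes it per element:

#include <algorithm>
#include <vector>

// std::for_each is just an ordinary function; the lambda captures 'factor'
// by value and is invoked once per element. parallel_for_each applies the
// same idea, but invokes the kernel lambda once per index in the compute
// domain, potentially on a GPU.
void scale(std::vector<float>& v, float factor)
{
	std::for_each(v.begin(), v.end(), [=](float& x) { x *= factor; });
}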

There is one keyword I haven't explained yet: restrict. Herb says in the keynote that this is the only extension they had to make to C++0x to realize C++ AMP. restrict provides a compile-time check that code can execute on platforms of differing compute capability. For instance, restrict(direct3d) ensures that the annotated function will not attempt anything a DirectX 11-class GPU could not execute (such as throwing an exception or using function pointers). With this keyword, C++ AMP can have one body of code that runs on multiple platforms despite varying processor designs.
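
As a rough sketch of how I understand the keyword, using the keynote's restrict(direct3d) spelling (the exact rules and spelling could easily change before release), the compiler would accept a restricted function that sticks to GPU-friendly constructs and reject one that does not:

// Hypothetical sketch, following the keynote's spelling of the annotation.
float saxpy_elem(float a, float x, float y) restrict(direct3d)
{
	return a * x + y;          // fine: plain arithmetic runs on any target
}

float checked_div(float x, float y) restrict(direct3d)
{
	if (y == 0.0f) throw 1;    // compile-time error: exceptions are among the
	return x / y;              // constructs a DirectX 11-class GPU cannot run
}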

The ideas presented in this example alone make me excited about this platform. We only have to write whatever data-parallel code we need, and the runtime takes care of the details for us. This was the promise of OpenCL, but C++ AMP takes the concept further. There is no new language subset to account for the threading and memory models of GPUs. There is no need to worry about which compute node's memory space the data resides in. It also seems from this example that there is no need to size our workload for different thread and block counts as in CUDA; the runtime handles that too. Microsoft showed an impressive demo of an n-body collision simulation that could run on one core of a CPU, the on-die GPU of a Fusion APU, discrete GPUs, or even a discrete GPU and a Fusion GPU at the same time, all from one executable. They simply changed an option in a GUI dropdown list to choose the compute resource to use.

There are plenty of questions left to be answered, though. While Herb said in the keynote that developers will be free to performance-tune, we don't know how much control we will have over execution resources like thread blocks. We also don't know what else is available in the C++ AMP API. Additionally, while Microsoft promises that C++ AMP will be an open specification, the dependence on DirectCompute calls into question the prospect of quality implementations on non-Windows platforms. Hopefully the hands-on session given at the Fusion summit by Daniel Moth will be posted online soon, and we can see what details were uncovered there.

The announcement by Soma Somasegar notes that C++ AMP is expected to be part of the next Visual C++ and Visual Studio release. Herb announced in the keynote that AMD will release a compiler supporting C++ AMP for both Windows and, interestingly, non-Windows platforms. NVIDIA also announced its support, while noting that CUDA and Thrust are still the way to go ;). With the support of the discrete GPU vendors (note: nothing from Intel yet...) and the most popular development environment, C++ AMP has the potential to bring heterogeneous computing to a much larger developer market than CUDA or OpenCL can reach in their current form. I won't underestimate the ability of CUDA or OpenCL to catch up in ease of use by the time C++ AMP is released, though. In any case, I look forward to simpler GPU computing times ahead.
