Thursday, June 26, 2008

What does CUDA code look like?

A couple of posts ago I gave an overview of what CUDA is. Dr. Dobb's has a great series of articles introducing CUDA. In this post, I'm going to blatantly pinch a bit of their code to show what CUDA code looks like.

Let's say we have an array of floats, and we want to increment each element in the array by 1. The regular C code for this might look like so:

void incrementArrayOnCPU(float *a, int N)
{
    for (int i = 0; i < N; i++)
    {
        a[i] = a[i] + 1.f;
    }
}

With CUDA we can instead run many threads at once, each incrementing one element of the array.

The function which runs on the GPU is called the 'kernel'. The kernel can't access regular host memory, so any memory it uses needs to be specially allocated on the graphics card. Look at the following code (modified from the Dr. Dobb's sample for readability).

void IncrementArray()
{
    // Create an array of 102 floats in main (host) memory
    #define N 102
    float mainMemoryArr[N];
    for (int i = 0; i < N; i++)
        mainMemoryArr[i] = (float)i;

    // Allocate graphics card memory and copy the data into it
    float *gpuMemoryArr;
    cudaMalloc((void **) &gpuMemoryArr, sizeof(float) * N);
    cudaMemcpy(gpuMemoryArr, mainMemoryArr, sizeof(float) * N, cudaMemcpyHostToDevice);

    // Calculate how many 'blocks' of threads we'll run
    int blockSize = 4;
    int nBlocks = N / blockSize + (N % blockSize == 0 ? 0 : 1);

    // Run the GPU function
    incrementArrayOnGPU <<< nBlocks, blockSize >>> (gpuMemoryArr, N);

    // Copy the modified memory back to main memory
    cudaMemcpy(mainMemoryArr, gpuMemoryArr, sizeof(float) * N, cudaMemcpyDeviceToHost);

    // Release the graphics card memory
    cudaFree(gpuMemoryArr);
}

So the code above does the following:

  • Allocates memory on the graphics card.
  • Copies the array there.
  • Runs 26 'blocks' of 4 threads each (a total of 104 kernel executions).
  • Copies the modified memory back to main memory (which automatically blocks until the kernel threads have completed; see the sketch below).
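
That last point deserves a note: kernel launches are asynchronous, so control returns to the CPU immediately. If you want to wait for the kernel without doing a copy (to time it in isolation, say), the runtime lets you block explicitly. Here's a minimal sketch, reusing the launch from the listing above; cudaThreadSynchronize() is the runtime call for this in current CUDA releases:

// The launch returns to the CPU straight away; the GPU works in the background.
incrementArrayOnGPU <<< nBlocks, blockSize >>> (gpuMemoryArr, N);

// Block until all pending GPU work has finished.
// (cudaMemcpy does this implicitly before it copies.)
cudaThreadSynchronize();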

The 'kernel' function which runs on the graphics card looks like this:

__global__ void incrementArrayOnGPU(float *arr, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N)
        arr[idx] = arr[idx] + 1.f;
}

So it looks like C, apart from a few small differences:

  • The keyword __global__ identifies the function as a kernel function which will execute on the GPU.
  • There are built-in 'blockIdx', 'blockDim' and 'threadIdx' variables, which are used to calculate the 'index' of this thread. I'll not go into the details of exactly what they mean, but the worked layout below shows how they combine.
  • The 'if (idx < N)' guard is there because we round the block count up, so the kernel actually runs for slightly more threads than there are array elements.
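
To make the index arithmetic concrete, here is how the 104 threads from our launch map onto the 102-element array (this is just the listing's own numbers written out):

// idx = blockIdx.x * blockDim.x + threadIdx.x, with blockDim.x == 4:
//
//   block 0:  threads 0..3  ->  idx   0,   1,   2,   3
//   block 1:  threads 0..3  ->  idx   4,   5,   6,   7
//   ...
//   block 25: threads 0..3  ->  idx 100, 101, 102, 103
//
// idx 102 and 103 fail the 'if (idx < N)' test, so they do nothing.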

The 'blocks' construct seems like a bit of a nuisance. It exists because that's the way the hardware works. I believe you could specify a block size of 1 to simplify things, but then you'd have several of the processing elements sitting idle. You need to experiment with different block sizes to find which gives your program the best performance.
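
One way to run that experiment is with CUDA's event API, which timestamps work on the GPU itself. Here's a rough sketch, reusing gpuMemoryArr and N from the listing above (the particular block sizes tried, and the printf, are just illustration; you'd need stdio.h for the latter):

// Time the kernel at several block sizes using CUDA events.
for (int blockSize = 32; blockSize <= 256; blockSize *= 2)
{
    int nBlocks = N / blockSize + (N % blockSize == 0 ? 0 : 1);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    incrementArrayOnGPU <<< nBlocks, blockSize >>> (gpuMemoryArr, N);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);   // wait for the kernel to finish

    float ms = 0.f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("blockSize %d: %f ms\n", blockSize, ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}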

This was a trivial example of what can be done with CUDA. In reality, the kernel function would likely be doing quite a bit more work. Overall, if you're familiar with C, I think you can see that programming with CUDA is not too strange. There are just a couple of extra keywords and functions to call, and it's quite easy to understand.

If you want to learn more, start by reading the Dr. Dobb's articles and downloading the SDK from the CUDA Zone. The SDK comes with a load of samples for you to prod at. You can run in software emulation mode (nvcc's -deviceemu switch) if you don't have a swanky graphics card, but it's pretty slow.

3 comments:

Anonymous said...

The problem with CUDA IMHO is that it's a little too specific to the G80/92/T200 architecture. It doesn't map naturally to other architectures with different memory hierarchies and although it can be "made to work", something a bit more abstract is needed for a standard that is meant to be targeted to a wide range of parallel processors with varying memory hierarchies.

The other problem with CUDA is that it's just too damn hard to make it fast/optimal ;) This is more a problem with the complexity of the underlying hardware than the language itself, but the point remains that the language does nothing to prevent you from seriously shooting yourself in the foot, which is never a good thing. As it stands, even simple problems require highly non-linear optimization and machine-learning style optimization algorithms (http://www.crhc.uiuc.edu/IMPACT/ftp/conference/cgo-08-ryoo.pdf) to even approach 50% of peak performance. There are just too many variables that affect performance in highly non-linear ways for us mere mortals to get right ;)

Now the above is just a tough problem with parallel programming and complex architectures in general, but it raises the question of whether we need to be specifying algorithms in something a bit more general and tunable than CUDA, and then the backend/compilers can handle the heavy lifting as far as optimization and targeting to a specific memory model go.

Anyways there are certainly many interesting topics moving forward, and it will be fascinating to see what falls out of OpenCL and similar initiatives (DX compute shaders, etc).

Anonymous said...

CUDA without the CUDA :)

BSGP: Bulk-Synchronous GPU Programming

It should allow you to write code that will support different GPUs. Currently only Nvidia/CUDA is implemented. And it looks easier to deal with. Here is a link to a PDF on it: http://www.kunzhou.net/2008/BSGP.pdf

Anonymous said...

Keep up the good work.