A couple of posts ago I gave an overview of what CUDA is. Dr. Dobb's has a great series of articles introducing CUDA. In this post, I'm going to blatantly pinch a bit of their code to show what CUDA code looks like.
Let's say we have an array of floats, and we want to increment each element in the array by 1. The regular C code for this might look like so:
void incrementArrayOnCPU(float *a, int N)
{
    for (int i = 0; i < N; i++)
    {
        a[i] = a[i] + 1.f;
    }
}
With CUDA we can instead run many threads at once, each incrementing one element of the array.
The function which runs on the GPU is called the 'kernel'. The kernel can't access regular host memory, so any memory it uses needs to be specially allocated. Look at the following code (modified from the Dr. Dobb's sample for readability):
void IncrementArray()
{
    // Create an array of 102 floats in main (host) memory
    #define N 102
    float mainMemoryArr[N];
    for (int i = 0; i < N; i++)
        mainMemoryArr[i] = (float)i;

    // Allocate space in the graphics card's memory and copy the data over
    float *gpuMemoryArr;
    cudaMalloc((void **)&gpuMemoryArr, sizeof(float) * N);
    cudaMemcpy(gpuMemoryArr, mainMemoryArr, sizeof(float) * N, cudaMemcpyHostToDevice);

    // Calculate how many 'blocks' of threads we'll run
    int blockSize = 4;
    int nBlocks = N / blockSize + (N % blockSize == 0 ? 0 : 1);

    // Run the GPU function
    incrementArrayOnGPU<<<nBlocks, blockSize>>>(gpuMemoryArr, N);

    // Copy the modified memory back to main memory
    cudaMemcpy(mainMemoryArr, gpuMemoryArr, sizeof(float) * N, cudaMemcpyDeviceToHost);

    // Free the graphics card memory
    cudaFree(gpuMemoryArr);
}
So the code above does the following:
- Allocates memory on the graphics card.
- Copies the array into that memory.
- Runs 26 'blocks' of 4 threads each (a total of 104 kernel executions).
- Copies the modified memory back to main memory (this implicitly blocks until the kernel threads have completed; error checking is sketched below).
- Frees the graphics card memory.
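One thing the code above glosses over is error handling: the kernel launch and the cudaMemcpy calls can fail silently. Here's a minimal sketch of how you might check for that. The cudaGetLastError and cudaGetErrorString calls are real CUDA runtime functions; the CHECK macro is just my own convenience, not something from the Dr. Dobb's sample.
#include <stdio.h>
#include <cuda_runtime.h>

// Wrap runtime calls so failures get printed instead of ignored.
#define CHECK(call)                                               \
    do {                                                          \
        cudaError_t err = (call);                                 \
        if (err != cudaSuccess)                                   \
            printf("CUDA error: %s\n", cudaGetErrorString(err));  \
    } while (0)

// Usage, applied to the code above:
//   CHECK(cudaMemcpy(gpuMemoryArr, mainMemoryArr,
//                    sizeof(float) * N, cudaMemcpyHostToDevice));
//   incrementArrayOnGPU<<<nBlocks, blockSize>>>(gpuMemoryArr, N);
//   CHECK(cudaGetLastError()); // picks up kernel launch failures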
The 'kernel' function which runs on the graphics card looks like this:
__global__ void incrementArrayOnGPU(float *arr, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N)
        arr[idx] = arr[idx] + 1.f;
}
So it looks like C, apart from a couple of small differences:
- The keyword __global__ identifies the function as a kernel function which will execute on the GPU.
- There are built-in 'blockIdx', 'blockDim' and 'threadIdx' variables, which are used to calculate a unique 'index' for this thread. I'll not go into the full details of what they mean, but the sketch after this list shows how they combine.
- The 'if (idx < N)' check is there because the kernel actually runs on slightly more threads than there are array elements (104 threads for our 102 elements), so the spare threads must be stopped from writing past the end of the array.
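To make the indexing concrete, here's a plain C sketch (mine, not from the Dr. Dobb's sample) that runs the same idx calculation on the CPU, showing which array element each (block, thread) pair ends up responsible for:
#include <stdio.h>

int main(void)
{
    int blockSize = 4;  // plays the role of blockDim.x
    int nBlocks = 26;   // plays the role of gridDim.x

    for (int block = 0; block < nBlocks; block++)
        for (int thread = 0; thread < blockSize; thread++)
        {
            // Same formula as the kernel: block 0 covers 0-3, block 1 covers 4-7, ...
            int idx = block * blockSize + thread;
            printf("block %2d, thread %d -> idx %3d\n", block, thread, idx);
        }
    return 0;
}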
The 'blocks' construct seems like a bit of a nuisance. It exists because that's the way the hardware works. I believe you could specify a block size of 1 to simplify things, but in reality you'd leave several of the processing elements sitting idle. You need to experiment with different block sizes to find which gives your program the best performance.
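Here's a rough sketch of what that experimenting might look like. The cudaEvent calls are the standard CUDA runtime way of timing GPU work; the candidate sizes, and the reuse of gpuMemoryArr and N from the earlier code, are just my assumptions for illustration.
// Time the kernel at a few candidate block sizes using CUDA events.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

int candidates[] = { 32, 64, 128, 256 };
for (int i = 0; i < 4; i++)
{
    int blockSize = candidates[i];
    int nBlocks = (N + blockSize - 1) / blockSize; // round up, as before

    cudaEventRecord(start, 0);
    incrementArrayOnGPU<<<nBlocks, blockSize>>>(gpuMemoryArr, N);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop); // wait for the kernel to finish

    float ms = 0.f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("block size %3d: %f ms\n", blockSize, ms);
}

cudaEventDestroy(start);
cudaEventDestroy(stop);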
This was a trivial example of what can be done with CUDA; in reality the kernel function would likely be doing quite a bit more work. Overall, if you're familiar with C, I think you can see that programming with CUDA is not too strange. There are just a couple of extra keywords and functions to call, and it's quite easy to understand.
If you want to learn more, start by reading the Dr. Dobb's articles and downloading the SDK from the CUDA Zone. The SDK comes with a load of samples for you to prod at. You can run in software emulation mode if you don't have a swanky graphics card, but it's pretty slow.