Thursday, June 26, 2008

What does Cuda code look like?

A couple of posts ago I gave an overview of what Cuda is. Dr Dobbs has a great series of articles introducing Cuda. In this post, I'm going to blatantly pinch a bit of their code, to show what Cuda looks like.

Let's say we have an array of floats, and we want to increment each element in the array by 1. The regular C code for this might look like so:

void incrementArrayOnCPU(float *a, int N)
{
    for (int i = 0; i < N; i++)
    {
        a[i] = a[i] + 1.f;
    }
}

With Cuda we can instead run many threads at once, with each thread incrementing one element of the array.

The function which runs on the GPU is called the 'kernel'. The kernel can't access regular main memory, so any memory it uses needs to be specially allocated on the graphics card. Look at the following code (which I've modified from the Dr Dobbs sample for readability).

void IncrementArray()
{
    // Create an array of 102 floats in main memory
    #define N 102
    float mainMemoryArr[N];
    for (int i = 0; i < N; i++)
        mainMemoryArr[i] = (float)i;

    // Allocate graphics card memory, and copy the data to it
    float *gpuMemoryArr;
    cudaMalloc((void **) &gpuMemoryArr, sizeof(float)*N);
    cudaMemcpy(gpuMemoryArr, mainMemoryArr, sizeof(float)*N, cudaMemcpyHostToDevice);

    // Calculate how many 'blocks' of threads we'll run
    int blockSize = 4;
    int nBlocks = N/blockSize + (N%blockSize == 0 ? 0 : 1);

    // Run the GPU function
    incrementArrayOnGPU <<< nBlocks, blockSize >>> (gpuMemoryArr, N);

    // Copy the modified memory back to main memory
    cudaMemcpy(mainMemoryArr, gpuMemoryArr, sizeof(float)*N, cudaMemcpyDeviceToHost);

    // Free the graphics card memory
    cudaFree(gpuMemoryArr);
}

So the code above does the following:

  • Allocates memory on the graphics card.
  • Copies the array there.
  • Runs 26 'blocks' of 4 threads each (a total of 104 kernel executions, two more than the 102 we need).
  • Copies the modified memory back to main memory (which automatically blocks until the kernel threads complete).
  • Frees the graphics card memory.
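
One thing the snippet above skips is error checking. Each of the Cuda runtime calls returns a status code, so a more defensive version might wrap them like this (a sketch; the CUDA_CHECK macro is my own helper, not something from the SDK):

#include <stdio.h>

// Hypothetical helper - prints the error and bails out if a Cuda call fails
#define CUDA_CHECK(call)                                           \
    do {                                                           \
        cudaError_t err = (call);                                  \
        if (err != cudaSuccess)                                    \
        {                                                          \
            printf("Cuda error: %s\n", cudaGetErrorString(err));   \
            return;                                                \
        }                                                          \
    } while (0)

// Usage:
CUDA_CHECK(cudaMalloc((void **) &gpuMemoryArr, sizeof(float)*N));
CUDA_CHECK(cudaMemcpy(gpuMemoryArr, mainMemoryArr, sizeof(float)*N, cudaMemcpyHostToDevice));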

The 'kernel' function which runs on the graphics card looks like this:

__global__ void incrementArrayOnGPU(float *arr, int N)
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    if (idx < N)
        arr[idx] = arr[idx] + 1.f;
}

So it looks like C, apart from a couple of small differences and notes:

  • The keyword __global__ identifies the function as a kernel function which will execute on the GPU.
  • There are 'blockIdx', 'blockDim' and 'threadIdx' variables, which are used to calculate a unique 'index' for this thread: blockIdx.x is which block we're in, blockDim.x is how many threads each block holds, and threadIdx.x is which thread we are within our block (there's a worked example just below).
  • The 'if (idx < N)' check is there because the number of blocks is rounded up, so slightly more threads are launched than there are array elements; the extra threads must do nothing.

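To make the index calculation concrete, here's how it plays out with the numbers from our example (blockDim.x is 4, and we launch 26 blocks):

// block 0,  thread 0 -> idx = 0*4  + 0 = 0    (increments arr[0])
// block 0,  thread 3 -> idx = 0*4  + 3 = 3    (increments arr[3])
// block 25, thread 1 -> idx = 25*4 + 1 = 101  (increments arr[101])
// block 25, thread 2 -> idx = 25*4 + 2 = 102  (>= N, so does nothing)
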
The 'blocks' construct seems like a bit of a nuisance. It exists because that's the way the hardware works. I believe you could specify a block size of 1 to simplify things, but then many of the processing elements would sit idle. You need to experiment with different block sizes to find which gives your program the best performance.
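
If you want to experiment, the Cuda runtime has an event API which can time the kernel, so you can try a few block sizes and compare. Something like this rough sketch (error checking omitted):

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
incrementArrayOnGPU <<< nBlocks, blockSize >>> (gpuMemoryArr, N);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);   // wait for the kernel to finish

float ms = 0.f;
cudaEventElapsedTime(&ms, start, stop);
printf("blockSize %d took %f ms\n", blockSize, ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);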

This was a trivial example of what can be done with Cuda. In reality, the kernel function would likely be doing quite a bit more work. Overall, if you're familiar with C, I think you can see that programming with Cuda is not too strange. There are just a couple of extra keywords and functions to call, and it's quite easy to understand...

If you want to learn more, start by reading the Dr Dobbs articles, and downloading the SDK from the Cuda Zone. The SDK comes with a load of samples for you to prod at. You can run in software emulation mode if you don't have a swanky graphics card, but it's pretty slow.
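
For reference, .cu files are built with nvcc, the compiler which ships with the toolkit. If I remember the switches right, emulation mode is just an extra flag:

nvcc incrementArray.cu -o incrementArray             (build for the GPU)
nvcc -deviceemu incrementArray.cu -o incrementArray  (build for software emulation)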

Introduction to OpenMP - Easy multi-threading for C/C++

As well as various other compilers, Visual Studio 05 and 08 (Professional and Team System editions) support the OpenMP v2 standard. OpenMP is a collection of compiler directives and library functions which make it easier to create multi-threaded code. Here I'll give a quick taster of two of the OpenMP 'directives' - 'parallel' and 'for'.
The 'parallel' directive lets you mark a block of code which is executed by several threads at once:

#include <stdio.h>
#include <omp.h>

void ParallelTest()
{
    #pragma omp parallel num_threads(4)
    {
        int i = omp_get_thread_num();
        printf_s("ParallelTest: Hello from thread %d\n", i);
    }
}
This results in output like the following (the ordering will vary from run to run):
ParallelTest: Hello from thread 0
ParallelTest: Hello from thread 3
ParallelTest: Hello from thread 1
ParallelTest: Hello from thread 2

The ‘for’ directive allows a for loop to be split up and executed in several threads:
void ForTest()
{
    #pragma omp parallel for
    for (int j = 0; j <= 4; ++j)
    {
        int threadNum = omp_get_thread_num();
        printf_s("ForTest: Hello from thread %d - loop iteration %d\n", threadNum, j);
    }
}
Which results in output like:
ForTest: Hello from thread 0 - loop iteration 0
ForTest: Hello from thread 2 - loop iteration 3
ForTest: Hello from thread 1 - loop iteration 2
ForTest: Hello from thread 3 - loop iteration 4
ForTest: Hello from thread 0 - loop iteration 1

The thing which I find really cool about OpenMP is that it requires so little change to the code. Normally you'd need to write many lines of code to accomplish the results we see above, but with OpenMP it's often as little as a single line - which you can comment out to go back to a single-threaded approach. So if you find an inner loop which is hogging performance, and lends itself to parallelization, you can quickly make it multithreaded. It's supported on Xbox 360, but not across all game consoles. It's certainly useful for speeding up your tools.
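For example, say your profiler points at a loop which sums a big array. OpenMP's 'reduction' clause (also part of the v2 standard) keeps the parallelization down to one line - each thread sums its own chunk, and the partial sums are combined at the end. A quick sketch (I've not benchmarked this exact snippet):

#include <omp.h>

float SumArray(const float *arr, int n)
{
    float sum = 0.f;
    // Each thread accumulates its own private copy of 'sum';
    // OpenMP adds the copies together when the loop finishes
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; ++i)
    {
        sum += arr[i];
    }
    return sum;
}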
For a list of other compilers which support OpenMP, see here. Note that Visual Studio 2005 does support it, but isn’t listed, as they just list the latest version.
To enable OpenMP in your Visual Studio C++ project, you need to:
  • #include <omp.h> // Include OpenMP
  • Go to your ‘Project properties->C/C++->Language->OpenMP Support’, and set it to ‘Yes’.
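If you build from the command line rather than the IDE, the equivalent of that project setting is the /openmp switch:

cl /openmp ForTest.cpp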
To find out more about these and the other OpenMP directives, you can see a list on MSDN.
In my previous post, I introduced Cuda. The two technologies are very different - OpenMP requires minimal code change and runs on the CPU, while Cuda requires much more work, but lets you run on the GPU with its many more cores.

Tuesday, June 24, 2008

What is Cuda?

Graphics cards used to have a very fixed purpose - just graphics - but over the last eight or so years they've become more and more programmable, to the point that they're now becoming quite a bit like regular CPUs.
I recently took a look into Cuda. It's a technology from nVidia which allows you to run C code on the GPU of the GeForce 8 series and above.
The cool thing is that the top end nVidia card currently has 240 processing units, and you can put a couple in one machine. So for perhaps around a couple of grand, you can get a 480 core PC - kinda. You don't have to spend that much though; at the time of writing, a 128 processor GeForce 9800 is $200, and a 32 processor 8800 is only $70.

So the promise is that you can write bits of C code, and then execute the same code many times in parallel on the graphics card, beating the time it would take on the CPU.
The Cuda website lists almost a hundred projects which have used Cuda, with performance improvements ranging from 2x to 240x. The main reason for the big range is that Cuda's not good for everything. If the piece of code gets large, uses a lot of memory, contains branching and random/scattered memory access, then performance goes down. The highest performance is for short programs with little branching and little memory access.
Some problems are inherently more suited to Cuda-fying than others, but the programmer can still make a big difference by writing their code with the hardware in mind. Working at a low level like this, learning the hardware, and optimizing code to run as efficiently as possible on it, is very similar to the kind of skills used when coding for high performance on games systems, and in particular to PS3 Cell programming. So if you're looking to get into games, then impressing the programmers who interview you with your Cuda skills would be a great way to get your foot in the door!

Thursday, June 05, 2008

New Wall-e Trailer

We've been busy finishing off the game for the Wall-e movie. It's now been approved and is on its way to stores.

Here's a fun trailer which introduces some of the robots from the film.

Monday, June 02, 2008

Free Visual Studio Professional - for Students

Wow! Microsoft is giving away copies of Visual Studio Professional, Expression Studio, Windows Server Standard, XNA Game Studio 2.0 and more for free to students at certain universities.

Hmm, looks like I've found another use for the interns... :)