Since the time developers found out, that increasing the frequency of the processor in order to increase its total performance, would not be efficient (or would not function at all), the research has been shifted to increasing the number of processors on the dye and enhancing communication between them, which started the multiprocessors “movement”.

The newer trend was to design multiprocessors composed of processors, that differ in the architecture (heterogeneous MP), as we have seen in previous presentations/blog posts with the IBM Cell Processor, rather than having a homogeneous multiprocessor.

The latest trend is to offload certain applications or instructions from the CPU to the GPU, creating the GPGPU “movement”.


GPUs have a huge advantage over the CPU which is described in the following chart:

(Source: SuperComputing Tutorial 2007, Introduction by David Luebke)

GPUs are optimized for Floating Point Operations, as well as being optimized for parallelism. On typical Multiprocessor consists of 4 cores (as with most of the Intel i-Series multiprocessors), while a GPU is composed of tens of processors. This is because CPUs could be considered memory based processors while GPUs could be called ALU based, which allows the GPU to perform more operations in parallel resulting in the high GFLOPS (FLOPS = Floating Point Operation Per Second) compared to the CPU.

Basic structure of a typical CPU (left) and GPU (right) (Source: SuperComputing Tutorial 2007, Introduction by David Luebke)

What is CUDA?

CUDA is the missing link between the developer and the GPU. It was developed by NVIDIA and is implemented in all NVIDIA GPUs starting the G80s. Before having programming architectures dedicated to programming the GPU, a programmer had to choose either between dealing with the complex APIs of the GPUs or “tricking” it by passing a texture, that contains the data or the instructions, to the GPU and then receiving a the data in the form of a texture, which typically creates a lot of overhead.

CUDA processors are programmed in CUDA C, which is basically C/C++ with some CUDA extensions, which will be mentioned and explained later on. It is important to know that in early versions of CUDA the GPU had to be programmed in C, while the CPU could be programmed in either. This is important when writing code, since the developer must know at all times, whether the code is compiled for the CPU or the GPU. Starting from CUDA 3.0 more C++ features had been enabled for the code compiled for the GPU.

CUDA Structure and Terminology

Thread : The smallest unit executing an instruction.

Block : Contains several threads.

Warp : A group of threads physically executed in parallel (usually running the same application).

Grid : Contains several thread blocks.

Kernel : An application or program, that runs on the GPU.

Device : The GPU.

Host : The CPU.

(Source: NVIDIA CUDA Compute Unified Device Architecture Programming Guide Version 1.1)


Threads and Blocks need to have a unique ID in order to access them while writing code. This is important, since threads and blocks are the main components when it comes to writing efficient parallel code. Within each block a thread is uniquely accessible  through the Thread ID, which is an integer between 0 and n, where n is the total number of threads within the same block. In a more complex approach – when dealing with 2D or 3D blocks – the Thread ID is calculated as a function, rather than having a fixed integer, which represents the number of the thread inside the block. Inside a 2D Block threads are positioned at (0,0), (0,1), … (n-1, n-1), as shown in the previous figure. Since the Thread ID must be of type uint3 – which will be considered a normal integer for now –  something like (3,2) for the Thread ID is not applicable.

The function for getting the Thread ID in a 2D Block is:

x + y * Dimx,

where ‘x’ and ‘y’ are the x and y indices of the thread and Dimx the x dimension of the block (in the case of the figure Dimx would be equal to 5).

In a 3D block the same applies with the difference of dealing with one more dimension which results in the function:

x + y * Dimx + z * Dimx * Dimy,

where x, y and z are the indices (same as for the 2D block) and Dimx and Dimy are the x, y dimensions of the block.

When addressing a block within a grid the same rules of addressing a thread in a block apply. We do not deal with 3D grids in CUDA, though.

Memory Model

Each thread has a private register set – which are accessible in 0 clock cycles – and local memory, which is the fastest accessible memory from the thread. Each block has  a shared memory, which can be written to/read from by all threads in the same block (hence, the name shared memory). The shared memory is the best and fastest way of communication for the threads. It is expensive to store data in the shared memory though, due to its relatively small size, therefore only variables which are needed by all the threads should be stored in it. Each grid has a global memory, which is accessible from all the blocks inside it (and therefore also the threads inside all the blocks). On a higher level each processor has its own cache and the whole device (the GPU) has a DRAM.

Of course the higher the level of the memory the bigger its size and proportionally to that is also the cost (in time units or clock cycles) to access it (be it read or write – if allowed). This is due to the increased distance from the unit trying to access the memory to the memory itself, and the latency from the memory.


It is always the most beneficial to for the developer to run a block on the same processor. Consider an example, where we have a 5 x 5 block being executed, where 25 processors are idle (this is just an example, it is unlikely -almost impossible – to have such a big number of idle processors), which means that theoretically each thread could be run by a single processor. Since the running program is a program optimised for parallelism (or else it would not make much sense to run it on a GPU and it would make even less sense to share it among 25 threads), there is a lot of data that must be shared between the threads. And since the threads run on different processors this data cannot be put in the shared memory of the block and would have to go to a higher memory (the DRAM in the worst case), and each thread will have to access this memory to retrieve the data need for operation. And just a simple comparison: accessing the shared memory as well as the global memory costs about 4 clock cycles. Accessing the global memory consumes about 400 – 600 more clock cycles for memory latency.

Hence, a block, that is divided onto several processors in most of the cases would result in a better execution time BUT the time needed to fetch data from the memory and the related idle time would result in a much worse performance than when running the block on the same processor.

What actually happens is that the processor in charge of a certain block divides the threads into warps, which are then executed in parallel. Each warp in the block is given a share of the execution time until the whole kernel is executed.

CUDA Extensions

There are four main extensions (excluding custom libraries) done to C/C++ that create the CUDA C.

1. Function Type Qualifiers

These qualifiers are written while declaring a function to decide whether the function is executed and called from the device or the host. There are three qualifiers:

  • __device__: the function is called and executed on the device
  • __shared__: the function is called from the host and executed on the device
  • __host__: the function is called and executed on the host

An function with a CUDA qualifier would have this form:

__device__ function_name (parameters),

where of course “__device__” can be replaced by any of the function qualifiers.

The two qualifiers __host__ and __device__ could be combined to create a function, that is compiled for both the host and the device.

2. Variable Type Qualifiers

Similar to the function qualifiers, the variable qualifiers decide the lifetime of a variable and in which memory it is stored.

  • __device__: the variable has the lifetime of the application and is stored in the global memory, which makes it accessible from all blocks within the grid
  • __shared__: the variable has the lifetime of the block and is stored in the shared memory, hence accessible from all threads within the block

3. Execution Configuration

When calling a global function the dimensions of the grid and the block in which this function is to be executed must be specified. This is called the execution configuration.

<<GridDimension, BlockDimension, Ns, S>>

GridDimension: Dimension of the Grid.

BlockDimension: Dimension of the Block.

Ns (optional): how much memory to allocate for this function.

S (optional): specifies the stream associated with this function.

4. Built – in Variables

There are four variables, that have been introduced to extend the C/C++ language. All of them are mainly associated with Thread/Block addressing:

  • gridDim -> specifies the dimension of the grid. Type: dim3
  • blockDim -> specifies the dimension of the block. Type: dim3
  • BlockIdx -> the unique Block ID. Type: uint3
  • ThreadIdx -> the unique Thread ID. Type: uint3


The Compiler used to compile CUDA C code is a PathScale Open64 Compiler also known as NVCC. Open64 is an open source compiler developed under the GNU License. PathScale Open64 has been further developed by the company PathScale specifically for x86-64 and Itanium processors, which is the main reason why it is optimised for parallelism.

Two important functions of the compiler are #pragma unroll, and _use_fast_math, which both result in better performance when used in coding.

#pragma unroll x: when written before a loop the following loops are then unrolled depending on the optional number ‘x’ following the #pragma unroll. There are three cases.

  1. x = 1, the following loop is not unrolled.
  2. x = n, where 1 < n, the loop is unrolled for n loops
  3. no x, the whole loop is unrolled

-use_fast_math is useful whenever the developer cares more about the performance of the written code rather than its accuracy. -use_fast_math enhances the performances by doing faster maths calculation by decreasing the accuracy of the results. Functions in -use_fast_math mode usually start with a double underscore: __mathfunction(); .

Built-in Types

The built-in types are:

char1, uchar1, char2, uchar2, char3, uchar3, char4, uchar4, short1, ushort1, short2, ushort2, short3, ushort3, short4, ushort4,  int1, uint1, int2, uint2, int3, uint3, int4, uint4, long1, ulong1, long2, ulong2, long3, ulong3, long4, ulong4, float1, float2, float3, float4,

where the first part indicates the type as we know it (in C, C++ or java), char, int, float, … , and the second part stands for the size of this type. So, for instance considering the type float4, it consists of a structure with 4 substructures which are the actual values (the actual floats). Each of those floats are accessible through the variables x,y,z and w.

As an example, we declare a variable of type float4:

float4 make_float4 (x, y, z, w);

to access x, y, z or w we write variable_name.x, variable_name.y, variable_name.z or variable_name.w, where variable_name should be replaced by the actual name of the variable, of course.

Example Code (serial -> parallel)

The following is an example of a simple code in CUDA C. The function takes as input an integer n, a float a and two float pointers x and y, and then stores in each cell in y the value a*x + y. Shown is also how this function is called.

(Source: SuperComputing Tutorial 2009 Introduction by David Luebke)

To convert this code to parallel we need to work on two things. First, we need to divide the work on the threads, so each thread could execute one of the y[i] = a*x[i] + y[i] operations. To do so each thread needs to be uniquely addressed and given a unique instruction to execute. The second thing is the function call: we need to add the execution configuration (mentioned earlier) to the function to know how many blocks and threads to allocate for this function. The following figure shows the code in implemented in parallel.

(Source: SuperComputing Tutorial 2009, Introduction by David Luebke)

In the parallel implementation of saxpy (Scalar Alpha X Plus Y) int i serves as the ID for the threads and it also corresponds to the location in the array to be read from. So, thread i reads of the i-th position of array x and y. As for the execution configuration, the function is given nblocks for the number of blocks, where nblocks is an integer depending on n, and the number of threads within each block is fixed to 256 threads per block.


To execute a single thread, what happens is that each thread reads the instructions off the memory (global or shared, depending on the instruction) and then the actual execution happens, and then the result is written back to the memory. Which means, that there is constant reading and writing from and to the memory. This is why the throughput heavily depends on the bandwidth between the threads and the memories and also the bandwidth between the CPU and the GPU. Another important aspect, that always blocks the performance is the memory latency, and as discussed before the memory latency increased the higher the memory is (global memory has a higher latency than shared memory), this is why it is smart to avoid accessing higher level memories whenever possible. As said before, accessing the shared and the global memory takes 4 clock cycles, while 400 – 600 more clock cycles are consumed due to memory latency. Sometimes it is even more beneficial to recompute data rather than caching it due to the memory latency.

Another thing bad for the total performance of the code are “if, while, do, for and switch” statements, that is because they diverge the execution path for the warps. A diverged warps are no longer executed in parallel, they must be serialised, adding to that after serialisation the diverged warp must be synchronised, which adds more instructions to be executed. Possible solutions for these statements are unrolling (as discussed earlier) and using branch prediction instead of ‘if’ and ‘switch’ statements whenever possible.

Coalesced Memory

Another important thing to take care of while writing code is coalesced and uncoalesced memory. A coalesced memory is one, where each thread reads off the address that corresponds to it. So if the base address of a certain block is n, then any thread i inside this block must access the address at: (n + i) * type_of_read, where type_of_read must be 1, 4 or multiples of 16. Any scheme other than that results in an uncoalesced memory. The following figure shows a coalesced memory.

(Source: NVIDIA CUDA Compute Unified Device Architecture Programming Guide Version 1.1)

Both scenarios are examples of coalesced memories, whereas the right part of the figure is an example of a coalesced memory, where some threads do not participate, which results in a (relatively) insignificant worse performance.

An example of a uncoalesced memory is shown in the next figure.

(Source: NVIDIA CUDA Compute Unified Device Architecture Programming Guide Version 1.1)

In the left example Thread 3 and Thread 4 are reading off of the wrong addresses, while in the right example all threads are shifted by 4 bytes, since the base address is 128, so Thread 0 should be reading off of address 128.

To have a concrete and strong argument why coalesced memory outperforms uncoalesced memory we take a look at the results of reading 12M of floats in three ways:

1. coalesced: 356 us

2. coalesced (some threads do not participate): 357 us -> that is why the decrease in performance was labelled as “(relatively) insignificant”

3. uncoalesced: 3494 us


The shared memory is the fastest memory in a CUDA processor following the local memories of the threads. This is a result of dividing the shared memory into banks, which allows threads to access the shared memory simultaneously. An occurring problem due to banks is bank conflicts which is the result of either two or more threads trying to access the same bank, or accessing an element, that has not equal to 32 bytes.

The first problem causing bank conflicts is obvious, a bank can only serve a single thread at a time, when two or more threads try gain access, one thread is served and the rest of the threads are serialised. As for the second problem consider the following: an array of ELEMENTS is stored in the shared memory, where the size of ELEMENTS is 8 bytes, which means, that 4 ELEMENTS are stored in one bank. Assuming thread i is accessing ELEMENTS[i], which is stored in bank number j, and thread i+1 is accessing ELEMENTS[i+1]. Typically, when dealing with a 32 byte element ELEMENTS[i+1] would be stored in bank number j+1. But since we said, that ELEMENTS is of size 8 bytes ELEMENTS[i+1] is stored in the same bank, which is bank number j just as ELEMENTS[i]. So, at the end both thread i and thread i+1 try to access the same bank, although for different elements.

CUDA Libraries

CUBLAS: CUDA accelerated Basic Linear  Algebra Subprograms

CUFFT: CUDA Fast Fourier Transform

MAGMA: Matrix Algebra on GPU Multicore  Architectures

CULA: implementation of LAPACK interface

CUDA Tools

CUDA – gdb Debugger

CUDA – Memory Checker

CUDA – Visual Profiler


More than C/C++

In order to allow a wider spectrum of developers to code in CUDA, the range of languages, that can be compiled to CUDA has been broadened. There exist converters from Fortran to CUDA (NOAA F2C-ACC), Python to CUDA (PyCUDA) and Java to CUDA (jaCUDA). Unfortunately code generated using these converters is not fully optimised, which means, that some manual optimisation is still needed to generate optimised CUDA code, when dealing with CUDA converters.