May 2010



Introduction:-
The aim of technology has always been to solve the problems facing humanity, and not only to
solve them, but to solve them efficiently, in the least possible time. Over time, problems
become more complex to solve (i.e., better hardware is needed to solve them quickly).
In the field of Computer Science, the hardware component that determines how fast a
problem can be solved is the processor.
In the past, hardware corporations like Intel used to introduce a faster processor each year. At
some point they figured out that they would not be able to keep introducing uni-core processors
with more transistors; and since the speed of a processor was taken to be directly proportional to the
number of transistors, it was no longer feasible to produce faster uni-core processors.
Hardware corporations found that the way to introduce faster processors is to introduce multi-core
processors. Introducing multi-core processors exposed a problem: the former
programming models do not support programming for multiple cores. So, new programming
models and languages were introduced so that programmers could exploit the presence of
multiple cores in one processor. This article demonstrates one of these new models,
Cilk++.


Cilk++ VS C++:-


Cilk++ development started in 1994 in one of the MIT labs. Cilk++ is based on C++, so writing Cilk++
is exactly the same as writing C++, with the added ability to program parallel applications through
a few new keywords introduced specifically in Cilk++ to enable parallel programming.
The new keywords are: cilk, spawn, sync, inlet and abort. The “cilk” keyword is added to the
header of any function to notify the compiler that this function may include parallel
programming (i.e., one of the other Cilk-specific keywords might appear in this function).
Below is a detailed description of the role of each of the other keywords.

Spawn:

 

The “spawn” keyword can be added to the start of a line of code to notify the processor that
it may execute this line on a separate core of the processor if one is available. This line of code
might be a call to a function, or even to a function that itself calls several others, giving the ability
to run a whole piece of code in parallel rather than a single statement.

Sync:

The “sync” keyword is closely related to the “spawn” keyword. After one or more lines of code
have been “spawned”, the “sync” keyword can be placed afterwards to notify the processor that
it should stop executing the code until all spawned processes finish. The advantage of
“sync” is to synchronize parallel processes in order to ensure safe code (i.e., no problem
will take place as a result of a process demanding resources that are being used by another
process running in parallel). Below is a figure showing the functionality of “sync”.

Inlet:

The “inlet” keyword is used in more advanced programming. Whenever a parent process spawns
other processes, these child processes are expected to return results to the parent process.
Inlet is used to make sure that no write conflicts take place when multiple child processes write
into a variable of the same parent process. An “inlet” is like a function that
can be written inside another function; it controls how the return values of child processes are
written back to the parent and ensures that only one child process at a time can update the return
value of the parent. The code below shows how “inlet” can be used in a
function that calculates Fibonacci numbers.

cilk int fib (int n)
{
    int x = 0;

    /* The inlet collects the results returned by the spawned children;
       only one child at a time is allowed to update x. */
    inlet void summer (int result)
    {
        x += result;
        return;
    }

    if (n < 2) return n;
    else {
        summer(spawn fib (n-1));
        summer(spawn fib (n-2));
        sync;               /* wait for both children before using x */
        return (x);
    }
}

Abort:

The “abort” keyword is also used in more advanced programming. Sometimes work is spawned
across multiple processes, and at some point the result is reached and no further processing of
the parallel processes is needed, so the abort keyword is used to stop the unneeded processing of
these other processes. For example, if a search algorithm works by having multiple
processes working in parallel on different parts of an array of integers, then as soon as one of the
processes finds the element the algorithm is searching for, it notifies the processor to stop the
execution of the rest of the processes so as not to waste time and resources.
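
As an illustration (not taken from the original article), a parallel search written in the same MIT Cilk style as the Fibonacci example above could issue “abort” from inside an inlet as soon as one child reports a hit; the array A and the function names here are assumptions:

cilk int search (int *A, int lo, int hi, int target)
{
    int found = 0;

    inlet void check (int result)
    {
        if (result) {
            found = 1;
            abort;                       /* stop the remaining spawned searches */
        }
    }

    if (hi - lo == 1)
        return (A[lo] == target);
    else {
        int mid = (lo + hi) / 2;
        check(spawn search (A, lo, mid, target));   /* search both halves in parallel */
        check(spawn search (A, mid, hi, target));
        sync;
        return found;
    }
}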

Parallelism:-

 

 

 

 

 

 

Parallelism is an attribute whose value can be calculated to show how much an algorithm benefits
from Cilk++ compared to running the same algorithm in the ordinary sequential way. To
calculate the value of parallelism for an algorithm, both its work and its depth have to be measured.
Work is the total number of operations performed by the algorithm.
Depth is the length of the longest sequential dependence chain. Parallelism is the
value of the “work” divided by the value of the “depth”. The figure below shows an example of parallelism.
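
As an illustrative calculation (not taken from the article's figure): if an algorithm performs a total work of 1,000,000 operations and its longest sequential dependence chain (its depth) is 1,000 operations long, then its parallelism is work / depth = 1,000,000 / 1,000 = 1,000, meaning that on the order of a thousand cores could in principle be kept busy.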

Cilk work-stealing scheduler:-

 

The Cilk work-stealing scheduler is one of the most beneficial features of Cilk. Its aim is
to ensure maximum utilization of the processor's capabilities so that programs execute in a shorter
time. For example, consider a processor with X cores and a program running X processes on
its X cores, all of which must finish before the result can be output. If one of these processes
finishes before the others, it starts taking tasks from the bottom of another process's stack. The
figure below shows two processes running in parallel on two different cores, where one of
them finishes before the other and then uses the work-stealing feature to help the unfinished
process with its work.

Practical examples:-

Fibonacci:

Calculating Fibonacci numbers is a good example of an operation that runs in much less
time when processed in parallel on multiple cores of a processor. As shown below in the code and
the figure, the recursive calls of the function are processed in parallel, which results in executing the
function in much less time.

Without parallelization:

 

int fib (int n)
{
    if (n <= 2) return 1;
    else
    {
        int x, y;
        x = fib (n-1);
        y = fib (n-2);
        return x + y;
    }
}

With parallelization:

cilk int fib (int n)
{
    if (n <= 2) return 1;
    else
    {
        int x, y;
        x = spawn fib (n-1);    /* may run on another core */
        y = spawn fib (n-2);
        sync;                   /* wait for both spawned calls */
        return x + y;
    }
}

Quick Sort:

Quick sort is regarded as one of the most efficient algorithms for sorting collections of data
according to a specific sorting criterion. Quick sort is one of those algorithms that run
in much less time when implemented using Cilk++ to run on multi-core processors. The idea behind
quick sort is to take the first element of the data, put it in its correct position, and then repeat the same
thing for the items less than it and the items greater than it until all the elements are sorted.
Below is the algorithm, along with a figure demonstrating how a multi-core processor is utilized
to run it when implemented using Cilk++.

 

cilk void QuickSort(int low, int high)
{
    if (low < high)
    {
        int p = Partition(low, high);
        spawn QuickSort(low, p - 1);    /* sort the two parts in parallel */
        spawn QuickSort(p + 1, high);
        sync;
    }
}

int Partition(int low, int high)
{
    int i, j, p, tmp;
    int pivot = S[low];                 /* S is the array being sorted */

    j = low;
    for (i = low + 1; i <= high; i++)
        if (S[i] < pivot)
        {
            j++;
            tmp = S[i]; S[i] = S[j]; S[j] = tmp;    /* exchange S[i] and S[j] */
        }
    p = j;
    tmp = S[low]; S[low] = S[p]; S[p] = tmp;        /* exchange S[low] and S[p] */
    return p;
}

Cilk++ VS OpenMP:-

OpenMP is another programming model for parallel programming. Below is a diagram showing
the difference in execution time between Cilk++ and OpenMP when executing the quick sort code,
for both the adaptive and parallel models.

Cilk++:-

Cilk++ is available for the Windows and Linux operating systems and can be downloaded from the
following link: http://software.intel.com/en-us/articles/download-intel-cilksdk/
Cilk++ for Mac was a project under development that was never completed and is no longer under
development.


 

 


History

OpenCL was initially developed by Apple, which holds trademark rights, in collaboration with technical teams at AMD, IBM, Intel, Nvidia, Ericsson, Nokia, Texas Instruments, and Motorola. Apple submitted this initial proposal to the Khronos Group. On June 16, 2008 the Khronos Compute Working Group was formed with representatives from CPU, GPU, embedded-processor, and software companies. This group worked for five months to finish the technical details, and on December 8, 2008 OpenCL was released.

Introduction

If you’ve never heard of OpenCL, you need to stop whatever you’re doing and read ahead. We all know that multi-threaded applications have not been as abundant as we had hoped. Of those precious few applications that are multi-core aware, few leverage the full potential of two cores. That is why OpenCL was developed: to standardize parallel programming and execution.

The OpenCL architecture shares a range of computational interfaces with two competitors: Nvidia’s Compute Unified Device Architecture (CUDA) and Microsoft’s DirectCompute.

What is OpenCL?

OpenCL is a framework for writing programs that execute across heterogeneous platforms consisting of CPUs, GPUs, Cell, DSPs and other processors. It includes a language (based on C99) for writing kernels (functions that execute on OpenCL devices), plus APIs that are used to define and then control the platforms.

The Khronos Group hopes that OpenCL will do for multi-core what OpenGL did for graphics and what OpenAL is beginning to do for audio, and that is exactly what OpenCL achieved: it improved speed for a wide spectrum of applications, from gaming and entertainment to scientific and medical software.

The following link points to a video which shows to what extent OpenCL speeds up the execution of an application.

OpenCL Demo

How does OpenCL work?

OpenCL includes a language for writing compute kernels and APIs for creating and managing these kernels. The kernels are compiled by a runtime compiler, which builds them on the fly during host-application execution for the targeted device. This enables the host application to take advantage of all the compute devices in the system.
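
A minimal host-side sketch of this on-the-fly compilation (not from the original article): program_source is assumed to hold the kernel text (see the vector addition example later on), and "vec_add" is a placeholder kernel name.

#include <CL/cl.h>

static cl_kernel build_kernel(cl_context context, cl_device_id device,
                              const char *program_source)
{
    cl_int err;
    cl_program program = clCreateProgramWithSource(context, 1,
                                                   &program_source, NULL, &err);
    err = clBuildProgram(program, 1, &device, NULL, NULL, NULL);  /* runtime compilation for the target device */
    return clCreateKernel(program, "vec_add", &err);              /* extract the compiled kernel */
}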

Platform Model

One of OpenCL’s strengths is that this model does not specify exactly what hardware constitutes a compute device. Thus, a compute device may be a GPU, or a CPU.

OpenCL sees today’s heterogeneous world through the lens of an abstract, hierarchical platform model. In this model, a host coordinates execution, transferring data to and from an array of Compute Devices. Each Compute Device is composed of an array of Compute Units, and each Compute Unit is composed of an array of Processing Elements.

Opencl Anatomy

The platform layer API gives the developer access to routines that query for the number and types of devices in the system. The developer can then select and initialize the necessary compute devices to properly run their work load. It is at this layer that compute contexts and work-queues for job submission and data transfer requests are created.

The runtime API allows the developer to queue up compute kernels for execution and is responsible for managing the compute and memory resources in the OpenCL system.

OpenCL Memory Model
OpenCL defines four  memory spaces: private, local, constant and global.

Private memory is memory that can only be used by a single compute unit. This is similar to registers in a single compute unit or a single CPU core.

Local memory is memory that can be used by the work-items in a work-group. This is similar to the local data share that is available on the current generation of AMD GPUs.

Constant memory is memory that can be used to store constant data for read-only access by all of the compute units in the device during the execution of a kernel. The host processor is responsible for allocating and initializing the memory objects that reside in this memory space. This is similar to the constant caches that are available on AMD GPUs.

Global memory is memory that can be used by all the compute units on the device. This is similar to the off-chip GPU memory that is available on AMD GPUs.
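
As a small illustrative sketch (not from the original article) of how these four spaces appear in OpenCL C kernel code, the qualifiers __global, __constant and __local are part of the language, and ordinary automatic variables live in private memory:

__kernel void scale(__global float *data,      /* global memory: visible to all compute units  */
                    __constant float *coeff,   /* constant memory: read-only during the kernel */
                    __local float *scratch)    /* local memory: shared within one work-group   */
{
    int gid = get_global_id(0);                /* private memory: one copy per work-item       */
    int lid = get_local_id(0);

    scratch[lid] = data[gid];                  /* stage the value in fast local memory         */
    barrier(CLK_LOCAL_MEM_FENCE);              /* make it visible to the whole work-group      */
    data[gid] = scratch[lid] * coeff[0];
}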

Terminology


The Execution Model

There are three basic components of executable code in OpenCL: kernels, programs, and application command queues.

A compute kernel is the basic unit of executable code and can be thought of as similar to a C function. Each instance of a kernel running on one point of the execution domain is called a work-item, and each work-item has a unique ID.

Execution of such kernels can proceed either in-order or out-of-order depending on the parameters passed to the system when queuing up the kernel for execution. Events are provided so that the developer can check on the status of outstanding kernel execution requests and other runtime requests.

In terms of organization, the execution domain of a kernel is defined by an N-dimensional computation domain. This lets the system know how large of a problem the user would like the kernel to be applied to.

Each element in the execution domain is a work-item and OpenCL provides the ability to group together work-items into work-groups for synchronization and communication purposes.

Executing Kernels, Work-Groups and Work-Items

A program is a collection of kernels and other functions; in other words, a group of kernels is called a program.

Application command queues hold kernels that are enqueued in order and then executed either in order or out of order.

Since OpenCL is meant to target not only GPUs but also other accelerators, such as multi-core CPUs, flexibility is given in the type of compute kernel that is specified. Compute kernels can be thought of either as data-parallel, which is well-matched to the architecture of GPUs, or task-parallel, which is well-matched to the architecture of CPUs.

Data parallelism:
focuses on distributing the data across different parallel computing nodes.

To achieve data parallelism in OpenCL:

1. Define an N-dimensional computation domain.

  • Each independent element of execution in the N-D domain is called a work-item.
  • The N-D domain defines the total number of work-items that execute in parallel (the global work size).

2. Work-items can be grouped together into work-groups.

  • Work-items in a group can communicate with each other.
  • Execution can be synchronized among the work-items in a group to coordinate memory access.

3. Execute multiple work-groups in parallel.

example of data parallelism in OpenCL:
Data parallelism
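
The figure is not reproduced here; as a host-side sketch of the same idea (not from the original article; queue and kernel are assumed to have been created already), the N-D domain and the work-group size are what get passed to clEnqueueNDRangeKernel:

#include <CL/cl.h>

static void launch(cl_command_queue queue, cl_kernel kernel)
{
    size_t global_work_size = 1024;   /* the 1-D computation domain: total work-items */
    size_t local_work_size  = 64;     /* work-items per work-group                    */

    clEnqueueNDRangeKernel(queue, kernel,
                           1,         /* dimensionality of the domain */
                           NULL,      /* no global offset             */
                           &global_work_size, &local_work_size,
                           0, NULL, NULL);
}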

Task parallelism:

focuses on distributing execution processes (threads) across different parallel computing nodes.

This can be achieved by synchronizing work-items within a work-group.

OpenCL Objects

  • Setup objects:
  1. Devices: GPU, CPU, Cell.
  2. Context: a collection of devices.
  3. Queues: submit work to a device.
  • Memory objects:
  1. Buffers: blocks of memory.
  2. Image objects: 2D or 3D images.
  • Execution objects:
  1. Programs.
  2. Kernels.

How do you submit work to the computing devices in the system?

There are three basic steps:
  1. Compile the programs you wrote.
  2. Set the arguments and parameters of each kernel to the desired values and create the memory objects and buffers.
  3. Use command queues to enqueue those kernels and send the code for execution.
Before these steps can run, we must know the number and types of devices and hardware we have.
First, query for the devices in the system using clGetDeviceIDs.
Then create a context to put the devices in, so that they can share data and communicate; this is achieved using clCreateContext.
The last thing you have to do is create a command queue that allows us to talk to these devices.
N.B. A multi-core device is considered one device.
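
A minimal sketch of these setup steps (not from the original article; error handling is omitted and CL_DEVICE_TYPE_GPU is just one possible choice of device type):

#include <CL/cl.h>

/* Create a context and a command queue for the first GPU found. */
static cl_command_queue setup(cl_context *ctx_out, cl_device_id *dev_out)
{
    cl_int err;
    cl_platform_id platform;

    clGetPlatformIDs(1, &platform, NULL);                             /* first available platform */
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, dev_out, NULL);   /* query for a device       */
    *ctx_out = clCreateContext(NULL, 1, dev_out, NULL, NULL, &err);   /* context for the device   */
    return clCreateCommandQueue(*ctx_out, *dev_out, 0, &err);         /* queue to talk to it      */
}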

Simple Example – Vector Addition Kernel

The following is a simple vector addition kernel written in OpenCL. You can see that the kernel specifies three memory objects: two for input, a and b, and a single output, c. These are arrays of data that reside in the global memory space. In this example, the compute unit executing this kernel gets its unique work-item ID and uses it to complete its part of the vector addition by reading the appropriate values from a and b and storing the sum into c.
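
The kernel listing itself is not reproduced in this copy of the article; a minimal sketch matching the description (inputs a and b, output c, all in global memory, indexed by the work-item ID) looks like this:

__kernel void vec_add(__global const float *a,
                      __global const float *b,
                      __global float *c)
{
    int gid = get_global_id(0);   /* unique work-item ID */
    c[gid] = a[gid] + b[gid];
}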

Since, in this example, we will be using online compilation, the above code will be stored in a character array named program_source.

To complement the compute kernel code, the following is the code run on the host processor to:

  • Open an OpenCL context,
  • Get and select the devices to execute on,
  • Create a command queue to accept the execution and memory requests,
  • Allocate OpenCL memory objects to hold the inputs and outputs for the compute kernel,
  • Online compile and build the compute kernel code,
  • Set up the arguments and execution domain,
  • Kick off compute kernel execution, and
  • Collect the results.
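
The full host listing is not reproduced here; a condensed sketch of the memory-object, argument and result-collection steps from the list above (not from the original article; n is the vector length, a, b and c are host arrays, and error handling is omitted) could look like:

#include <CL/cl.h>

static void run_vec_add(cl_context ctx, cl_command_queue queue, cl_kernel kernel,
                        const float *a, const float *b, float *c, size_t n)
{
    cl_int err;
    cl_mem da = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                               n * sizeof(float), (void *)a, &err);   /* inputs */
    cl_mem db = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                               n * sizeof(float), (void *)b, &err);
    cl_mem dc = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY,
                               n * sizeof(float), NULL, &err);        /* output */

    clSetKernelArg(kernel, 0, sizeof(cl_mem), &da);                   /* set the kernel arguments */
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &db);
    clSetKernelArg(kernel, 2, sizeof(cl_mem), &dc);

    size_t global = n;
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(queue, dc, CL_TRUE, 0, n * sizeof(float), c,
                        0, NULL, NULL);                               /* collect the results */
}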

FUTURE of OpenCL

It is really hard to say whether OpenCL will continue or not, but I think the future lies with OpenCL, as it is an open standard that is not restricted to one vendor or to specific hardware. Another reason is that AMD is going to release a new processor called Fusion, AMD’s forthcoming CPU + GPU product on one hybrid silicon chip.

This processor would be perfect for OpenCL, as OpenCL doesn’t care what type of processor is available, as long as it can be used.



Introduction

While Moore’s Law continues to predict the doubling of transistors on an integrated circuit every 18 months, performance and power considerations have forced chip designers to embrace multi-core processors in place of higher frequency uni-core processors. As desktop and high-performance computing architectures tend towards distributed collections of multi-core nodes, a new parallel programming paradigm is required to fully exploit the complex distributed and shared-memory hierarchies of these evolutionary systems. Recently, a programming model has been developed that has the potential to exploit the best features of this distributed shared-memory architecture. Not only does this model promise improved runtime performance on distributed clusters of SMPs, its data and execution semantics support increased programmer productivity. This model is called the Partitioned Global Address Space (PGAS) model. The PGAS paradigm provides both a data and an execution model that has the potential to dramatically improve runtime performance and programmer productivity on multi-core architectures using shared memory.

Memory Models

There are two models for memory usage:

  1. Shared Memory Model.
  2. Distributed Memory Model

Shared Memory Model
The shared-memory programming model typically exploits a shared memory system, where any memory location is directly accessible by any of the computing processes (i.e. there is a single global address space). This programming model is similar in some respects to the sequential single-processor programming model, with the addition of new constructs for synchronizing multiple access to shared variables and memory locations.

Distributed Memory Model

The distributed-memory programming model exploits a distributed-memory system where each processor maintains its own local memory and has no direct knowledge about another processor’s memory (a “share nothing” approach). For data to be shared, it must be passed from one processor to another as a message.

Why PGAS?

PGAS offers the best of both worlds. This parallel programming model combines the performance and data locality (partitioning) features of distributed memory with the programmability and data-referencing simplicity of a shared-memory (global address space) model. The PGAS programming model aims to achieve these characteristics by providing:

  1. A local-view programming style (which differentiates between local and remote data partitions).
  2. A global address space (which is directly accessible by any process).
  3. Compiler-introduced communication to resolve remote references.
  4. One-sided communication for improved inter-process performance.
  5. Support for distributed data structures.

In this model, variables and arrays can be either shared or local. Each process has private memory for local data items and shared memory for globally shared data values. While the shared memory is partitioned among the cooperating processes (each process contributes memory to the shared global memory), a process can directly access any data item within the global address space with a single address.

Languages of PGAS

Currently there are three PGAS programming languages that are becoming commonplace on modern computing systems:

  1. Unified Parallel C (UPC)
  2. Co-Array Fortran (CAF)
  3. Titanium

Unified Parallel C (UPC)

UPC is a parallel extension of ANSI C that provides a distributed shared-memory parallel programming language. It keeps a common and familiar syntax and semantics for parallel C, with simple extensions to ANSI C. UPC provides standard library functions to move data to and from shared memory, which can be used to move chunks within the shared space or between shared and private spaces.

UPC Execution Model

A number of threads work independently in SPMD (Single Program, Multiple Data) fashion. MYTHREAD specifies the thread index (0..THREADS-1), and the number of threads is specified at compile time or at run time. There is no implicit synchronization among the threads; synchronization happens only when needed. There are four mechanisms:

  1. Barriers: blocking and non-blocking.
  2. Locks: to protect data against multiple writers.
  3. Memory consistency control: has to do with the order of shared operations.
  4. Fence: equivalent to a null strict reference, used to ensure that all shared references have been issued.

A Quick Example

//vect_add.c
#include <upc_relaxed.h>
#define N 100*THREADS

shared int v1[N], v2[N], v1plusv2[N];

void main()
{
    int i;
    for (i = 0; i < N; i++)
        if (MYTHREAD == i % THREADS)      /* each thread handles its own elements */
            v1plusv2[i] = v1[i] + v2[i];
}

UPC Runtime Model

The figure shows the high-level system diagram for a UPC application compiled using the Berkeley UPC compiler. The generated C code runs on top of the UPC runtime system, which provides platform independence and implements language-specific features such as shared memory allocation and shared pointer manipulation. The runtime system implements remote operations by calling the GASNet communication interface, which provides hardware-independent lightweight networking primitives.

UPC Memory Model

A shared pointer can reference all locations in the shared space, while a private pointer may reference only addresses in its private space or in its portion of the shared space. Static and dynamic memory allocation are supported for both shared and private memory.

UPC Pointers

There are four different ways of declaring pointers in UPC, and each declares a different type of pointer:

  1. int *p1; A private pointer pointing locally. It can be used to access private data or local shared data.
  2. shared int *p2; A private pointer pointing into the shared space. It can be used for independent access of threads to data in the shared space.
  3. int *shared p3; A shared pointer pointing locally; this form is not recommended.
  4. shared int *shared p4; A shared pointer pointing into the shared space. It can be used for common access of all threads to data in the shared space.

Co-Array Fortran (CAF)

CAF is a simple extension to Fortran 90 that allows programmers to write efficient parallel applications using a Fortran-like syntax. It also assumes the SPMD programming model, with replicated data objects called co-arrays. Co-array objects are visible to all processors, and each processor can read and write data belonging to any other processor by setting the index of the co-dimension to the appropriate value. CAF creates multiple images of the same program, where text and data are replicated in each image. It marks some variables with co-dimensions that behave like normal dimensions and express a logical problem decomposition. It also allows one-sided data exchange between co-arrays using a Fortran-like syntax. On the other hand, CAF requires the underlying run-time system to map the logical problem decomposition onto specific hardware.

CAF Syntax

The CAF syntax is a simple parallel extension of normal Fortran syntax: it uses normal round brackets () to address data in local memory, and square brackets [] to address data in remote memory.

CAF Execution Model

The number of images is fixed and each image has its own index, retrievable at run time. Each image executes the same program independently of the others and works on its own local data. An image moves remote data to local data through explicit CAF syntax, while an “object” has the same name in each image. The programmer inserts explicit synchronization and branching as needed.

CAF Memory Model

There are four memory models:

  1. One-to-one model.
  2. Many-to-one model.
  3. One-to-many model.
  4. Many-to-many model.

What do co-dimensions mean?

real :: x(n)[p,q,*]

  • Replicate an array of length n, one on each image.
  • Build a map so each image knows how to find the array on any other image.
  • Organize images in a logical (not physical) three dimensional grid.
  • The last co-dimension acts like an assumed size array: *
  • A specific implementation could choose to represent memory hierarchy through the co-dimensions.

CAF I/O

There is one file system visible to all images, and an image can open a file alone or as part of a team. The programmer controls access to the file using direct-access I/O and CAF intrinsic functions.

Titanium

Titanium is based on Java, but at compile time it is first compiled to C and then to machine code. It has the same SPMD parallelism model as UPC and CAF, but dynamic Java threads are not supported. Titanium analyzes global synchronization and optimizes pointers, communication and memory. Titanium's global address space is based on pointers rather than shared variables. There is no distinction between a private and a shared heap for storing objects; any object may be referenced by global or local pointers.

Titanium features over Java

  • Multi-dimensional arrays: iterators, sub arrays, copying.
  • Immutable “value” classes.
  • Templates.
  • Operator overloading.
  • Scalable SPMD parallelism replaces threads.
  • Global address space with local/global reference distinction.
  • Checked global synchronization.
  • Zone-based memory management (regions).
  • Libraries for collective communication, distributed arrays, bulk I/O, performance profiling.

Titanium Execution Model

Titanium has the same execution model as UPC and CAF. Basic Java programs may be run as Titanium programs, but then all processors do all the work. E.g. a parallel hello world:

class HelloWorld {
    public static void main (String [] argv) {
        System.out.println("Hello from proc " + Ti.thisProc()
                           + " out of " + Ti.numProcs());
    }
}

Titanium Runtime Model

The latest versions of Titanium include distributed-memory backends that communicate using GASNet, a high-performance communication interface designed especially for SPMD global address-space languages like Titanium (and UPC). It offers better portability and higher-level operations which can leverage hardware-specific features that support a global-address-space model. Titanium also supports Active Messages 2.0 as the standardized networking interface for some of the older cluster-based parallel backends. Active Messages is a low-level, high-performance communication paradigm, first proposed by von Eicken et al., that basically amounts to a super-lightweight RPC mechanism; it is generally implemented as a zero-copy, fully user-level protocol that is highly tuned to the networking hardware. Titanium uses several different AM 2.0 implementations for its various backends: Lanai AM, AMMPI, AMUDP and AMLAPI.

Titanium Memory Model

The globally shared address space is partitioned, and pointers are either local or global. Global pointers may point to remote locations.

Conclusion

  • UPC is easy to program in for C writers, at times significantly easier than alternative paradigms.
  • UPC exhibits very little overhead when compared with MPI for problems that are parallel.
  • The CAF syntax gives the programmer more control and flexibility.
  • Co-dimensions in CAF match any problem decomposition.
  • CAF performance is better than that of library-based models.

Titanium has all the benefits of Java plus all the features that have been added to handle parallel programming.


In today’s world, multi-core architectures are becoming more and more popular, but with a different architectural concept comes a new programming model. Implementing parallel programs with tools that are not specifically designed for parallelism can be very tricky and usually fails once an application starts to grow bigger. OpenMP is one of the leading multiprocessing APIs nowadays.

Programmers Motivation

To demonstrate the difficulty of parallelism in coding, the simple example shown below has been prepared.

Incrementer: Increments all members of an array.

Given the basic task of incrementing all elements of a large array it is additionally required to add parallelization to make use of existing multi-core hardware. Noticing that all elements are independent of each other, a first step would be to divide the loop into segments. Rather than incrementing each element in sequence, sets of iterations should be formed and incremented in parallel. This can be achieved through threading. Each thread will work on a portion of the array while the operating system will abstract further scheduling details. The concept of distributing the task over n threads yields its own problems already. Many questions arise:

What is the most suitable size for n?

What specific task will each thread carry out (can it be generic)?

Is parallelization always desired?

How will a larger application handle the large number of threads?

OpenMP solves most of the previously mentioned problems.

What is OpenMP?

Open: Open Specification

M: Multi

P: Processing

OpenMP is an API available for C/C++ and Fortran, which runs on most platforms. It allows the programming of shared memory architectures and aims at code parallelization to improve performance on multi-core computers.

OpenMP was implemented by a set of large hardware manufacturers such as Intel, IBM and Digital. A forum was created in order to exchange ideas for a new standard in shared-memory computing, which did not exist at that time; work designed for a certain platform was not reusable on other platforms. OpenMP was first released in 1997 and is nowadays compatible with most computer platforms.

Main components of OpenMP

Open Extensions: http://en.wikipedia.org

While parallel control structures are used to parallelize some code, work sharing is used for distributing work (e.g. breaking up loops). The data environment defines the scope of variables within parallel regions. Synchronization is a critical element and describes the process of thread ordering/sequencing in a chronological sense. Runtime routines allow an application to interact with the environment, like retrieving platform specific information. The OpenMP architecture has the following structure.

OpenMP Architecture: http://www.lrz-muenchen.de

OpenMP works with threads which are handled by the operating system. Programmers can put directives in their code to indicate parallelization at certain points, while the environment variables determine runtime behaviour (resource allocation, scheduling algorithm, number of threads).

Directives in OpenMP (specifically for C)

The general form of a directive in C is: #pragma omp <rest of pragma>.

Note: pragmas are ignored by compilers that do not recognize them.

Example usage of directives follows:

#pragma omp parallel {}: Parallelization of the code enclosed by the braces.

#pragma omp parallel do/for: Parallelization of a do/for-loop.

Parallel Control

Open MP & Threads: http://cs.calvin.edu

The parallel directive parallelizes a piece of code as shown in the example below. On its own it does not distribute the load among the threads, but rather runs a copy of the code on each available thread.

Hence, in the following example the expected output is a message from each thread in the form “Thread no: x”. If it is required to share the work between threads, work sharing is in order.

Parallel directive in OpenMP
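
The listing referred to above does not survive in this copy; a minimal sketch producing the described “Thread no: x” output could look like this:

#include <stdio.h>
#include <omp.h>

int main(void)
{
    #pragma omp parallel              /* every thread executes a copy of this block */
    {
        printf("Thread no: %d\n", omp_get_thread_num());
    }
    return 0;
}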

Work Sharing

Incrementer: Implementation in OpenMP
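
The original listing is not reproduced here; a minimal sketch of the incrementer using the omp parallel for directive (SIZE and the array a are assumptions):

#include <omp.h>

#define SIZE 1000000

int main(void)
{
    static int a[SIZE];
    int i;

    #pragma omp parallel for          /* the loop iterations are divided among the threads */
    for (i = 0; i < SIZE; i++)
        a[i]++;

    return 0;
}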

To provide an actual work-sharing example, the original incrementer problem has been modified as shown above. With the omp parallel for directive, the work is distributed over several threads. This is incredibly powerful, since existing code can easily be modified to run in parallel. Given that pragmas are ignored by compilers that do not support an OpenMP extension, the code can at the same time be compiled as code written for a single-core computer. Work sharing can also be applied to whole blocks of code (other than just loops); these blocks are referred to as sections. The following example shows a server application with three main tasks that require parallelization.

OpenMP Server Example

The server should listen to incoming connections, update the IP address on a dynamic DNS server and capture some streaming video input. Using the section keyword this can be achieved easily.
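
A sketch of how the section keyword could express this (the three task functions are placeholders, not from the original post):

void listen_for_connections(void);
void update_dynamic_dns(void);
void capture_video_stream(void);

void run_server(void)
{
    #pragma omp parallel sections
    {
        #pragma omp section
        listen_for_connections();     /* task 1 */

        #pragma omp section
        update_dynamic_dns();         /* task 2 */

        #pragma omp section
        capture_video_stream();       /* task 3 */
    }
}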

Loop Scheduling in OpenMP (Load Balancing)

Load balancing is one of the most important factors in parallel computing. It ensures that all threads stay as busy as possible in order to make use of the computational power that is available. Some iterations within a loop may need significantly more time than other iterations. Normally OpenMP assumes that all iterations of a loop require the same amount of time, and hence schedules approximately equally sized chunks to the threads. This approach minimizes the chances of memory conflicts (false sharing); on the other hand, it may come at a high cost in terms of load balancing, since a particular set of iterations may require more time than others. The opposite approach yields the opposite advantages and disadvantages. Under normal circumstances the programmer knows which division of the iterations optimizes the execution time (or at least can benchmark it). Therefore it is often desirable to decide how tasks should be scheduled to the available threads. OpenMP supports a set of scheduling clauses that allow a programmer to do so. The general form is: schedule(type, chunk_size). There are four types of scheduling. STATIC scheduling divides the iterations into pieces of size chunk_size, which are then scheduled to the available threads in a round-robin fashion. This has a severe problem if some threads finish earlier but stay idle because it is not their turn. To overcome this problem, DYNAMIC scheduling can be used: it schedules waiting chunks to the first thread that has finished its work. GUIDED scheduling uses dynamic scheduling but does not divide the iterations into equally large chunks; instead it divides all remaining iterations by the number of threads, schedules one such chunk to a thread, then divides the remaining iterations by the number of threads again, and so on, until a chunk reaches chunk_size (the chunk size basically decreases exponentially towards chunk_size). RUNTIME scheduling simply checks the OMP_SCHEDULE environment variable to decide on the scheduling algorithm.
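
For example, a loop whose iterations vary widely in cost could be scheduled dynamically in chunks of four iterations (a sketch; process() and N are placeholders, not from the original post):

#pragma omp parallel for schedule(dynamic, 4)
for (int i = 0; i < N; i++)
    process(i);     /* chunks of 4 iterations are handed to whichever thread becomes free */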

The data environment

In parallel regions it is important to decide on the scope of variables. It makes a significant difference whether a variable is declared as shared among all threads or as private to each thread. Shared variables can be modified and read by all threads; a change made by one thread affects the value seen by all threads. In some cases this is important, for instance when a variable is used to enable communication between various threads. In contrast, some variables need to be private to a thread (e.g. a loop counter). Since private variables can be seen as new instances, further complications arise: with what value will a variable be initialized inside every thread, and what will its value be once the parallel region is exited? There are three clauses to handle this issue. FIRSTPRIVATE is used to initialize all “copies” of a variable with its value before entering the parallel region. LASTPRIVATE makes sure that the last thread that finishes executing writes its value back into the variable. Sometimes it is required to express the resulting value as a function of the last values within all threads.

To do so the REDUCTION(op:variable) directive can be used, where op is an operator (+, *, -, &, ^, |, && or ||) which does not involve any possible precedence issues and variable is the variable to which the reduction should be applied.
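
A short sketch of the REDUCTION clause summing an array (a and N are placeholders, not from the original post):

double sum = 0.0;
#pragma omp parallel for reduction(+:sum)
for (int i = 0; i < N; i++)
    sum += a[i];    /* each thread accumulates a private copy of sum; the copies are combined at the end */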

Synchronization

When threads are not synchronized and do not keep a chronological order required by a program, this can have an undesired impact on other threads and hence on the entire program. Thread synchronization is a critical part of parallel programming and there are several techniques to synchronize threads properly, three of which are described here. To force only one thread to run at a time, the CRITICAL directive can be used. Imagining a problem with multiple threads and a global variable v, whose value must be modified and read by multiple threads in a fixed order within the entire program, makes the importance of an only-one-thread-at-a-time policy quite comprehensible. Sometimes a certain order is required but with no global nature. Thinking of a loop with a flow dependency (e.g. array[i] = array[i-1] * i), there needs to be a sequencing directive with a local characteristic. ORDERED can take care of such a case: it sorts the thread execution of loop iterations in the same way they would execute in a sequential program.

To simply make threads wait for one another before continuing with different work, the BARRIER directive can be placed within a parallel region.
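
A sketch of CRITICAL and BARRIER in use (the helper functions and the shared counter are placeholders, not from the original post):

#pragma omp parallel
{
    do_local_work(omp_get_thread_num());

    #pragma omp critical              /* only one thread at a time updates the shared variable */
    shared_counter++;

    #pragma omp barrier               /* wait here until every thread has arrived */

    use_combined_result();
}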

Thread count

The number of threads very much depends on the application. When the targeted applications do mainly calculations it is recommended to use as many threads as there are processors. This is reflected by the following graph which demonstrates the performance vs. the thread count on a quad core processor.

Computation of pi; Speedup related to CPU vs. Thread count: http://www.math.utah.edu

I/O, on the other hand, tends to behave quite differently in this respect. The following graph indicates that an increased number of threads does improve the throughput of I/O devices.

Process Count vs. throughput: http://www.open-mag.com

Conclusion: The correct thread count depends on the application.

Hybrid model

OpenMP tends to have serious issues when a large number of threads is used. An additional limitation of OpenMP is that it only runs on shared-memory systems.

MPI & OpenMP Hybridization: http://www.erdc.hpc.mil

A model which combines OpenMP with MPI (Message Passing Interface) has evolved recently. It uses the best of both: while MPI supports distributed-memory systems and is capable of handling many threads at the same time, OpenMP is easier to implement and to maintain. In the hybrid model, OpenMP is used to subdivide a program further into threads to increase performance, while MPI forms the backbone of the program.

Pros & Cons of OpenMP

Advantages:

Easy to program / easy to maintain.

Supports parallelization and sequential code execution.

Disadvantages:

Runs only on shared memory computers.

No direct GPU support.

Needs a compiler with OpenMP support.

Fails when using a large number of threads.

Performance of OpenMP vs. other models

This graph shows a benchmark that was tested on an 8-core computer running Linux. When the application is coarse-grained the performance of OpenMP is very satisfying.

Once the same problem is fine-grained, OpenMP fails, which may very well be due to the fact that OpenMP is not good at handling communication between too many threads.

The next benchmark shows the Smith-Waterman algorithm for sequence matching. During this benchmark OpenMP always remains at top positions in performance.

Link to the original presentation: Presentation

To learn more about OpenMP, check the following links:

Official OpenMP website, also check out the links provided there.
OpenMP Tutorial, by Blaise Barney, Lawrence Livermore National Laboratory.
cOMPunity, the OpenMP community.
OpenMP 3.0 specifications


Introduction

Server Overload

Let’s say you have a company and you have a website that lets members exchange conversations. At first you start with a few users talking to each other. Then those users start to tell their friends about your website, and those friends start to tell other friends about it, and so on, until your servers can’t keep up with the demand. In order to work around this problem you either buy or rent some servers and set them up, or ask someone else to set them up for you. Of course this approach costs a lot of money and takes a lot of time.

Now there is a better option: instead of hosting your website on your own dedicated server, you can host it on the cloud. With cloud computing you have access to computing power almost instantly when you need it. So if you suddenly need more computing power you can scale up as much as you need in the cloud, on demand, and the clients won’t notice any difference. In general, with cloud computing you don’t focus on hardware or software, you only focus on what you need.

What’s Cloud Computing

Cloud Computing

Cloud computing is internet-based computing where shared resources, software and information are provided to computers and other devices on demand, like a public utility. “Wikipedia”

Oracle CEO Larry Ellison said: “Cloud computing is using a computer that’s out there”. The cloud in cloud computing provides the means through which everything, from computing power and computing infrastructure to applications, business processes and personal collaboration, can be delivered to you as a service wherever you need it.

The term “cloud” is used as a metaphor for the “Internet”, which is an abstraction of the infrastructure it represents. Cloud computing typically provides online services which can be accessed using a web service or a browser, while software and data are stored on servers.

Cloud Computing Characteristics

In general, cloud computing users or customers (us) do not own the physical infrastructure. Thus, we don’t have to worry about the machines that run the application, because they don’t belong to us, and we don’t have to devote our time and resources to maintaining an application. All the hardware and software are on the cloud: you just subscribe to the resources you need as a service and only pay for the resources that you use. This approach means that overall computer utilization rises dramatically, as customers don’t have to engineer for load limits.

Cloud Computing Key Features

Cloud computing has many features, some of the features are:

– Cost

Cost is claimed to be greatly reduced, as you don’t need to buy any hardware or software; you are only charged for the amount of resources you use. An example of a free cloud-computing service is Google Docs (http://docs.google.com/). With Google Docs you can create documents, spreadsheets and presentations online, so you don’t need to buy Microsoft Office for Windows or iWork for Mac. All you need is a PC capable of running a browser, and an internet connection.

-Virtualization

Virtualizing an application means packaging the application with everything it needs to run, including database, middleware and operating system, into a self-contained image that can run anywhere on the cloud.

As we know, the cloud can offer software as a service, but applications are hard to deploy on new environments (to meet their requirements). So people thought: why not separate the application from the infrastructure, so that application and infrastructure become independent? Virtualization separates resources and services from the underlying physical delivery environment, allowing servers to be easily shared by many applications.

-Multitenancy

Multitenancy is the opposite of virtualization: in a multitenant environment, multiple customers share the same application, running on the same OS, on the same hardware. The distinction between the customers is achieved during application design, so they don’t share or see each other’s data. Multitenancy also improves the utilization of the systems, as many customers will be using the same hardware.

-Scalability

Regarding hardware, scalability is the ability to go from small to large amounts of processing power with the same architecture. Regarding software products such as databases, it refers to the consistency of performance per unit of power as hardware resources increase.

The service provider can’t anticipate how customers will use the service. One customer might use the service three times a year during peak selling seasons, whereas another might use it as a primary development platform for all of its applications. Therefore, the service needs to be available all the time and it has to be designed to scale upward for high periods of demand and downward for lighter ones. Scalability also means that an application can scale when additional users are added and when the application requirements change.

-Reliability

Reliability improves through the use of multiple redundant sites and through the ability of a system to provide uninterrupted service despite the failure of one or more of its components. This is called “fault tolerance”: keeping individual failures from turning into service outages.

-Maintenance

Since a third party owns the hardware and software and just provides them as a service, maintenance isn’t our problem; the cloud service provider is the one responsible for hardware maintenance and for software upgrades and maintenance.

Cloud Computing Layers

Cloud Computing Layers

Cloud computing consists of five layers:

  1. Client.
  2. Application.
  3. Platform.
  4. Infrastructure.
  5. Servers.

Cloud infrastructure services, or “Infrastructure as a Service” (IaaS), deliver computer infrastructure, typically a platform virtualization environment, as a service. Rather than purchasing servers, software, data center space or network equipment, clients buy those resources as a fully outsourced service. The service is typically billed on a utility computing basis, and the amount of resources consumed will typically reflect the level of activity. It is an evolution of virtual private server offerings.

The cloud infrastructure consists of one or more servers. The servers layer consists of computer hardware and computer software products that are specifically designed for the delivery of cloud services, including multi-core processors, cloud-specific operating systems and combined offerings.

There are two different types of services that a cloud can offer:

  1. SaaS (Software as a Service): Google is one of the free cloud providers that provide software as a service (Google Apps).
  2. HaaS (Hardware as a Service): Amazon’s EC2 is one of the providers that offer different computing-power models (see the Amazon EC2 section below).

Amazon’s EC2

Amazon Elastic Compute Cloud (Amazon EC2) is a web service that provides resizable compute capacity in the cloud. It is designed to make web-scale computing easier for developers. It provides a basic measure of an EC2 compute unit: One EC2 Compute Unit (ECU) provides the equivalent CPU capacity of a 1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor.

Amazon’s EC2 executes an application on a virtual computer that’s called an instance. You have the choice of several instance types, allowing you to select a configuration of memory, CPU, and instance storage that is optimal for your application.

Amazon EC2 instances are grouped into three families: Standard, High-Memory and High-CPU. Standard Instances have memory to CPU ratios suitable for most general purpose applications; High-Memory instances offer larger memory sizes for high throughput applications, including database and memory caching applications; and High-CPU instances have proportionally more CPU resources than memory (RAM) and are well suited for compute-intensive applications. When choosing instance types, you should consider the characteristics of your application with regards to resource utilization and select the optimal instance family and size.

Amazon EC2 also currently supports a variety of operating systems, including Red Hat Linux, Windows Server, openSUSE Linux, Fedora, Debian, OpenSolaris, CentOS, Gentoo Linux, and Oracle Linux.

Standard Instance

They are well suited for most applications.

High-Memory Instance

They offer large memory sizes for high-throughput applications, including database and memory-caching applications.

High-CPU Instance

They offer more CPU resources than memory (RAM) and are well suited for compute-intensive applications.

Cacti

Cacti is another cloud computing service offering (SaaS). It’s a complete network graphing solution designed to handle statistics and signal-processing data like network bandwidth, temperatures, CPU load etc. It stores all of the information necessary to create graphs and populate them with data in a MySQL database.



Ever since developers found out that increasing the frequency of the processor in order to increase its total performance would not be efficient (or would not work at all), research has shifted to increasing the number of processors on the die and enhancing the communication between them, which started the multiprocessor “movement”.

The newer trend was to design multiprocessors composed of processors that differ in architecture (heterogeneous multiprocessors), as we have seen in previous presentations/blog posts with the IBM Cell processor, rather than having a homogeneous multiprocessor.

The latest trend is to offload certain applications or instructions from the CPU to the GPU, creating the GPGPU “movement”.

Why GPGPUs?

GPUs have a huge advantage over CPUs, which is illustrated in the following chart:

(Source: SuperComputing Tutorial 2007, Introduction by David Luebke)

GPUs are optimized for floating-point operations, as well as for parallelism. A typical multi-core CPU consists of 4 cores (as with most of the Intel i-Series processors), while a GPU is composed of tens of processors. This is because CPUs could be considered memory-based processors while GPUs could be called ALU-based, which allows the GPU to perform more operations in parallel, resulting in high GFLOPS (FLOPS = floating-point operations per second) compared to the CPU.

Basic structure of a typical CPU (left) and GPU (right) (Source: SuperComputing Tutorial 2007, Introduction by David Luebke)

What is CUDA?

CUDA is the missing link between the developer and the GPU. It was developed by NVIDIA and is implemented in all NVIDIA GPUs starting with the G80 series. Before there were programming architectures dedicated to programming the GPU, a programmer had to choose between dealing with the complex APIs of the GPU or “tricking” it by passing a texture that contains the data or the instructions to the GPU and then receiving the results back in the form of a texture, which typically creates a lot of overhead.

CUDA processors are programmed in CUDA C, which is basically C/C++ with some CUDA extensions, which will be mentioned and explained later on. It is important to know that in early versions of CUDA the GPU code had to be written in C, while the CPU code could be written in either C or C++. This matters when writing code, since the developer must know at all times whether the code is compiled for the CPU or for the GPU. Starting from CUDA 3.0, more C++ features have been enabled for code compiled for the GPU.

CUDA Structure and Terminology

Thread : The smallest unit executing an instruction.

Block : Contains several threads.

Warp : A group of threads physically executed in parallel (usually running the same application).

Grid : Contains several thread blocks.

Kernel : An application or program that runs on the GPU.

Device : The GPU.

Host : The CPU.

(Source: NVIDIA CUDA Compute Unified Device Architecture Programming Guide Version 1.1)

Addressing

Threads and blocks need to have a unique ID so that they can be referred to while writing code. This is important, since threads and blocks are the main components when it comes to writing efficient parallel code. Within each block a thread is uniquely identified by its Thread ID, which is an integer between 0 and n, where n is the total number of threads within the same block. In a more complex approach, when dealing with 2D or 3D blocks, the Thread ID is calculated as a function rather than being a fixed integer that represents the number of the thread inside the block. Inside a 2D block, threads are positioned at (0,0), (0,1), … (n-1, n-1), as shown in the previous figure. Since the Thread ID must be of type uint3 (which will be treated as a normal integer for now), something like (3,2) is not directly usable as a Thread ID.

The function for getting the Thread ID in a 2D Block is:

x + y * Dimx,

where ‘x’ and ‘y’ are the x and y indices of the thread and Dimx the x dimension of the block (in the case of the figure Dimx would be equal to 5).

In a 3D block the same applies with the difference of dealing with one more dimension which results in the function:

x + y * Dimx + z * Dimx * Dimy,

where x, y and z are the indices (same as for the 2D block) and Dimx and Dimy are the x, y dimensions of the block.

When addressing a block within a grid the same rules of addressing a thread in a block apply. We do not deal with 3D grids in CUDA, though.
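
As a sketch (not from the article), this is how the block and thread indices described above are typically combined inside a kernel into one global element index for a 1-D launch; the kernel name and parameters are placeholders:

__global__ void increment(float *data, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;   /* block offset + thread index within the block */
    if (idx < n)                                       /* guard against threads beyond the data size   */
        data[idx] += 1.0f;
}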

Memory Model

Each thread has a private register set, which is accessible in 0 clock cycles, and local memory, which is the fastest memory accessible from the thread. Each block has a shared memory, which can be written to and read from by all threads in the same block (hence the name shared memory). The shared memory is the best and fastest way for the threads to communicate. Storing data in the shared memory is expensive, though, due to its relatively small size, therefore only variables which are needed by all the threads should be stored in it. Each grid has a global memory, which is accessible from all the blocks inside it (and therefore also from all the threads inside those blocks). On a higher level, each processor has its own cache and the whole device (the GPU) has a DRAM.

Of course, the higher the level of the memory, the bigger its size, and the cost (in time units or clock cycles) to access it – be it a read or, where allowed, a write – grows along with it. This is due to the increased distance between the unit trying to access the memory and the memory itself, and to the latency of the memory.

Execution

It is usually most beneficial for the developer to run a block on a single processor. Consider an example where a 5 x 5 block is being executed while 25 processors are idle (this is just an example; it is unlikely – almost impossible – to have such a large number of idle processors), which means that in theory each thread could be run by its own processor. Since the running program is one optimised for parallelism (or else it would not make much sense to run it on a GPU, and even less sense to share it among 25 threads), there is a lot of data that must be shared between the threads. Because the threads run on different processors, this data cannot be placed in the shared memory of the block and has to go to a higher memory (the DRAM in the worst case), and each thread has to access that memory to retrieve the data it needs. For comparison: issuing an access to the shared memory or to the global memory costs about 4 clock cycles, but a global memory access consumes about 400 – 600 additional clock cycles of memory latency.

So although a block divided across several processors would, in most cases and in theory, give a better raw execution time, the time needed to fetch data from memory and the resulting idle time lead to a much worse overall performance than running the whole block on the same processor.

What actually happens is that the processor in charge of a certain block divides the threads into warps, which are then executed in parallel. Each warp in the block is given a share of the execution time until the whole kernel is executed.

CUDA Extensions

There are four main extensions (excluding custom libraries) to C/C++ that make up CUDA C.

1. Function Type Qualifiers

These qualifiers are written when declaring a function to decide whether the function is executed on, and called from, the device or the host. There are three qualifiers:

  • __device__: the function is called from the device and executed on the device
  • __global__: the function is called from the host and executed on the device (this is the qualifier used for kernels)
  • __host__: the function is called and executed on the host

A function with a CUDA qualifier has the form:

__device__ return_type function_name (parameters),

where of course “__device__” can be replaced by any of the function qualifiers.

The two qualifiers __host__ and __device__ can be combined to create a function that is compiled for both the host and the device.
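
A minimal sketch of the qualifiers in use (function and parameter names are made up for this illustration):

// called from device code and executed on the device
__device__ float square(float v) { return v * v; }

// a kernel: called from the host, executed on the device
__global__ void square_all(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = square(data[i]);
}

// called and executed on the host
__host__ void prepare_input(float *data, int n)
{
    for (int i = 0; i < n; ++i)
        data[i] = (float)i;
}

// compiled for both the host and the device
__host__ __device__ float twice(float v) { return 2.0f * v; }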

2. Variable Type Qualifiers

Similar to the function qualifiers, the variable qualifiers decide the lifetime of a variable and in which memory it is stored.

  • __device__: the variable has the lifetime of the application and is stored in the global memory, which makes it accessible from all blocks within the grid
  • __shared__: the variable has the lifetime of the block and is stored in the shared memory, hence accessible from all threads within the block
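
A small sketch of the two variable qualifiers (names are made up; the kernel is assumed to be launched with a single block of 256 threads):

// lives in global memory for the lifetime of the application,
// visible to all threads of all blocks
__device__ float scale_factor = 2.0f;

__global__ void example_kernel(float *out)
{
    // lives in the shared memory of the block for the lifetime of the block,
    // visible to all threads of this block
    __shared__ float partial[256];

    partial[threadIdx.x] = scale_factor;
    out[threadIdx.x] = partial[threadIdx.x];
}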

3. Execution Configuration

When calling a __global__ function (a kernel), the dimensions of the grid and of the blocks in which this function is to be executed must be specified. This is called the execution configuration.

<<<GridDimension, BlockDimension, Ns, S>>>

GridDimension: dimension of the grid, i.e. how many blocks it contains.

BlockDimension: dimension of each block, i.e. how many threads it contains.

Ns (optional): the number of bytes of shared memory allocated per block for this call.

S (optional): specifies the stream associated with this function.
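
As a sketch, a call with an execution configuration could look as follows (example_kernel and d_out are made-up names for a kernel and a device pointer):

dim3 grid(16, 16);   // the grid contains 16 x 16 blocks
dim3 block(8, 8);    // each block contains 8 x 8 threads

example_kernel<<<grid, block>>>(d_out);

// with the optional parameters: 1024 bytes of shared memory
// allocated per block, and stream 0
example_kernel<<<grid, block, 1024, 0>>>(d_out);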

4. Built – in Variables

There are four built-in variables that have been introduced to extend the C/C++ language. All of them are mainly associated with thread/block addressing:

  • gridDim -> specifies the dimensions of the grid. Type: dim3
  • blockDim -> specifies the dimensions of each block. Type: dim3
  • blockIdx -> the unique index (ID) of the block within the grid. Type: uint3
  • threadIdx -> the unique index (ID) of the thread within its block. Type: uint3
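
A typical use of these built-in variables, sketched below with made-up names, is to compute a unique global index for each thread across the whole grid (one-dimensional case):

__global__ void process(float *data, int n)
{
    // global index of this thread across all blocks of the grid
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = 2.0f * data[i];   // each thread handles one element
}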

Compiler

The compiler used to compile CUDA C code is NVCC, which is based on the open-source Open64 compiler released under the GNU license. Open64 has been further developed by the company PathScale, specifically for x86-64 and Itanium processors, which is the main reason why it is optimised for parallelism.

Two important compiler features are the #pragma unroll directive and the -use_fast_math option, both of which can result in better performance when used appropriately.

#pragma unroll x: when written before a loop, the loop that follows is unrolled depending on the optional number 'x' after #pragma unroll. There are three cases (a short sketch follows the list):

  1. x = 1: the loop that follows is not unrolled.
  2. x = n, where n > 1: the loop body is unrolled n times.
  3. no x given: the loop is unrolled completely.
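
A short sketch of the three cases inside a made-up kernel:

__global__ void unroll_example(const float *values, float *out)
{
    float sum = 0.0f;

    #pragma unroll          // no number: the loop is unrolled completely
    for (int k = 0; k < 16; ++k)
        sum += values[k];

    #pragma unroll 4        // the loop body is unrolled 4 times
    for (int k = 0; k < 16; ++k)
        sum += values[k];

    #pragma unroll 1        // the loop is not unrolled
    for (int k = 0; k < 16; ++k)
        sum += values[k];

    out[threadIdx.x] = sum;
}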

-use_fast_math is useful whenever the developer cares more about the performance of the written code than about its accuracy. It speeds up maths calculations at the cost of less accurate results. The fast device maths functions usually start with a double underscore, for example __sinf().
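
A minimal sketch (kernel and parameter names are made up) of using such an intrinsic:

__global__ void fast_sine(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = __sinf(in[i]);   // faster but less accurate than sinf(in[i])
}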

Built-in Types

The built-in types are:

char1, uchar1, char2, uchar2, char3, uchar3, char4, uchar4, short1, ushort1, short2, ushort2, short3, ushort3, short4, ushort4,  int1, uint1, int2, uint2, int3, uint3, int4, uint4, long1, ulong1, long2, ulong2, long3, ulong3, long4, ulong4, float1, float2, float3, float4,

where the first part indicates the type as we know it (in C, C++ or Java) – char, int, float, … – and the number indicates how many components the type holds. For instance, the type float4 is a structure containing four floats, which are accessible through the fields x, y, z and w.

As an example, we declare a variable of type float4:

float4 variable_name = make_float4 (x, y, z, w);

to access x, y, z or w we write variable_name.x, variable_name.y, variable_name.z or variable_name.w, where variable_name should be replaced by the actual name of the variable, of course.
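
A small sketch of float4 in device code (the helper name is made up):

__device__ float4 add_float4(float4 a, float4 b)
{
    // component-wise addition through the x, y, z and w fields
    return make_float4(a.x + b.x, a.y + b.y, a.z + b.z, a.w + b.w);
}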

Example Code (serial -> parallel)

The following is an example of simple code in CUDA C. The function takes as input an integer n, a float a and two float pointers x and y, and then stores in each cell of y the value a*x + y. Also shown is how this function is called.

(Source: SuperComputing Tutorial 2009 Introduction by David Luebke)
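
The referenced figure is not reproduced here; based on the description above, the serial version looks roughly like the following sketch (function and variable names are illustrative):

void saxpy_serial(int n, float a, float *x, float *y)
{
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];   // the elements are handled one after the other
}

// invocation: saxpy_serial(n, 2.0f, x, y);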

To convert this code to parallel we need to work on two things. First, we need to divide the work among the threads, so that each thread executes one of the y[i] = a*x[i] + y[i] operations. To do so, each thread needs to be uniquely addressed and given the piece of work that belongs to it. The second thing is the function call: we need to add the execution configuration (mentioned earlier) to the call so that it is known how many blocks and threads to allocate for this function. The following figure shows the code implemented in parallel.

(Source: SuperComputing Tutorial 2009, Introduction by David Luebke)
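
This figure is likewise not reproduced here; following the description in the next paragraph, the parallel version is roughly:

__global__ void saxpy_parallel(int n, float a, float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // unique ID doubles as array index
    if (i < n)
        y[i] = a * x[i] + y[i];
}

// invocation with the execution configuration:
// int nblocks = (n + 255) / 256;
// saxpy_parallel<<<nblocks, 256>>>(n, 2.0f, x, y);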

In the parallel implementation of saxpy (Scalar Alpha X Plus Y), int i serves as the ID of the thread and also corresponds to the location in the arrays to be read from: thread i reads the i-th position of arrays x and y. As for the execution configuration, the function is given nblocks as the number of blocks, where nblocks is an integer depending on n, and the number of threads within each block is fixed to 256.

Optimisation

To execute a single thread, the thread reads its instructions and data off the memory (global or shared, depending on the instruction), then the actual execution happens, and then the result is written back to the memory. This means there is constant reading from and writing to memory, which is why the throughput depends heavily on the bandwidth between the threads and the memories, and also on the bandwidth between the CPU and the GPU. Another aspect that always limits performance is memory latency; as discussed before, the latency increases the higher the memory level is (global memory has a higher latency than shared memory), which is why it is smart to avoid accessing the higher-level memories whenever possible. As said before, issuing an access to the shared or the global memory takes 4 clock cycles, while 400 – 600 additional clock cycles are consumed by global memory latency. Sometimes it is even more beneficial to recompute data rather than cache it, because of this latency.

Another thing that hurts the total performance of the code are "if, while, do, for and switch" statements, because they can make the execution paths within a warp diverge. A diverged warp is no longer executed in parallel; it must be serialised, and after serialisation the diverged warp must be synchronised again, which adds more instructions to be executed. Possible remedies are loop unrolling (as discussed earlier) and using branch predication instead of 'if' and 'switch' statements whenever possible.
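
A small sketch (made-up kernel) of a branch that diverges within a warp, because even and odd threads of the same warp take different paths:

__global__ void divergent_kernel(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x % 2 == 0)
        data[i] = 2.0f * data[i];   // path taken by the even threads
    else
        data[i] = data[i] + 1.0f;   // path taken by the odd threads
    // the two paths are serialised and the warp is re-synchronised afterwards
}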

Coalesced Memory

Another important thing to take care of while writing code is coalesced versus uncoalesced memory access. An access is coalesced when each thread reads from the address that corresponds to it: if the base address of a certain access is n, then thread i must access the address n + i * s, where s is the size of the accessed type and must be 4, 8 or 16 bytes. Any other scheme results in an uncoalesced access. The following figure shows a coalesced access pattern.

(Source: NVIDIA CUDA Compute Unified Device Architecture Programming Guide Version 1.1)

Both scenarios in the figure are examples of coalesced access; the right part shows a coalesced access in which some threads do not participate, which results in a (relatively) insignificantly worse performance.

An example of an uncoalesced access is shown in the next figure.

(Source: NVIDIA CUDA Compute Unified Device Architecture Programming Guide Version 1.1)

In the left example Thread 3 and Thread 4 are reading from the wrong addresses, while in the right example all threads are shifted by 4 bytes: the base address is 128, so Thread 0 should be reading from address 128.
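
As a rough code-level illustration of the two situations (kernel names are made up; the accesses are floats, so the element size is 4 bytes):

__global__ void copy_coalesced(float *out, const float *in)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];       // thread i reads base + i * 4 bytes: coalesced
}

__global__ void copy_shifted(float *out, const float *in)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i + 1];   // every thread is shifted by 4 bytes: uncoalesced
                          // (in is assumed to hold one extra element)
}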

To have a concrete and strong argument for why coalesced access outperforms uncoalesced access, consider the results of reading 12M floats in three ways:

1. coalesced: 356 us

2. coalesced (some threads do not participate): 357 us -> that is why the decrease in performance was labelled as “(relatively) insignificant”

3. uncoalesced: 3494 us

Banks

The shared memory is the fastest memory on a CUDA device after the threads' private registers. This speed is partly a result of dividing the shared memory into banks, which allows several threads to access the shared memory simultaneously. A problem that arises from banks is bank conflicts, which are the result of either two or more threads trying to access the same bank, or of accessing elements whose size is not equal to 32 bits (4 bytes).

The first cause of bank conflicts is obvious: a bank can only serve a single thread at a time, so when two or more threads try to gain access, one thread is served and the rest are serialised. As for the second cause, consider the following: an array ELEMENTS is stored in the shared memory, where the size of an element is 8 bits (1 byte), which means that 4 elements are stored in one bank. Assume thread i accesses ELEMENTS[i], which is stored in bank number j, and thread i+1 accesses ELEMENTS[i+1]. Typically, when dealing with 32-bit elements, ELEMENTS[i+1] would be stored in bank number j+1. But since the elements of ELEMENTS are only 8 bits wide, ELEMENTS[i+1] is stored in the same bank as ELEMENTS[i], namely bank number j. So in the end both thread i and thread i+1 try to access the same bank, although for different elements.
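
A sketch of the second cause with 8-bit elements in shared memory (made-up kernel, assumed to run with a single block of 256 threads):

__global__ void bank_conflict_example(char *out)
{
    // 8-bit elements: four consecutive chars share one 32-bit bank, so
    // threads with consecutive indices in the same warp hit the same bank
    __shared__ char ELEMENTS[256];

    ELEMENTS[threadIdx.x] = (char)threadIdx.x;
    out[threadIdx.x] = ELEMENTS[threadIdx.x];
}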

CUDA Libraries

CUBLAS: CUDA accelerated Basic Linear  Algebra Subprograms

CUFFT: CUDA Fast Fourier Transform

MAGMA: Matrix Algebra on GPU Multicore  Architectures

CULA: implementation of LAPACK interface

CUDA Tools

CUDA – gdb Debugger

CUDA – Memory Checker

CUDA – Visual Profiler

NEXUS – NVIDIA IDE

More than C/C++

In order to allow a wider spectrum of developers to code in CUDA, the range of languages that can be compiled to CUDA has been broadened. There are converters from Fortran to CUDA (NOAA F2C-ACC), Python to CUDA (PyCUDA) and Java to CUDA (jaCUDA). Unfortunately, code generated using these converters is not fully optimised, so some manual optimisation is still needed to obtain optimised CUDA code when working with such converters.