programming models


In my earlier blog post I quickly went through how CPUs and GPUs each scale their performance. I also mentioned how the APU tries to harness the best of both worlds. This time, let me quickly go through a simple example and show why APUs are an excellent platform for solving it.

Consider the problem of parallel summation across a very large array. How would you solve this problem on a CPU? Here is the pseudo code:

  1. Take an input array.
  2. Block it based on the number of threads (usually one per core – 4 or 8 cores).
  3. Iterate to produce a sum in each block.
  4. Reduce across threads.
  5. Vectorize your execution step through the SIMD ISA.

Have a look at the code below:

  // Per-thread summation over this thread's block
  float4 sum(0, 0, 0, 0);   // 4-wide SIMD accumulator
  for (i = (n / threads_num) * thread_num to (n / threads_num) * (thread_num + 1))
      sum += input[i];      // input viewed as float4 elements
  float t_sum = sum.x + sum.y + sum.z + sum.w;

  // Reduction stage to aggregate the threads' results
  float reductionValue(0);
  for (t = 0; t < threads_num; t++)
      reductionValue += t_sum[t];   // t_sum of thread t
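As a rough, runnable illustration of the blocking and reduction steps listed above, here is a minimal Python sketch. The names `block_bounds`, `partial_sum` and `parallel_sum` are made up for this example, and the SIMD step is left out. Note that CPython threads do not execute bytecode truly in parallel, so this only shows the structure of the algorithm, not its speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def block_bounds(n, num_threads, thread_id):
    """Return the half-open index range [start, end) owned by one thread."""
    block = (n + num_threads - 1) // num_threads  # ceiling division
    start = min(thread_id * block, n)
    end = min(start + block, n)
    return start, end


def partial_sum(data, start, end):
    """Iterate over one block and produce its sum."""
    total = 0.0
    for i in range(start, end):
        total += data[i]
    return total


def parallel_sum(data, num_threads=4):
    """Block the input, sum each block on its own thread, then reduce."""
    n = len(data)
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        futures = [
            pool.submit(partial_sum, data, *block_bounds(n, num_threads, t))
            for t in range(num_threads)
        ]
        # Reduction stage: aggregate the per-thread results.
        return sum(f.result() for f in futures)
```

The ceiling division means the last thread may get a shorter (or empty) block when the array length is not a multiple of the thread count.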

Think now of an efficient implementation on the GPU:

  1. Take the input array.
  2. Block it based on the number of threads (16 per core; it could be up to 64 per core).
  3. Iterate to produce a sum in each block.
  4. Reduce/Sum across threads.
  5. Vectorize through a different kernel call due to the limitations of the current execution models.

  // Per-thread summation over this thread's block
  float64 sum(0, …, 0);     // 64-wide accumulator
  for (i = (n / threads_num) * thread_num to (n / threads_num) * (thread_num + 1))
      sum += input[i];

  // Reduction stage to aggregate the threads' results
  float reductionValue(0);
  for (t = 0; t < threads_num; t++)
      reductionValue += t_sum[t];   // t_sum of thread t

They don’t look so different from each other, right? You basically do the same steps; the main differences are the number of cores and the number of threads. On GPUs you have way more threads to do the summation, which may complicate your model. In addition, these many threads bring a lot of state management overhead, context switching, and problematic stack management. On CPU cores you get data parallelism through a limited number of cores and threads. Narrow SIMD units simplify the problem. High clock rates and caches make serial execution efficient for each single thread. The simple mapping of tasks to threads also allows us to create complex task graphs. However, this comes at the cost of many loop iterations per thread. In other words, GPUs support very fine-grained data-parallel execution, while CPUs provide a coarse-grained data-parallel execution model.

APUs combine these by supporting nested data-parallel code. Basically, CPUs take coarse-grained tasks and break them down for the on-chip GPU, which executes the finer-grained tasks faster. Close coupling of the CPU and GPU eliminates the cost of moving data between them to execute this nested data-parallel model. Also, CPUs handle conditional data-parallel execution much better than GPUs, and offloading computations becomes more efficient since it involves virtually zero data copying.

Applications can now combine high and low degrees of threading at almost zero cost. Interesting execution models also become possible: you can have multiple kernels executing simultaneously and communicating through shared buffers with relatively low synchronization overhead. So, back to our example: we can now divide our array among the four CPU cores, and each core can then offload the summation to GPU threads and do the reduction at its level. Finally, all the CPU cores synchronize and do the final reduction with very low overhead.
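To make the nested idea concrete, here is a small Python sketch of the same summation done in two levels. It is only a structural stand-in: `fine_grained_sum` plays the role of the on-chip GPU, and the names `split`, `nested_sum`, `cpu_cores` and `gpu_lanes` are assumptions made for this example, not part of any APU API.

```python
from concurrent.futures import ThreadPoolExecutor


def split(data, parts):
    """Split a list into at most `parts` contiguous slices."""
    size = max(1, (len(data) + parts - 1) // parts)
    return [data[i:i + size] for i in range(0, len(data), size)]


def fine_grained_sum(block, lanes):
    """Stand-in for the GPU: many small lanes each sum a tiny chunk."""
    lane_sums = [sum(chunk) for chunk in split(block, lanes)]
    return sum(lane_sums)  # reduction at the level of one CPU core


def nested_sum(data, cpu_cores=4, gpu_lanes=64):
    """Coarse split across CPU cores, fine split inside each core's block."""
    blocks = split(data, cpu_cores)
    with ThreadPoolExecutor(max_workers=cpu_cores) as pool:
        core_sums = list(pool.map(lambda b: fine_grained_sum(b, gpu_lanes), blocks))
    return sum(core_sums)  # final reduction across the CPU cores
```

Because both levels work on slices of the same list, nothing is copied between "devices" here, which loosely mirrors the zero-copy argument made above.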

So, these are the possibilities on the APU architecture.

The question now is: how can we easily use all these capabilities without sacrificing performance? Moving from explicit data movement between CPUs and GPUs to shared memory spaces is tricky. CPUs use an explicit vector ISA and explicit memory access patterns, but GPUs depend on implicit vectors: multiple threads scheduled to access adjacent memory locations simultaneously. How can these two models be targeted by an easy, clear programming model with acceptable efficiency and true shared memory, so that we can freely pass pointers between the CPU and GPU cores? This will be my next blog post. Stay tuned!


As I’m heading home after three exciting days at AMD’s Fusion Developer Summit 2011, I’d like to share the findings, thoughts and ideas I got out of this event. It had five fascinating tracks, each with around 10 sessions over the four days. The Programming Models track was the most interesting and exciting, at least to me. It is tightly coupled with the new AMD Fusion System Architecture (FSA) and brought with it a lot of new concepts. I can also see a lot of interesting challenges.

Let me take you through a series of posts sharing the excitement of these new innovations from AMD. I’ll start with a quick background on why APUs are a good answer to many computation problems, and then I’ll talk about their programming model.

So, the Fusion architecture is a reality now. It starts the era of heterogeneous computing for the common end-user. It combines heavy-lifting x86 cores with super-fast, simpler GPU cores on the same chip. You have probably come across articles or research papers advertising the significant performance improvement that GPUs offer compared to CPUs. Such speedups are often partly the result of poorly optimized CPU code combined with the inherently massive parallelism of the algorithms.

The APU architecture offers a balance between these worlds. GPU cores are optimized for arithmetic workloads and latency hiding. CPU cores, however, deal with branchy code, for which branch prediction and out-of-order execution are so valuable. They were built with different design goals in mind:

  • CPU design is based on maximizing the performance of a single thread. CPUs spend their transistor budget (or chip area) on branch prediction, out-of-order execution, extensive caching, and deep pipelines.
  • GPU design aims to maximize throughput at the cost of lower performance for each thread. GPUs spend the area on more cores of simpler design by not implementing branch prediction, out-of-order execution, or large caches.

Hence, these architectures hide memory latency in different ways.

So, in the CPU world memory stalls are costly and harder to cover. Because of the several levels of the cache hierarchy, it takes many cycles to service a cache miss. That’s why a larger cache is necessary to reduce memory stalls. Out-of-order execution also keeps the pipeline busy doing useful computation while cache misses are being served for other instructions.

GPUs, however, use different techniques to hide memory latency. They issue an instruction over multiple cycles; for example, a large vector executes on a smaller vector unit. This reduces instruction decode overhead and improves throughput. Executing many threads concurrently and interleaving their instructions fills the gaps in the instruction stream. So GPUs depend on the aggregate performance of all executing threads rather than on reducing the latency of a single thread. The GPU’s cache, however, is designed to exploit spatial locality rather than temporal locality. That’s why GPUs are very efficient at retrieving large vectors through the many memory banks they offer for SIMD-style data fetching.

So choosing either of these two worlds comes with a cost. For example, the CPU’s large caches (to maximize the number of cache hits) and its support for out-of-order execution consume much of the transistor budget available on the chip. GPUs, however, cannot handle branchy code efficiently; they are most effective on massively parallel algorithms that can be expressed as vectors and many independent threads. So each one suits a specific type of algorithm or problem domain. For a concrete case study, have a look at the comparison below between representatives of the CPU and GPU sides.

AMD Phenom II (x86):
  • 6 cores, 4-way SIMD (ALUs)
  • A single set of registers per core
  • Deep pipeline supporting out-of-order execution

AMD Radeon HD6070:
  • 24 simple cores, 16-way SIMD
  • 64-wide SIMD state (threads count per CU)
  • Multiple register sets shared
  • 8 or 16 SIMD engines per core

And this is when the Eureka! moment came to AMD’s engineers and researchers: to reconsider the design of microprocessors and create the Accelerated Processing Unit (APU). Combining both architectures on a single chip may solve many problems efficiently, especially multimedia and gaming workloads. The E350 APU, for example, combines two “Bobcat” cores and two “Cedar”-like cores, which include 2-wide and 8-wide SIMD engines, on the same chip!

So let me take you through an example in my next post to quickly show you the current and future models on these APUs. I’ll also be writing about the run-time models, the software ecosystem of APUs, and the roadmap of the AMD Fusion System Architecture (FSA).

Introduction:-

The aim of technology has always been solving problems facing humanity. The aim isn’t only to solve problems, but to solve them efficiently in the least possible time. With time, problems become more complex to solve (i.e., better hardware is needed to solve them quickly). In the field of Computer Science, the hardware component responsible for how fast a problem is solved is the processor.

In the past, hardware corporations like Intel used to introduce a faster processor each year. But at some point they figured out that they could no longer keep making uni-core processors faster simply by adding transistors and raising clock speed, so faster uni-core processors were no longer feasible.

Hardware corporations found that the way to introduce faster processors was to introduce multi-core processors. This created a problem: earlier programming models did not support programming multi-core processors. So, new programming models and languages were introduced to let programmers exploit the presence of multiple cores in one processor. This article demonstrates one of these new models, which is Cilk++.

Cilk++ VS C++:-

Cilk development started in 1994 in one of the MIT labs, and Cilk++ builds on that work. Cilk++ is based on C++, so writing Cilk++ is exactly the same as writing C++, with the added ability to program parallel applications using a few new keywords introduced specifically to enable parallel programming. The keywords described in this article are: cilk, spawn, sync, inlet and abort (Cilk++ itself spells the two most common ones cilk_spawn and cilk_sync). The “cilk” keyword is added to the header of any function to notify the compiler that this function may include parallel programming (i.e., one of the other keywords might be used in this function). Below is a detailed description of the role of each of the other keywords.



The “spawn” keyword can be added to the start of any line of code to notify the runtime that it may execute this line on a separate core of the processor if one is available. This line of code might be a call to a function or even a set of functions, giving the ability to run a piece of code in parallel.



The “sync” keyword is closely related to the “spawn” keyword. After one or more lines of code have been “spawned”, the “sync” keyword can be placed afterwards to notify the runtime that it should stop executing the code until all spawned processes finish processing. The advantage of “sync” is that it synchronizes parallel processes in order to ensure safe code (i.e., no problem will take place as a result of a process demanding resources that are being used by another process running in parallel). Below is a figure showing the functionality of “sync”.
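The spawn/sync pattern can be imitated outside Cilk with ordinary futures. The following Python sketch is only an analogy under that assumption: `pool.submit` stands in for “spawn”, waiting on the results stands in for “sync”, and the helper names `square` and `sum_of_squares` are invented for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def square(x):
    return x * x


def sum_of_squares(a, b):
    """Run two independent calls as child tasks, then wait for both."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        left = pool.submit(square, a)   # like: x = spawn square(a)
        right = pool.submit(square, b)  # like: y = spawn square(b)
        # like: sync (block until every spawned child has finished)
        x, y = left.result(), right.result()
    return x + y
```

Reading `x` or `y` before the wait would be the kind of bug that “sync” exists to prevent.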


The “inlet” keyword is used in advanced programming. Whenever a parent process spawns other processes, these child processes are supposed to return results to the parent process. An inlet is used to make sure that no write conflicts take place as a result of multiple child processes writing to a variable in the same parent process. An “inlet” is like a function that can be written inside another function to control how the return values of child processes are written to the parent process, and to ensure that only one child process can write to the return value of the parent process at a time. The code below shows how “inlet” can be used in a function that calculates Fibonacci numbers.

cilk int fib (int n)
{
    int x = 0;

    inlet void summer (int result)
    {
        x += result;
        return;
    }

    if (n<2) return n;
    else {
        summer(spawn fib (n-1));
        summer(spawn fib (n-2));
        sync;
        return (x);
    }
}


The “abort” keyword is also used in advanced programming. Sometimes a process is spawned into multiple processes, and at some point the result is reached and no further processing by the parallel processes is needed, so the abort keyword is used to stop the unneeded processing of these other processes. For example, if a search algorithm works by having multiple processes work in parallel on different parts of an array of integers, and one of the processes finds the element the algorithm is searching for, it notifies the runtime to stop the execution of the rest of the processes in order not to waste time and resources.
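The search scenario above can be sketched in Python with a shared flag. This is cooperative cancellation, which is an assumption of this sketch and weaker than Cilk’s “abort”: each worker has to check the flag itself. The names `search_block` and `parallel_search` are hypothetical.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def search_block(data, start, end, target, found):
    """Scan one block; stop early once any worker has found the target."""
    for i in range(start, end):
        if found.is_set():
            return None          # another worker already succeeded: give up
        if data[i] == target:
            found.set()          # tell the other workers to stop
            return i
    return None


def parallel_search(data, target, workers=4):
    """Return an index of `target` in `data`, or None if it is absent."""
    found = threading.Event()
    block = (len(data) + workers - 1) // workers
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(search_block, data, w * block,
                        min((w + 1) * block, len(data)), target, found)
            for w in range(workers)
        ]
        hits = [f.result() for f in futures if f.result() is not None]
    return hits[0] if hits else None
```

If the target occurs more than once, which index is returned depends on thread timing, so the sketch is best read with unique elements in mind.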








Parallelism is a metric that shows how much an algorithm can benefit from Cilk++ compared with running the same algorithm in the ordinary sequential way. To calculate the parallelism of an algorithm, both its work and its depth have to be calculated. Work is the total number of operations done by the algorithm. Depth is the length of the longest sequential dependence chain. Parallelism is the value of “work” divided by the value of “depth”. The figure below shows an example of parallelism.
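As a worked example of these definitions, the Python sketch below counts work and depth for a recursive Fibonacci function whose base case is n <= 2 and whose two recursive calls run in parallel. Counting one unit of work per call is a simplifying assumption, and the function names are made up for this illustration.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def work(n):
    """Total number of fib calls made for input n (one unit per call)."""
    if n <= 2:
        return 1
    return 1 + work(n - 1) + work(n - 2)


@lru_cache(maxsize=None)
def depth(n):
    """Longest chain of dependent calls when both recursive calls are spawned."""
    if n <= 2:
        return 1
    return 1 + max(depth(n - 1), depth(n - 2))


def parallelism(n):
    """Parallelism = work / depth."""
    return work(n) / depth(n)
```

For n = 10 this gives a work of 109 calls and a depth of 9, so the parallelism is roughly 12: on average about twelve calls could be in flight at once, which bounds the useful number of cores.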

Cilk work-stealing scheduler:-


The Cilk work-stealing scheduler is one of the most beneficial features of Cilk. Its aim is to ensure maximum utilization of the processor’s capabilities so that programs execute in a shorter time. For example, take a processor with X cores and a program running X processes on these cores, waiting for all of them to finish in order to output the result. If one of these processes finishes before the others, it starts taking tasks from another process’s queue of pending work, stealing from the end opposite to the one the owner works on (the end holding the oldest tasks). The figure below shows two processes running in parallel on two different cores, where one of them finishes before the other and then uses the work-stealing feature to help the unfinished process with its work.
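The effect of work stealing can be seen in a tiny lock-step simulation. The Python sketch below is a toy model, not the Cilk scheduler: every task costs one step, and an idle worker always robs the fullest queue, whereas real Cilk picks its victim at random. The function name `simulate` is an assumption of this example.

```python
from collections import deque


def simulate(task_lists, steal=True):
    """Simulate workers in lock-step; return (steps, tasks done per worker).

    A worker pops from the bottom (right end) of its own deque. An idle
    worker steals from the top (left end) of the fullest other deque.
    """
    deques = [deque(tasks) for tasks in task_lists]
    done = [0] * len(deques)
    steps = 0
    while any(deques):
        steps += 1
        for w, own in enumerate(deques):
            if own:
                own.pop()                      # take the newest local task
                done[w] += 1
            elif steal:
                victim = max(deques, key=len)  # deterministic victim choice
                if victim:
                    victim.popleft()           # steal the oldest task
                    done[w] += 1
    return steps, done
```

With eight tasks on one worker and two on the other, the run takes 8 steps without stealing and 5 steps with it, because the early finisher picks up three of its neighbour’s tasks.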

Practical examples:-


Calculating Fibonacci numbers is a good example of an operation that can run in much less time if processed in parallel on multiple cores of a processor. As shown below in the code and figure, the recursive calls of the function are processed in parallel, which results in executing the function in much less time.

Without parallelization:

int fib (int n)
{
    if (n <= 2) return 1;
    else
    {
        int x, y;
        x = fib (n-1);
        y = fib (n-2);
        return x + y;
    }
}

With parallelization:

int fib (int n)
{
    if (n <= 2) return 1;
    else
    {
        int x, y;
        x = cilk_spawn fib (n-1);
        y = cilk_spawn fib (n-2);
        cilk_sync;
        return x + y;
    }
}


Quick Sort:

Quick sort is regarded as one of the efficient algorithms for sorting data according to a specific criterion. It is one of those algorithms that can run in much less time if implemented using Cilk++ to run on multi-core processors. The idea behind quick sort is to take the first element of the data (the pivot), put it in its correct position, and then repeat the same thing for the items less than it and the items greater than it until all the elements are sorted. Below is the algorithm, along with a figure demonstrating how a multi-core processor is utilized to run it when implemented using Cilk++.


public static void QuickSort(index low, index high)
{
    if (low < high) then
    {
        index p = Partition(low, high);
        cilk_spawn QuickSort (low, p - 1);
        cilk_spawn QuickSort (p + 1, high);
        cilk_sync;
    }
}

public static index Partition(index low, index high)
{
    index i, j, p;
    keytype pivot;

    pivot = S[low];
    j = low;
    for (i = low + 1; i <= high; i++) do
        if (S[i] < pivot) then
        {
            j++;
            Exchange S[i] and S[j];
        }
    p = j;
    Exchange S[low] and S[p];
    return p;
}


Cilk++ VS OpenMP:-

OpenMP is another programming model for parallel programming. Below is a diagram showing the difference in execution time between Cilk++ and OpenMP when executing the quick sort code for both the adaptive and parallel models.


Cilk++ is available for Windows and Linux operating systems and can be downloaded from the following link:

Cilk++ for Mac was a project under development that was never completed and is no longer being developed.






OpenCL was initially developed by Apple, which holds the trademark rights, in collaboration with technical teams at AMD, IBM, Intel, Nvidia, Ericsson, Nokia, Texas Instruments, and Motorola. Apple submitted this initial proposal to the Khronos Group. On June 16, 2008, the Khronos Compute Working Group was formed with representatives from CPU, GPU, embedded-processor, and software companies. This group worked for five months to finish the technical details. On December 8, 2008, OpenCL was released.


If you’ve never heard of OpenCL, you need to stop whatever you’re doing and read ahead. We all know that multi-threaded applications have not been as abundant as we had hoped. Of those precious few applications that are multi-core aware, few leverage the full potential of even two cores. That is why OpenCL was developed: to standardize parallel programming and execution.

The OpenCL architecture shares a range of computational interfaces with two competitors, Nvidia’s Compute Unified Device Architecture (CUDA) and Microsoft’s DirectCompute.

What is OpenCL?

OpenCL is a framework for writing programs that execute across heterogeneous platforms consisting of CPUs, GPUs, Cell, DSPs, and other processors. It includes a language for writing kernels (the functions that run on the compute devices), plus APIs that are used to define and then control the platforms.

The Khronos Group hopes that OpenCL will do for multi-core what OpenGL did for graphics and what OpenAL is beginning to do for audio, and that’s exactly what OpenCL achieved. OpenCL improved speed for a wide spectrum of applications, from gaming and entertainment to scientific and medical software.


How does OpenCL work?

OpenCL includes a language for writing compute kernels and APIs for creating and managing these kernels. The kernels are compiled by a runtime compiler, which compiles them on the fly for the targeted device while the host application is running. This enables the host application to take advantage of all the compute devices in the system.

Platform Model

One of OpenCL’s strengths is that this model does not specify exactly what hardware constitutes a compute device. Thus, a compute device may be a GPU, or a CPU.

OpenCL sees today’s heterogeneous world through the lens of an abstract, hierarchical platform model. In this model, a host coordinates execution, transferring data to and from an array of Compute Devices. Each Compute Device is composed of an array of Compute Units, and each Compute Unit is composed of an array of Processing Elements.

OpenCL Anatomy

The platform layer API gives the developer access to routines that query for the number and types of devices in the system. The developer can then select and initialize the necessary compute devices to properly run their work load. It is at this layer that compute contexts and work-queues for job submission and data transfer requests are created.

The runtime API allows the developer to queue up compute kernels for execution and is responsible for managing the compute and memory resources in the OpenCL system.

OpenCL Memory Model
OpenCL defines four  memory spaces: private, local, constant and global.

Private memory is memory that can only be used by a single work-item. This is similar to the registers in a single compute unit or a single CPU core.

Local memory is memory that can be used by the work-items in a work-group. This is similar to the local data share that is available on the current generation of AMD GPUs.

Constant memory is memory that can be used to store constant data for read-only access by all of the compute units in the device during the execution of a kernel. The host processor is responsible for allocating and initializing the memory objects that reside in this memory space. This is similar to the constant caches that are available on AMD GPUs.

Global memory is memory that can be used by all the compute units on the device. This is similar to the off-chip GPU memory that is available on AMD GPUs.


The Execution Model

There are three basic components of executable code in OpenCL: kernels, programs, and the application’s kernel queues.

A compute kernel is the basic unit of executable code and can be thought of as similar to a C function. Each running instance of a kernel is called a work-item, and each work-item has a unique ID.

Execution of such kernels can proceed either in-order or out-of-order depending on the parameters passed to the system when queuing up the kernel for execution. Events are provided so that the developer can check on the status of outstanding kernel execution requests and other runtime requests.

In terms of organization, the execution domain of a kernel is defined by an N-dimensional computation domain. This lets the system know how large of a problem the user would like the kernel to be applied to.

Each element in the execution domain is a work-item and OpenCL provides the ability to group together work-items into work-groups for synchronization and communication purposes.

Executing Kernels, Work-Groups and Work-Items

A program is a collection of kernels and other functions.

The application’s kernel queues hold kernels that are enqueued in order and then executed either in order or out of order.

Since OpenCL is meant to target not only GPUs but also other accelerators, such as multi-core CPUs, flexibility is given in the type of compute kernel that is specified. Compute kernels can be thought of either as data-parallel, which is well-matched to the architecture of GPUs, or task-parallel, which is well-matched to the architecture of CPUs.

Data parallelism:
Focuses on distributing the data across different parallel computing nodes.

To achieve data parallelism in OpenCL:

1. Define an N-dimensional computation domain

  • Each independent element of execution in the N-D domain is called a work-item
  • The N-D domain defines the total number of work-items that execute in parallel (the global work size)

2. Work-items can be grouped together into a work-group

  • Work-items in a group can communicate with each other
  • We can synchronize execution among the work-items in a group to coordinate memory access

3. Execute multiple work-groups in parallel

Example of data parallelism in OpenCL:
[Figure: Data parallelism]
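The relationship between work-groups, work-items and the global domain can be emulated on the host in a few lines. The Python sketch below is not OpenCL code: it only mimics a 1-D launch of a vector-addition kernel, and the helper `get_global_id` here is a plain function written for this example that mirrors how a global index is derived when there is no global offset.

```python
def get_global_id(group_id, local_id, local_size):
    """Global index of a work-item in a 1-D domain."""
    return group_id * local_size + local_id


def vector_add(a, b, local_size=4):
    """Emulate a 1-D kernel launch: one work-item per output element."""
    global_size = len(a)
    assert len(b) == global_size
    # OpenCL 1.x requires the global size to be a multiple of the local size.
    assert global_size % local_size == 0
    c = [0] * global_size
    num_groups = global_size // local_size
    for group_id in range(num_groups):      # work-groups (parallel on a device)
        for local_id in range(local_size):  # work-items inside one work-group
            gid = get_global_id(group_id, local_id, local_size)
            c[gid] = a[gid] + b[gid]        # body of the kernel
    return c
```

On a real device the two loops disappear: the runtime launches every work-item for you, and only the loop body is written as the kernel.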

Task parallelism:

Focuses on distributing execution processes (threads) across different parallel computing nodes.

In OpenCL, this can be achieved by enqueueing kernels as tasks, each running as a single work-item, so that independent tasks can execute concurrently.

OpenCL Objects

  • Setup objects:
      1. Devices: GPU, CPU, Cell.
      2. Context: a collection of devices.
      3. Queues: submit work to the device.
  • Memory objects:
      1. Buffers: blocks of memory.
      2. Image objects: 2D or 3D images.
  • Execution objects:
      1. Programs.
      2. Kernels.

How to submit work to the computing devices in the system?

There are three basic steps to do this:
  1. Compile the programs you wrote.
  2. Create memory objects and buffers, and set the arguments and parameters of each kernel to the desired values.
  3. Use command queues to enqueue those kernels and send the code for execution.
Before these three steps can run, we must know the number and types of devices and hardware we have.
First, query for the devices in the system using clGetDeviceIDs.
Then create a context to put the devices in, so that they can share data and communicate; this is done using clCreateContext.
The last thing to do is create a command queue, which allows us to talk to these devices.
NB: a multi-core device is considered one device.

Simple Example – Vector Addition Kernel

The following is a simple vector addition kernel written in OpenCL. You can see that the kernel specifies three memory objects: two for input, a and b, and a single output, c. These are arrays of data that reside in the global memory space. In this example, each work-item executing this kernel gets its unique work-item ID and uses it to complete its part of the vector addition by reading the appropriate values from a and b and storing the sum into c.

Since, in this example, we will be using online compilation, the above code will be stored in a character array named program_source.

To complement the compute kernel code, the following is the code run on the host processor to:

  • Open an OpenCL context,
  • Get and select the devices to execute on,
  • Create a command queue to accept the execution and memory requests,
  • Allocate OpenCL memory objects to hold the inputs and outputs for the compute kernel,
  • Online compile and build the compute kernel code,
  • Set up the arguments and execution domain,
  • Kick off compute kernel execution, and
  • Collect the results.


It is really hard to decide whether OpenCL will continue or not, but I think the future lies with OpenCL, as it is an open standard that is not restricted to a vendor or specific hardware. Also, AMD is going to release a new processor called Fusion, its forthcoming CPU + GPU product on one hybrid silicon chip.

This processor would be perfect for OpenCL, as OpenCL doesn’t care what type of processor is available, as long as it can be used.


Introduction
While Moore’s Law continues to predict the doubling of transistors on an integrated circuit every 18 months, performance and power considerations have forced chip designers to embrace multi-core processors in place of higher frequency uni-core processors. As desktop and high-performance computing architectures tend towards distributed collections of multi-core nodes, a new parallel programming paradigm is required to fully exploit the complex distributed and shared-memory hierarchies of these evolutionary systems. Recently, a programming model has been developed that has the potential to exploit the best features of this distributed shared-memory architecture. Not only does this model promise improved runtime performance on distributed clusters of SMPs, its data and execution semantics support increased programmer productivity. This model is called the Partitioned Global Address Space (PGAS) model. The PGAS paradigm provides both a data and an execution model that has the potential to dramatically improve runtime performance and programmer productivity on multi-core architectures using shared memory.

Memory Models
There are 2 models for memory usage:

  1. Shared Memory Model.
  2. Distributed Memory Model

Shared Memory Model
The shared-memory programming model typically exploits a shared memory system, where any memory location is directly accessible by any of the computing processes (i.e. there is a single global address space). This programming model is similar in some respects to the sequential single-processor programming model with the addition of new constructs for synchronizing multiple accesses to shared variables and memory locations.

Distributed Memory Model
The distributed-memory programming model exploits a distributed-memory system where each processor maintains its own local memory and has no direct knowledge about another processor’s memory (a “share nothing” approach). For data to be shared, it must be passed from one processor to another as a message.

Why PGAS?
PGAS is the best of both worlds. This parallel programming model combines the performance and data locality (partitioning) features of distributed memory with the programmability and data referencing simplicity of a shared-memory (global address space) model. The PGAS programming model aims to achieve these characteristics by providing:

  1. A local-view programming style (which differentiates between local and remote data partitions).
  2. A global address space (which is directly accessible by any process).
  3. Compiler-introduced communication to resolve remote references.
  4. One-sided communication for improved inter-process performance.
  5. Support for distributed data structures.

In this model, variables and arrays can be either shared or local. Each process has private memory for local data items and shared memory for globally shared data values. While the shared memory is partitioned among the cooperating processes (each process contributes memory to the shared global memory), a process can directly access any data item within the global address space with a single address.

Languages of PGAS
Currently there are three PGAS programming languages that are becoming commonplace on modern computing systems:

  1. Unified Parallel C (UPC)
  2. Co-Array Fortran (CAF)
  3. Titanium

Unified Parallel C (UPC)
It is an explicit parallel extension of ANSI C and a distributed shared-memory parallel programming language. It offers common and familiar syntax and semantics for parallel C with simple extensions to ANSI C. UPC provides standard library functions to move data to and from shared memory, which can be used to move chunks within the shared space or between the shared and private spaces.

UPC Execution Model
A number of threads work independently in SPMD (Single Program, Multiple Data) fashion. MYTHREAD specifies the thread index (0..THREADS-1), and the number of threads is specified at compile time or run time. There is no implicit synchronization among the threads; synchronization happens only when needed. There are 4 mechanisms:

  1. Barriers: blocking and non-blocking.
  2. Locks: to protect data against multiple writers.
  3. Memory consistency control: governs the ordering of shared operations.
  4. Fence: equivalent to a null strict reference; it ensures that all shared references issued before it are complete.

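A quick Example

  //vect_add.c
  #include <upc_relaxed.h>
  #define N 100*THREADS

  shared int v1[N], v2[N], v1plusv2[N];  // default layout: element i has affinity to thread i%THREADS

  int main(void) {
    int i;
    for (i = 0; i < N; i++)
      if (MYTHREAD == i%THREADS)         // each thread adds only the elements it owns
        v1plusv2[i] = v1[i] + v2[i];
    return 0;
  }

UPC Runtime model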
The figure shows the high-level system diagram for a UPC application compiled using the Berkeley UPC compiler. The generated C code runs on top of the UPC runtime system, which provides platform independence and implements language-specific features such as shared memory allocation and shared pointer manipulation. The runtime system implements remote operations by calling the GASNet communication interface, which provides hardware-independent lightweight networking primitives.

UPC Memory model
A shared pointer can reference all locations in the shared space, while a private pointer may reference only addresses in its private space or in its portion of the shared space. Static and dynamic memory allocations are supported for both shared and private memory.

UPC pointers

There are 4 different ways of declaring pointers in UPC, and each declares a different type of pointer:

  1. int *p1; This is a private pointer pointing locally. It can be used to access private data or local shared data.
  2. shared int *p2; This is a private pointer pointing into the shared space. It can be used for independent access by threads to data in the shared space.
  3. int *shared p3; This is a shared pointer pointing locally, but it is not recommended.
  4. shared int *shared p4; This is a shared pointer pointing to the shared space. It can be used for common access by all threads to data in the shared space.
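
One way to picture what a pointer into the shared space has to carry is sketched below in plain C. This is a model built on assumptions, not how any UPC compiler represents pointers: the struct and the sp_* helpers are made up for illustration, and the layout assumed is the default cyclic one, where element i of a shared array has affinity to thread i % THREADS.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a pointer into a shared array with the default
   cyclic layout (block size 1). Not a real compiler representation. */
struct shared_ptr {
    int    thread;  /* thread that has affinity to the element */
    size_t local;   /* index within that thread's portion */
};

/* Pointer to element i of a shared array spread over `threads` threads. */
struct shared_ptr sp_at(size_t i, int threads) {
    assert(threads > 0);
    struct shared_ptr p = { (int)(i % (size_t)threads), i / (size_t)threads };
    return p;
}

/* Advance by k elements: consecutive elements sit on consecutive threads,
   wrapping around to the next local index after the last thread. */
struct shared_ptr sp_add(struct shared_ptr p, size_t k, int threads) {
    size_t i = p.local * (size_t)threads + (size_t)p.thread + k;
    return sp_at(i, threads);
}

/* Would a dereference by thread `me` stay in that thread's own memory? */
int sp_is_local(struct shared_ptr p, int me) {
    return p.thread == me;
}
```

With 4 threads, starting at element 0 and advancing by 5 lands on thread 1 at local index 1, which is why stepping through shared space one element at a time keeps hopping between threads.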

Co-Array Fortran (CAF)

CAF is a simple extension to Fortran 90 that allows programmers to write efficient parallel applications using a Fortran-like syntax. It also assumes the SPMD programming model, with replicated data objects called co-arrays. Co-array objects are visible to all processors, and each processor can read and write data belonging to any other processor by setting the index of the co-dimension to the appropriate value. CAF creates multiple images of the same program, where text and data are replicated in each image. It marks some variables with co-dimensions that behave like normal dimensions and express a logical problem decomposition. It also allows one-sided data exchange between co-arrays using a Fortran-like syntax. On the other hand, CAF requires the underlying run-time system to map the logical problem decomposition onto specific hardware.

CAF Syntax

The CAF syntax is a simple parallel extension to normal Fortran syntax: normal round brackets () refer to data in local memory, and square brackets [] refer to data in remote memory.

CAF Execution Model

The number of images is fixed, and each image has its own index, retrievable at run time. Each image executes the same program independently of the others and works on its own local data. An image moves remote data to local data through explicit CAF syntax, while an “object” has the same name in each image. The programmer inserts explicit synchronization and branching as needed.

CAF Memory Model

There are 4 memory models:

  1. One-to-one model.
  2. Many-to-one model.
  3. One-to-many model.
  4. Many-to-many model.

What do co-dimensions mean?

  real :: x(n)[p,q,*]

  • Replicate an array of length n, one on each image.
  • Build a map so each image knows how to find the array on any other image.
  • Organize images in a logical (not physical) three-dimensional grid.
  • The last co-dimension acts like an assumed size array: *
  • A specific implementation could choose to represent memory hierarchy through the co-dimensions.
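
The logical grid can be written down as index arithmetic. The sketch below, in plain C rather than Fortran, assumes that all co-dimension lower bounds are 1 and that images are numbered in Fortran's column-major order; the function names are hypothetical and not part of CAF.

```c
#include <assert.h>

/* Image index (1-based) for co-subscripts [i,j,k] of a co-array declared
   with co-dimensions [p,q,*]; the first co-subscript varies fastest. */
int caf_image_index(int p, int q, int i, int j, int k) {
    assert(p > 0 && q > 0);
    assert(i >= 1 && i <= p && j >= 1 && j <= q && k >= 1);
    return i + (j - 1) * p + (k - 1) * p * q;
}

/* Inverse mapping: recover [i,j,k] from a 1-based image index. */
void caf_cosubscripts(int p, int q, int image, int *i, int *j, int *k) {
    assert(p > 0 && q > 0 && image >= 1);
    int z = image - 1;        /* switch to 0-based arithmetic */
    *i = z % p + 1;
    *j = (z / p) % q + 1;
    *k = z / (p * q) + 1;     /* the * co-dimension is bounded only by the image count */
}
```

With [p,q,*] = [2,3,*], co-subscripts [2,3,2] map to image 12, and the last co-subscript simply grows with the number of images, which is the sense in which it behaves like an assumed-size dimension.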

CAF I/O

There is one file system visible to all images, where an image can open a file alone or as part of a team. The programmer controls access to the file using direct-access I/O and CAF intrinsic functions.

Titanium

Titanium is based on Java, but at compile time it is first translated to C and then to machine code. It has the same SPMD parallelism model as UPC and CAF, but dynamic Java threads are not supported. The Titanium compiler analyzes global synchronization and optimizes pointers, communication and memory. Titanium’s global address space is based on pointers rather than shared variables. There is no distinction between a private and a shared heap for storing objects. Any object may be referenced by global or local pointers.

Titanium features over Java

  • Multi-dimensional arrays: iterators, sub arrays, copying.
  • Immutable “value” classes.
  • Templates.
  • Operator overloading.
  • Scalable SPMD parallelism replaces threads.
  • Global address space with local/global reference distinction.
  • Checked global synchronization.
  • Zone-based memory management (regions).
  • Libraries for collective communication, distributed arrays, bulk I/O, performance profiling.

Titanium Execution Model

Titanium has the same execution model as UPC and CAF. Basic Java programs may be run as Titanium programs, but all processors do all the work. E.g. a parallel hello world:

  class HelloWorld {
    public static void main (String [] argv) {
      System.out.println("Hello from proc " + Ti.thisProc() + " out of " + Ti.numProcs());
    }
  }

Titanium Runtime Model

The latest versions of Titanium include distributed-memory backends that communicate using GASNet, a high-performance communication interface designed especially for SPMD global address-space languages like Titanium (and UPC) that offers better portability and higher-level operations which can leverage hardware-specific features that support a global-address space model. Titanium also supports using Active Messages 2.0 as the standardized networking interface for some of the older cluster-based parallel backends. Active Messages is a low-level, high-performance communication paradigm first proposed by von Eicken et al. that basically amounts to a super-lightweight RPC mechanism, which is generally implemented as a zero-copy, fully user-level protocol that is highly tuned to the networking hardware. Titanium uses several different AM 2.0 implementations for various backends:

  • Lanai AM
  • AMMPI
  • AMUDP
  • AMLAPI

Titanium memory model

The globally shared address space is partitioned, and pointers are either local or global. Global pointers may point to remote locations.

Conclusion

  • UPC is easy to program in for C writers, at times significantly easier than alternative paradigms.
  • UPC exhibits very little overhead when compared with MPI for problems that are parallel.
  • The CAF syntax gives the programmer more control and flexibility.
  • Co-dimensions in the CAF match any problem decomposition.
  • The CAF performance is better than the library based models.

Titanium has all the benefits of Java plus all the features that have been added to handle parallel programming.

Welcome to part 4 of my series summarizing the exascale software roadmap document. This document was produced through a series of meetings by scientists and researchers in different areas of HPC, envisioning the software stack for million-core machines. Such machines, with exascale computing power, are due within this decade. In the last two blog posts I summarized the Systems Software section, which is concerned with operating systems, run-time systems, I/O, and systems management. This post and the next one discuss the exascale project’s vision of the development environment, which includes interesting topics, mainly: programming models, frameworks, compilers, numerical libraries, and debugging tools. I think this section is of great importance to both computer scientists and researchers from other fields of science. It is concerned with the tools used directly to build and implement applications or algorithms. So let’s get started.

Programming Models

Original contributors of this section are: Barbara Chapman (U. of Houston), Mitsuhisa Sato (U. of Tsukuba, JP), Taisuke Boku (U. of Tsukuba, JP), Koh Hotta (Fujitsu), Matthias Mueller (TU Dresden, DE), Xuebin Chi (Chinese Academy of Sciences)

The authors believe that the following technology drivers will significantly affect programming models in this decade:

  • The increased number of nodes and the explosion in the number of cores per node, which require programming models to work at different levels of granularity.
  • Heterogeneity of processors, which makes abstracting that heterogeneity a basic task of programming models.
  • The increased number of components raises the likelihood of failures. Programming models should be resilient to such failures.
  • The changing nature and trends of I/O usage push programming models to take the expected I/O complexities more seriously.
  • Application complexity will increase dramatically. Programming models should simplify parallel programming and help developers focus on the application or algorithm implementation rather than on architecture-related concerns.
  • The increased depth of the software stack requires programming models to detect and report failures at the proper abstraction level.


Based on these foreseen drivers, the following R&D alternatives are available to the community:

  • Hybrid versus uniform programming models. Hybrid models may provide better performance but are very difficult to learn and use. Uniform programming models are easier to program with; however, their abstractions may reduce performance.
  • Domain-specific versus general programming models. Domain-specific models may provide better portability and performance than general models in some application areas.
  • Widely embraced standards versus single implementations. The second option is faster to implement, but the first would provide more support for application developers.

It is very difficult to decide which of these alternatives to choose. However, it is already clear that most HPC systems will be built out of heterogeneous architectures to accelerate the compute-intensive parts of applications. This will impose the use of hybrid programming models such as MPI and OpenMP or MPI and CUDA. According to the authors, the key to successful programming model development is to link existing models for faster and better productivity. Such integration may give the community more ideas about building a new programming model that provides a unified programming interface.


Frameworks

Original contributors of this section are: Michael Heroux and Robert Harrison

Frameworks should provide a common collection of interfaces, tools and capabilities that are reusable across a set of related applications. This is always a challenging task for HPC systems due to their inherent complexity. I think there is some redundancy in this section. The main technology drivers I could get from this section are:

  • New applications will be implemented on top of the exascale systems. Current frameworks should be revisited to satisfy the possible new needs.
  • Scalability and extensibility are very important factors that need reconsideration due to hybrid systems and the variability of applications.

According to the authors, we have two options in this case:

  • No Framework. In this case a single application can be developed faster. However, a lot of redundancy will exist if we adopt that option for all applications running on top of the exascale infrastructure.
  • Clean-Slate Framework. It takes time to develop such frameworks, and the need for one depends on the other components of the exascale software stack. If a revolutionary option is chosen for the other components (e.g. new OS, programming model, etc.), which is less likely to occur, a new framework will be required to link all these components together.

The authors conclude by suggesting two critical directions for a proper framework tying all the exascale software components together:

  1. Identify and develop cross-cutting algorithm and software technologies, which is relatively easy to do, based on the experiences of the last few years on the multi- and many-core architectures.
  2. Refactoring for manycore, which is doable by understanding the common requirements of manycore programming that will be true regardless of the final choice in programming models, such as load balancing, fault tolerance, etc.


Compilers

Original contributors of this section are: Barbara Chapman (U. of Houston), Mitsuhisa Sato (U. of Tsukuba, JP), Taisuke Boku (U. of Tsukuba, JP), Koh Hotta (Fujitsu), Matthias Mueller (TU Dresden), Xuebin Chi (Chinese Academy of Sciences)

Compilers are a critical component in implementing the foreseen programming models. The following technology trends might be the main drivers for compiler design and development for the exascale software stack:

  • Machines will have hybrid processors. Compilers are expected to generate code and collaborate with run-time libraries working on different types of processors at the same time.
  • Memory hierarchies will be highly complex; memory will be distributed across the nodes of exascale systems and will be NUMA within the individual nodes, with many levels of cache and possibly scratchpad memory. Compilers will be expected to generate code that exhibits high levels of locality in order to minimize the cost of memory accesses.

The authors of this section apply the same R&D alternatives to compilers as to programming models. They therefore propose the following research points for compilers (I’m including the important ones):

  • Techniques for the translation of new exascale programming models and languages supporting high productivity and performance, support for hybrid programming models and for programming models that span heterogeneous systems.
  • Powerful optimization frameworks; implementing parallel program analyses and new, architecture-aware optimizations, including power, will be key to the efficient translation of exascale programs.
  • Exascale compilers could benefit from recent experiences with just-in-time compilation and perform online feedback-based optimizations, try out different optimizations, generate multiple code versions or perform more aggressive speculative optimizations.
  • Implement efficient techniques for fault tolerance.
  • Compilers should interact with the development tools and the run-time environment, for example to instrument code automatically for those tools.
  • Compilers may be able to benefit from auto-tuning approaches, may incorporate techniques for learning from prior experiences, exploit knowledge on suitable optimization strategies that is gained from the development and execution environments, and apply novel techniques that complement traditional translation strategies.
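
The ideas of generating multiple code versions and auto-tuning can be illustrated by hand. In the sketch below, two versions of the same summation are timed on sample data and the faster one is selected; a compiler or run-time would do something far more elaborate. All names are hypothetical, and clock() is a coarse timer, so on small inputs the choice is essentially arbitrary.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

typedef double (*sum_fn)(const double *, size_t);

/* Version 1: straightforward loop. */
double sum_simple(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++) s += a[i];
    return s;
}

/* Version 2: unrolled by four with independent accumulators. */
double sum_unrolled(const double *a, size_t n) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i]; s1 += a[i + 1]; s2 += a[i + 2]; s3 += a[i + 3];
    }
    for (; i < n; i++) s0 += a[i];   /* leftover elements */
    return (s0 + s1) + (s2 + s3);
}

/* Time each candidate on sample data and return the index of the fastest. */
size_t autotune_pick(const sum_fn *versions, size_t count,
                     const double *sample, size_t n, int repeats) {
    assert(count > 0 && repeats > 0);
    size_t best = 0;
    clock_t best_time = 0;
    volatile double sink = 0.0;      /* keeps the timed calls from being optimized away */
    for (size_t v = 0; v < count; v++) {
        clock_t start = clock();
        for (int r = 0; r < repeats; r++) sink += versions[v](sample, n);
        clock_t elapsed = clock() - start;
        if (v == 0 || elapsed < best_time) { best = v; best_time = elapsed; }
    }
    (void)sink;
    return best;
}
```

The selected index could then be cached and reused for later calls with similar input sizes, which is a very small version of learning from prior experience.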

Next Time

My next blog post will cover two important subsections: numerical libraries and debugging tools.


This posting is part of a series summarizing the roadmap document of the Exascale Software Project:


The first layer that should be considered is the systems software. This posting has interesting points gathered from the International Exascale Software Project (IESP) Roadmap document, specifically the systems software section.

Systems software was identified as one of the paths to the new software stack of million-core machines. Systems software consists of five main areas: (1) Operating Systems, (2) Run-Time Systems, (3) I/O Systems, (4) Systems Management, and (5) External Environments.

In this posting I will be summarizing the first three areas: (1) Operating Systems, (2) Run-Time Systems, and (3) I/O Systems.

Operating Systems

Original content of this section contributed by: Barney Maccabe (ORNL), Pete Beckman (ANL), Fred Johnson (DOE).

The section starts by discussing the technology drivers for operating systems in the exascale era:

  1. The resources that operating systems must manage effectively will get more complex. For example, the increasing number of cores and their heterogeneity will make effective management of the shared bus and memory a critical factor in system performance.
  2. There will be an increasing emphasis on data-centric computation, and programming models will continue to emphasize the management of distributed memory resources.
  3. Multiple programming models may be used within a single program, which requires operating systems to provide common APIs in addition to architecture-specific ones.

Given these trends, the authors suggest two operating systems R&D alternatives to bridge the gap between rapid changes in hardware platforms and old operating systems:

  1. Develop operating systems for many-core machines from scratch, which will require a huge effort and might be impractical given the effort already invested in, and the industry’s reliance on, current operating systems.
  2. Evolve existing operating systems, which are burdened with old design concepts. However, this option is easier to adopt.

It is likely that operating systems will evolve gradually to adopt the new scope of resource management. Development efforts will start by defining a framework for HPC systems, which should take place in 2010 and 2011. The contributors believe the following areas should be researched actively:

  • Fault tolerant/masking strategies for collective OS services
  • Strategies and mechanisms for power/energy management
  • Strategies for simulating full-scale systems


Run-Time Systems

Original contributors of this section are: Jesus Labarta (BSC, ES), Rajeev Thakur (ANL), Shinji Sumimoto (Fujitsu)

The authors believe that “The design of tomorrow’s runtime systems will be driven not only by dramatic increases in overall system hierarchy and high variability in the performance and availability of hardware components, but also by the expected diversity application characteristics, the multiplicity of different types of devices, and the large latencies caused by deep memory subsystems.” These drivers impose two important design considerations on run-time systems: (1) power/energy constraints, and (2) application development cost. In other words, because run-time systems can provide a fairly accurate picture of resource utilization, they are in a position to get the best performance/power ratio in such massively parallel systems. Accordingly, there are two R&D alternatives for run-time systems:

  1. Flat-model run-time systems, which use message passing regardless of the target thread location (e.g. within the same node or at another node).
  2. Hierarchical-model run-time systems, which combine shared memory and message passing according to different run-time parameters, such as the message size, frequency of communication, etc.
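
The difference between the two alternatives can be sketched as a transport decision. The code below is only an illustration: the function names, the node identifiers and the size cutoff are all hypothetical, and the policy (shared memory for same-node messages up to a cutoff, message passing otherwise) is just one of many a hierarchical run-time could apply.

```c
#include <assert.h>
#include <stddef.h>

enum transport { SHARED_MEMORY, MESSAGE_PASSING };

/* Hypothetical cutoff above which even same-node transfers go through
   the message-passing path in this sketch. */
#define SHM_MAX_BYTES 65536

/* Flat model: message passing regardless of where the target lives. */
enum transport flat_transport(int src_node, int dst_node) {
    (void)src_node;
    (void)dst_node;
    return MESSAGE_PASSING;
}

/* Hierarchical model: pick the transport from run-time parameters,
   here the location of the target and the message size. */
enum transport hier_transport(int src_node, int dst_node, size_t bytes) {
    assert(src_node >= 0 && dst_node >= 0);
    if (src_node == dst_node && bytes <= SHM_MAX_BYTES)
        return SHARED_MEMORY;
    return MESSAGE_PASSING;
}
```

A fuller policy might also weigh the frequency of communication mentioned above; it is left out here to keep the sketch small.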


Based on these alternatives and the technology drivers for the run-time systems, it is recommended to work on four priority research directions:

  • Heterogeneity. Run-time systems should abstract the heterogeneity of architecture and make applications portable to different architectures.
  • Load Balance. “Research in this direction will result in self-tuned runtimes that will counteract at fine granularity unforeseen variability in application load and availability and performance of resources, thus reducing the frequency at which more expensive application-level rebalancing approaches will have to be used.”
  • Flat Run-Times. Run-time systems should be scalable to the expected number of cores while optimizing all run-time services such as message passing, synchronization, etc.
  • Hierarchical/hybrid runtimes. The question here is how run-times can be mapped to the semantics of different architectures without losing performance while keeping unified semantics across platforms. This may motivate researchers to experiment with different hierarchical integrations of runtimes to support models such as MPI+other threading or task-based models, threading models+accelerators, MPI+threading+accelerators, MPI+PGAS, and hierarchical task-based models with very different task granularities at each level.


I/O Systems

The original contributors of this section are: Alok Choudhary (Northwestern U.), Yutaka Ishikawa (U. of Tokyo, JP)

The authors believe that because I/O systems were designed as components separate from and independent of the compute infrastructure, they have already proven not to scale as needed. Therefore, “emerging storage devices such as solid-state disks (SSDs) or Storage Class Memories (SCM) have the potential to significantly alter the I/O architectures, systems, performance and the software system to exploit them. These emerging technologies also have significant potential to optimize power consumption. Resiliency of an application under failures in an exascale system will depend significantly on the I/O systems, its capabilities, capacity and performance because saving the state of the system in the form of checkpoints is likely to continue as one of the approaches.”

Based on these technology changes, the authors see the following possible research areas in I/O systems:

  • Delegation and Customization within I/O Middleware. Doing customization within the user space is a very good option, since information about the data semantics and usage patterns can be captured effectively at this level. This should be done not for a single process but possibly across all processes using a system. These middleware layers can use such information for intelligent and proactive caching, data reorganization, optimizations, and smoothing of I/O accesses from bursty to smooth patterns.
  • Active Storage and Online Analysis. Active storage involves utilizing available compute resources to perform data analysis, organization, redistribution, etc. Online analysis can reduce storage needs by storing metadata about the data and possibly regenerating the data when required.
  • Purpose-Driven I/O Software Layers. I/O systems will be aware of how data will be used, and data will be stored and indexed accordingly.
  • Software Systems for Integration of Emerging Storage Devices. Research and development of newer I/O models, and different layers of software systems including file system and middleware would be very important for the exploitation of these devices.
  • Extend Current File Systems.
  • Develop New Approach to Scalable Parallel File Systems.
  • Incorporate I/O into Programming Models and Languages. Integration would make it easier to predict the storage or reading pattern and accordingly build more efficient mechanisms, such as I/O caching, scheduling, pipelining, etc.
  • Wide-Area I/O and integration of external Storage Systems.
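
As an illustration of the middleware item at the top of this list (turning bursty small accesses into smoother large ones), here is a minimal write-aggregation sketch in plain C. Everything in it is an assumption made for the example: the struct, the agg_* names and the 4096-byte staging buffer are hypothetical, and the flush only counts bytes where real middleware would issue one large write to the file system.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define AGG_CAPACITY 4096   /* hypothetical staging-buffer size in bytes */

struct write_aggregator {
    char   buf[AGG_CAPACITY];
    size_t used;      /* bytes currently staged */
    size_t flushes;   /* number of large writes issued to storage */
    size_t flushed;   /* total bytes passed on to storage */
};

void agg_init(struct write_aggregator *a) {
    a->used = 0;
    a->flushes = 0;
    a->flushed = 0;
}

/* Stand-in for the storage call: one large write of everything staged. */
void agg_flush(struct write_aggregator *a) {
    if (a->used == 0) return;
    a->flushed += a->used;
    a->flushes++;
    a->used = 0;
}

/* Stage a small write; storage is touched only when the buffer would overflow. */
void agg_write(struct write_aggregator *a, const void *data, size_t len) {
    assert(len <= AGG_CAPACITY);
    if (a->used + len > AGG_CAPACITY) agg_flush(a);
    memcpy(a->buf + a->used, data, len);
    a->used += len;
}
```

In this sketch twenty 512-byte writes followed by a final flush reach storage as three larger writes, which is the kind of pattern smoothing the roadmap has in mind.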


Next Time

In my next posting I will summarize the other two areas falling under the systems software: Systems Management and External Environments. Meanwhile, tell me what you think about these areas as potential research directions for HPC systems working on million-core machines. Do you think that these changes will take place in the coming 10 years? Does your research area fall under any of them? Would you like to add more to these directions?
