Here are the slides I could capture in this session.

Server Overload

Let’s say you have a company with a website that lets members exchange messages with each other. At first only a few users talk to each other. Then those users tell their friends about your website, those friends tell other friends, and so on, until your servers can’t keep up with the demand. To work around this problem you either buy or rent more servers and set them up yourself, or you ask someone else to set them up for you. Of course, this approach costs a lot of money and takes a lot of time.

Now there is a better option: instead of hosting your website on your own dedicated server, you can host it in the cloud. With cloud computing you have access to computing power almost instantly when you need it. If you suddenly need more computing power, you can scale up on demand, so your clients won’t notice any difference. In general, with cloud computing you don’t focus on hardware or software; you focus only on what you need.

What’s Cloud Computing

Cloud Computing

Cloud computing is Internet-based computing in which shared resources, software and information are provided to computers and other devices on demand, like a public utility. (Wikipedia)

Oracle CEO Larry Ellison said: “Cloud Computing is using a computer that’s out-there”. The cloud in cloud computing provides the means through which everything, from computing power to computing infrastructure, applications, business processes and personal collaboration, can be delivered to you as a service wherever you need it.

The term “cloud” is used as a metaphor for the Internet and is an abstraction of the infrastructure it represents. Cloud computing typically provides online services that can be accessed through a web service or a browser, while the software and data are stored on servers.

Cloud Computing Characteristics

In general, cloud computing users or customers (us) do not own the physical infrastructure. We don’t have to worry about the machines that run the application because they don’t belong to us, so we don’t have to devote our time and resources to maintaining them. Instead, all the hardware and software are in the cloud. You just subscribe to the resources you need as a service and pay only for the resources that you use. As a result, overall computer utilization rises dramatically, as customers don’t have to engineer for peak load limits.

Cloud Computing Key Features

Cloud computing has many features. Some of them are:

– Cost

Cost is claimed to be greatly reduced, as you don’t need to buy any hardware or software; you are charged only for the amount of resources you use. An example of a free cloud computing service is Google Docs (http://docs.google.com/). With Google Docs you can create documents, spreadsheets and presentations online, so you don’t need to buy Microsoft Office for Windows or iWork for Mac. All you need is a PC that is capable of running a browser and an Internet connection.


– Virtualization

Virtualizing an application means packaging the application with everything it needs to run, including the database, middleware and operating system, into a self-contained image that can run anywhere in the cloud.

As we know, the cloud can offer software as a service, but applications are hard to deploy on new environments that must meet their requirements. So the idea was to separate the application from the infrastructure, so that the two become independent. Virtualization separates resources and services from the underlying physical delivery environment, allowing servers to be easily shared by many applications.


– Multitenancy

Multitenancy is the opposite of virtualization: in a multitenant environment, multiple customers share the same application, running on the same OS, on the same hardware. The separation between customers is achieved in the application design, so they don’t share or see each other’s data. Multitenancy also improves the utilization of the systems, as many customers use the same hardware.


– Scalability

Regarding hardware, scalability is the ability to go from small to large amounts of processing power with the same architecture. Regarding software products such as databases, it refers to the consistency of performance per unit of power as hardware resources increase.

The service provider can’t anticipate how customers will use the service. One customer might use the service three times a year during peak selling seasons, whereas another might use it as a primary development platform for all of its applications. Therefore, the service needs to be available all the time, and it has to be designed to scale upward for periods of high demand and downward for lighter ones. Scalability also means that an application can scale when additional users are added and when the application requirements change.


– Reliability

Reliability improves through the use of multiple redundant sites and through the ability of a system to provide uninterrupted service despite the failure of one or more of its components. This ability is called “fault tolerance”, which is a form of failure prevention.


– Maintenance

Since a third party owns the hardware and software and just provides them as a service, maintenance isn’t our problem: the cloud service provider is responsible for hardware maintenance and for software upgrades and maintenance.

Cloud Computing Layers


Cloud computing consists of five layers:

  1. Client.
  2. Application.
  3. Platform.
  4. Infrastructure.
  5. Servers.

Cloud infrastructure services, or “Infrastructure as a Service (IaaS)”, deliver computer infrastructure, typically a platform virtualization environment, as a service. Rather than purchasing servers, software, data center space or network equipment, clients buy those resources as a fully outsourced service. The service is typically billed on a utility computing basis, and the amount of resources consumed will typically reflect the level of activity. It is an evolution of virtual private server offerings.

The cloud infrastructure consists of one or more servers. The servers layer consists of computer hardware and software products that are specifically designed for the delivery of cloud services, including multi-core processors, cloud-specific operating systems and combined offerings.

There are two different types of services that a cloud can offer:

  1. SaaS (Software as a Service): Google is one of the free cloud providers that offer software as a service (Google Apps).
  2. HaaS (Hardware as a Service): Amazon’s EC2 is one of the providers that offer different computing power models (see the Amazon’s EC2 section).

Amazon’s EC2

Amazon Elastic Compute Cloud (Amazon EC2) is a web service that provides resizable compute capacity in the cloud. It is designed to make web-scale computing easier for developers. It provides a basic measure of an EC2 compute unit: One EC2 Compute Unit (ECU) provides the equivalent CPU capacity of a 1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor.

Amazon’s EC2 executes an application on a virtual computer that’s called an instance. You have the choice of several instance types, allowing you to select a configuration of memory, CPU, and instance storage that is optimal for your application.

Amazon EC2 instances are grouped into three families: Standard, High-Memory and High-CPU. Standard Instances have memory to CPU ratios suitable for most general purpose applications; High-Memory instances offer larger memory sizes for high throughput applications, including database and memory caching applications; and High-CPU instances have proportionally more CPU resources than memory (RAM) and are well suited for compute-intensive applications. When choosing instance types, you should consider the characteristics of your application with regards to resource utilization and select the optimal instance family and size.

Amazon EC2 also currently supports a variety of operating systems, including Red Hat Linux, Windows Server, openSUSE Linux, Fedora, Debian, OpenSolaris, CentOS, Gentoo Linux, and Oracle Linux.

Standard Instance

– They are well suited for most applications.

High Memory Instance

– They offer large memory sizes for high throughput applications, including database and memory caching applications.

High CPU Instance

– They offer more CPU resources than memory (RAM) and are well suited for compute-intensive applications.



Cacti is another cloud computing service (SaaS). It is a complete network graphing solution designed to handle statistics and signal processing data such as network bandwidth, temperatures and CPU load. It stores all of the information necessary to create graphs and populate them with data in a MySQL database.




Once developers found that raising the clock frequency of a processor was no longer an efficient way to increase its total performance (or would not work at all), research shifted to increasing the number of processors on the die and enhancing the communication between them, which started the multiprocessor “movement”.

The next trend was to design heterogeneous multiprocessors, composed of processors that differ in architecture, rather than homogeneous ones, as we have seen in previous presentations and blog posts with the IBM Cell Processor.

The latest trend is to offload certain applications or instructions from the CPU to the GPU, creating the GPGPU “movement”.


GPUs have a huge advantage over CPUs, which is shown in the following chart:

(Source: SuperComputing Tutorial 2007, Introduction by David Luebke)

GPUs are optimized for floating point operations as well as for parallelism. A typical multiprocessor consists of 4 cores (as with most of the Intel i-Series multiprocessors), while a GPU is composed of tens of processors. This is because CPUs can be considered memory-based processors, while GPUs can be called ALU-based, which allows the GPU to perform more operations in parallel, resulting in higher GFLOPS (FLOPS = floating point operations per second) than the CPU.

Basic structure of a typical CPU (left) and GPU (right) (Source: SuperComputing Tutorial 2007, Introduction by David Luebke)

What is CUDA?

CUDA is the missing link between the developer and the GPU. It was developed by NVIDIA and is implemented in all NVIDIA GPUs starting with the G80 series. Before there were programming architectures dedicated to the GPU, a programmer had to choose between dealing with the complex APIs of the GPU or “tricking” it by passing a texture that contains the data or the instructions to the GPU and then receiving the results in the form of a texture, which typically creates a lot of overhead.

CUDA processors are programmed in CUDA C, which is basically C/C++ with some CUDA extensions that are explained later on. It is important to know that in early versions of CUDA the GPU had to be programmed in C, while the CPU could be programmed in either C or C++. This matters when writing code, since the developer must know at all times whether the code is compiled for the CPU or the GPU. Starting with CUDA 3.0, more C++ features have been enabled for code compiled for the GPU.

CUDA Structure and Terminology

Thread : The smallest unit executing an instruction.

Block : Contains several threads.

Warp : A group of threads physically executed in parallel (usually running the same application).

Grid : Contains several thread blocks.

Kernel : An application or program that runs on the GPU.

Device : The GPU.

Host : The CPU.

(Source: NVIDIA CUDA Compute Unified Device Architecture Programming Guide Version 1.1)


Threads and blocks need unique IDs so that they can be addressed in code. This is important, since threads and blocks are the main components when it comes to writing efficient parallel code. Within each block a thread is uniquely identified by its Thread ID, which is an integer between 0 and n-1, where n is the total number of threads within the block. In the more complex case of 2D or 3D blocks, the Thread ID is calculated by a function rather than being a fixed integer that represents the number of the thread inside the block. Inside a 2D block of n x n threads, threads are positioned at (0,0), (0,1), … (n-1, n-1), as shown in the previous figure. Since the Thread ID must be of type uint3 (which will be considered a normal integer for now), something like (3,2) cannot be used directly as the Thread ID.

The function for getting the Thread ID in a 2D Block is:

x + y * Dimx,

where ‘x’ and ‘y’ are the x and y indices of the thread and Dimx the x dimension of the block (in the case of the figure Dimx would be equal to 5).

In a 3D block the same applies with the difference of dealing with one more dimension which results in the function:

x + y * Dimx + z * Dimx * Dimy,

where x, y and z are the indices (same as for the 2D block) and Dimx and Dimy are the x, y dimensions of the block.
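The two formulas can be sketched as one small helper. This is plain C++ rather than device code, and the name linear_thread_id and its parameter names are assumptions made for illustration; inside a real kernel the indices would come from the built-in thread index and the dimensions from the block dimension.

```cpp
#include <cassert>

// Hypothetical helper: flattens a thread position (x, y, z) into a single
// Thread ID for a block that is dim_x threads wide and dim_y threads high.
// For a 2D block pass z = 0; for a 1D block pass y = 0 and z = 0.
unsigned linear_thread_id(unsigned x, unsigned y, unsigned z,
                          unsigned dim_x, unsigned dim_y) {
    assert(x < dim_x && y < dim_y);  // the position must lie inside the block
    return x + y * dim_x + z * dim_x * dim_y;
}
```

For a 5 x 5 block, the thread at position (3, 2) gets the ID 3 + 2 * 5 = 13, so linear_thread_id(3, 2, 0, 5, 5) returns 13.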

When addressing a block within a grid the same rules of addressing a thread in a block apply. We do not deal with 3D grids in CUDA, though.

Memory Model

Each thread has a private set of registers, which are accessible in 0 clock cycles, and its own local memory. The registers are the fastest storage available to a thread; local memory, despite its name, resides in device memory and is as slow to access as global memory. Each block has a shared memory, which can be written to and read from by all threads in the same block (hence the name shared memory). The shared memory is the best and fastest way for threads to communicate. Storing data in the shared memory is expensive, though, because of its relatively small size, so only variables that are needed by all the threads should be stored in it. Each grid has a global memory, which is accessible from all the blocks inside it (and therefore also from the threads inside all the blocks). On a higher level, each processor has its own cache and the whole device (the GPU) has a DRAM.

Of course, the higher the level of the memory, the bigger its size, and with it the cost (in time units or clock cycles) of accessing it, be it a read or a write (if allowed). This is due to the increased distance between the unit trying to access the memory and the memory itself, and to the latency of the memory.


It is always most beneficial for the developer to run a block on a single processor. Consider an example in which a 5 x 5 block is being executed while 25 processors are idle (this is just an example; it is unlikely, almost impossible, to have such a large number of idle processors), which means that theoretically each thread could be run by its own processor. Since the program is optimised for parallelism (or else it would not make much sense to run it on a GPU, and even less sense to split it among 25 threads), there is a lot of data that must be shared between the threads. Since the threads run on different processors, this data cannot be put in the shared memory of the block and has to go to a higher-level memory (the DRAM in the worst case), and each thread has to access this memory to retrieve the data needed for its operation. As a simple comparison: issuing a memory access costs about 4 clock cycles for shared memory and for global memory alike, but an access to global memory costs an additional 400 – 600 clock cycles of memory latency.

Hence, dividing a block among several processors would in most cases shorten the pure execution time, BUT the time needed to fetch data from memory and the related idle time would result in much worse performance than running the block on a single processor.
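To get a feeling for the numbers, here is a rough back-of-the-envelope sketch in plain C++. It uses the cycle counts quoted above as fixed constants and assumes, pessimistically, that accesses happen one after another with no overlap; the function and constant names are made up for this illustration.

```cpp
// Cycle counts quoted in the text, treated here as rough fixed figures.
const unsigned kIssueCycles = 4;         // cost of issuing one memory access
const unsigned kGlobalLatencyMin = 400;  // extra global memory latency, best case
const unsigned kGlobalLatencyMax = 600;  // extra global memory latency, worst case

// Cycles spent on `accesses` shared memory accesses.
unsigned shared_access_cycles(unsigned accesses) {
    return accesses * kIssueCycles;
}

// Cycles spent on `accesses` global memory accesses for a given latency,
// assuming no access overlaps with another (a simplification of this sketch).
unsigned global_access_cycles(unsigned accesses, unsigned latency) {
    return accesses * (kIssueCycles + latency);
}
```

With one access per thread of the 5 x 5 block, this gives 100 cycles through shared memory against 10100 to 15100 cycles through global memory under these assumptions.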

What actually happens is that the processor in charge of a certain block divides the threads into warps, which are then executed in parallel. Each warp in the block is given a share of the execution time until the whole kernel is executed.

CUDA Extensions

There are four main extensions (excluding custom libraries) to C/C++ that make up CUDA C.

1. Function Type Qualifiers

These qualifiers are written when declaring a function and determine whether the function is called from and executed on the device or the host. There are three qualifiers:

  • __device__: the function is called and executed on the device
  • __global__: the function is called from the host and executed on the device
  • __host__: the function is called and executed on the host

A function with a CUDA qualifier has this form:

__device__ return_type function_name (parameters),

where of course “__device__” can be replaced by any of the function qualifiers.

The two qualifiers __host__ and __device__ can be combined to create a function that is compiled for both the host and the device.
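As a minimal sketch of the combined form, the function below is marked for both sides. The name axpy_element is hypothetical, and the preprocessor guard is only there so that the same source can also be tried with an ordinary C++ compiler when no CUDA compiler is at hand (in that case the qualifiers are simply defined away).

```cpp
// When this file is not compiled by the CUDA compiler, define the qualifiers
// away so that the same source builds with an ordinary C++ compiler.
#ifndef __CUDACC__
#define __host__
#define __device__
#endif

// Hypothetical example: one definition, compiled for the host and the device.
__host__ __device__ float axpy_element(float a, float x, float y) {
    return a * x + y;
}
```

On the host it can be called like any other function, for example axpy_element(2.0f, 3.0f, 1.0f); on the device the same call could appear inside a kernel.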

2. Variable Type Qualifiers

Similar to the function qualifiers, the variable qualifiers decide the lifetime of a variable and in which memory it is stored.

  • __device__: the variable has the lifetime of the application and is stored in the global memory, which makes it accessible from all blocks within the grid
  • __shared__: the variable has the lifetime of the block and is stored in the shared memory, hence accessible from all threads within the block

3. Execution Configuration

When calling a __global__ function, the dimensions of the grid and of the blocks in which this function is to be executed must be specified. This is called the execution configuration.

<<<GridDimension, BlockDimension, Ns, S>>>

GridDimension: Dimension of the Grid.

BlockDimension: Dimension of the Block.

Ns (optional): the number of bytes of shared memory to allocate dynamically per block for this function call.

S (optional): specifies the stream associated with this function call.
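In practice the first three values are usually computed from the problem size before the call. The following host-side sketch shows one way to do that; the struct LaunchConfig, the function make_launch_config and the choice of one float of shared memory per thread are all assumptions of this example, not part of CUDA.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical host-side description of a 1D execution configuration.
struct LaunchConfig {
    unsigned grid_dim;         // number of blocks in the grid
    unsigned block_dim;        // number of threads per block
    std::size_t shared_bytes;  // Ns: shared memory per block, in bytes
};

// Picks enough blocks to cover n elements with one thread per element and
// reserves one float of shared memory per thread (an arbitrary choice made
// for this sketch).
LaunchConfig make_launch_config(unsigned n, unsigned threads_per_block) {
    assert(threads_per_block > 0);
    LaunchConfig cfg;
    cfg.grid_dim = (n + threads_per_block - 1) / threads_per_block;  // round up
    cfg.block_dim = threads_per_block;
    cfg.shared_bytes = threads_per_block * sizeof(float);
    return cfg;
}
```

The three fields would then be passed, in this order, as the first three values of the execution configuration when launching a kernel.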

4. Built-in Variables

There are four variables that have been introduced to extend the C/C++ language. All of them are mainly associated with thread and block addressing:

  • gridDim -> specifies the dimension of the grid. Type: dim3
  • blockDim -> specifies the dimension of the block. Type: dim3
  • blockIdx -> the unique Block ID. Type: uint3
  • threadIdx -> the unique Thread ID. Type: uint3


The compiler used to compile CUDA C code, known as NVCC, is based on the PathScale Open64 compiler. Open64 is an open source compiler released under the GNU license. PathScale Open64 has been further developed by the company PathScale specifically for x86-64 and Itanium processors, which is the main reason why it is optimised for parallelism.

Two important features of the compiler are #pragma unroll and -use_fast_math, both of which can result in better performance when used.

#pragma unroll x: when written before a loop, the loop that follows is unrolled according to the optional number ‘x’ after #pragma unroll. There are three cases:

  1. x = 1: the following loop is not unrolled.
  2. x = n, where 1 < n: the loop is unrolled n times.
  3. no x: the whole loop is unrolled (provided its trip count is known at compile time).
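To make clear what unrolling means, here is a sketch that unrolls a loop four times by hand, which is roughly the transformation the pragma asks the compiler to perform. The pragma itself is not used, so the code builds with any C++ compiler; the function names are made up for this example.

```cpp
// Straightforward loop: one addition and one loop test per element.
float sum_rolled(const float* v, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; ++i) s += v[i];
    return s;
}

// The same loop unrolled four times by hand: one loop test per four
// additions, followed by a short loop for the leftover elements.
float sum_unrolled4(const float* v, int n) {
    float s = 0.0f;
    int i = 0;
    for (; i + 3 < n; i += 4) {
        s += v[i];
        s += v[i + 1];
        s += v[i + 2];
        s += v[i + 3];
    }
    for (; i < n; ++i) s += v[i];  // remainder when n is not a multiple of 4
    return s;
}
```

Both functions add the elements in the same order, so they return the same value; the unrolled version simply executes fewer loop tests and jumps.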

-use_fast_math is useful whenever the developer cares more about the performance of the code than about its accuracy. It improves performance by using faster math functions at the cost of less accurate results. The fast versions of the functions usually start with a double underscore: __mathfunction();.

Built-in Types

The built-in types are:

char1, uchar1, char2, uchar2, char3, uchar3, char4, uchar4, short1, ushort1, short2, ushort2, short3, ushort3, short4, ushort4,  int1, uint1, int2, uint2, int3, uint3, int4, uint4, long1, ulong1, long2, ulong2, long3, ulong3, long4, ulong4, float1, float2, float3, float4,

where the first part indicates the type as we know it (in C, C++ or Java), such as char, int or float, and the second part stands for the number of components of this type. For instance, the type float4 is a structure with 4 members, which are the actual values (the actual floats). Each of those floats is accessible through the fields x, y, z and w.

As an example, we declare a variable of type float4:

float4 variable_name = make_float4 (x, y, z, w);

To access x, y, z or w we write variable_name.x, variable_name.y, variable_name.z or variable_name.w, where variable_name should be replaced by the actual name of the variable, of course.
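A small sketch of how such a variable is built and read. With the CUDA compiler, float4 and make_float4 come from the CUDA headers; the stand-in definitions inside the guard exist only so the example can be tried with an ordinary C++ compiler, and sum_components is a hypothetical helper.

```cpp
#ifndef __CUDACC__
// Stand-ins for an ordinary C++ compiler; not needed with the CUDA compiler.
struct float4 { float x, y, z, w; };
inline float4 make_float4(float x, float y, float z, float w) {
    float4 v = {x, y, z, w};
    return v;
}
#endif

// Hypothetical helper: adds up the four components of a float4.
inline float sum_components(float4 v) {
    return v.x + v.y + v.z + v.w;
}
```

For example, after float4 v = make_float4(1.0f, 2.0f, 3.0f, 4.0f); the expression v.z is 3.0f and sum_components(v) is 10.0f.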

Example Code (serial -> parallel)

The following is an example of simple code in CUDA C. The function takes as input an integer n, a float a and two float pointers x and y, and stores in each cell y[i] the value a*x[i] + y[i]. Also shown is how this function is called.

(Source: SuperComputing Tutorial 2009 Introduction by David Luebke)

To convert this code to parallel we need to work on two things. First, we need to divide the work among the threads so that each thread executes one of the y[i] = a*x[i] + y[i] operations. To do so, each thread needs to be uniquely addressed and given a unique piece of work to execute. The second thing is the function call: we need to add the execution configuration (mentioned earlier) to the call to specify how many blocks and threads to allocate for this function. The following figure shows the code implemented in parallel.

(Source: SuperComputing Tutorial 2009, Introduction by David Luebke)

In the parallel implementation of saxpy (Scalar Alpha X Plus Y), int i serves as the ID of the thread and also corresponds to the location in the arrays to be read from. So thread i reads the i-th position of arrays x and y. As for the execution configuration, the function is given nblocks for the number of blocks, where nblocks is an integer depending on n, and the number of threads within each block is fixed at 256.
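Since the figures are not reproduced here, the following is a minimal sketch of the idea and not the code from the cited tutorial. It is plain C++ that emulates the grid on the host: two nested loops play the roles of the block index and the thread index, and the function names saxpy_thread and saxpy_emulated are assumptions of this example. Rounding nblocks up so that all n elements are covered is also an assumption.

```cpp
#include <cassert>

// Serial version: a single loop walks over all n elements.
void saxpy_serial(int n, float a, const float* x, float* y) {
    for (int i = 0; i < n; ++i) y[i] = a * x[i] + y[i];
}

// The work one CUDA thread would do. In a real kernel, block_idx, block_dim
// and thread_idx would be the x components of the built-in block index,
// block dimension and thread index.
void saxpy_thread(int block_idx, int block_dim, int thread_idx,
                  int n, float a, const float* x, float* y) {
    int i = block_idx * block_dim + thread_idx;  // unique index of this thread
    if (i < n) y[i] = a * x[i] + y[i];           // the last block may overhang n
}

// Host-side emulation of launching nblocks blocks of 256 threads each.
void saxpy_emulated(int n, float a, const float* x, float* y) {
    assert(n >= 0);
    const int threads_per_block = 256;
    const int nblocks = (n + threads_per_block - 1) / threads_per_block;
    for (int b = 0; b < nblocks; ++b)
        for (int t = 0; t < threads_per_block; ++t)
            saxpy_thread(b, threads_per_block, t, n, a, x, y);
}
```

On the GPU the two loops disappear: every (block, thread) pair runs the body of saxpy_thread at the same time, which is why the bounds check on i is needed when n is not a multiple of 256.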


To execute, each thread reads the instructions from memory (global or shared, depending on the instruction), then the actual execution happens, and then the result is written back to memory. This means that there is constant reading from and writing to memory, which is why the throughput heavily depends on the bandwidth between the threads and the memories, and also on the bandwidth between the CPU and the GPU. Another important aspect that always limits performance is memory latency. As discussed before, memory latency increases the higher the memory level is (global memory has a higher latency than shared memory), which is why it is smart to avoid accessing higher-level memories whenever possible. As said before, issuing an access to shared or global memory takes 4 clock cycles, while a further 400 – 600 clock cycles are consumed by memory latency when the access goes to global memory. Sometimes it is even more beneficial to recompute data rather than caching it, because of the memory latency.

Another thing that is bad for the total performance of the code is the use of “if, while, do, for and switch” statements, because they can make the execution paths of the threads within a warp diverge. A diverged warp is no longer executed in parallel: its paths must be serialised, and after serialisation the diverged warp must be synchronised again, which adds more instructions to be executed. Possible solutions for these statements are unrolling (as discussed earlier) and using branch predication instead of ‘if’ and ‘switch’ statements whenever possible.

Coalesced Memory

Another important thing to take care of while writing code is coalesced and uncoalesced memory access. A memory access is coalesced when each thread reads the address that corresponds to it. So if the base address for a certain block is n, then any thread i inside this block must access the address n + i * type_of_read, where type_of_read is the size of the element being read and must be 4, 8 or 16 bytes. Any other scheme results in uncoalesced memory access. The following figure shows coalesced memory access.

(Source: NVIDIA CUDA Compute Unified Device Architecture Programming Guide Version 1.1)

Both scenarios are examples of coalesced memory access. The right part of the figure shows coalesced access in which some threads do not participate, which results in a (relatively) insignificant loss of performance.

An example of uncoalesced memory access is shown in the next figure.

(Source: NVIDIA CUDA Compute Unified Device Architecture Programming Guide Version 1.1)

In the left example Thread 3 and Thread 4 are reading off of the wrong addresses, while in the right example all threads are shifted by 4 bytes, since the base address is 128, so Thread 0 should be reading off of address 128.
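The two failure cases can be checked with a few lines of host code. The sketch below assumes that thread i should read the byte address base + i * word_size and that the base must be a multiple of thread_count * word_size; both the function name is_coalesced and this alignment rule are assumptions of the example (the rule is how I recall the guide for this hardware generation), and threads that do not participate are not modelled.

```cpp
#include <cassert>
#include <cstddef>

// Returns true if the byte addresses in addr[0 .. thread_count-1] follow the
// simple coalesced pattern: an aligned base address, with thread i reading
// the i-th word after it.
bool is_coalesced(const std::size_t* addr, std::size_t thread_count,
                  std::size_t word_size) {
    assert(addr != 0);
    if (thread_count == 0 || word_size == 0) return false;
    const std::size_t base = addr[0];
    if (base % (thread_count * word_size) != 0) return false;  // misaligned base
    for (std::size_t i = 0; i < thread_count; ++i)
        if (addr[i] != base + i * word_size) return false;     // out of sequence
    return true;
}
```

With 16 threads reading 4-byte words, the addresses 128, 132, 136, … pass the check, the same addresses shifted by 4 bytes fail because of the misaligned base, and swapping the addresses of Thread 3 and Thread 4 fails because the sequence is broken.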

To make a concrete and strong argument for why coalesced access outperforms uncoalesced access, we take a look at the time needed to read 12M floats in three ways:

1. coalesced: 356 µs

2. coalesced (some threads do not participate): 357 µs -> that is why the decrease in performance was labelled as “(relatively) insignificant”

3. uncoalesced: 3494 µs


The shared memory is the fastest memory in a CUDA processor after the registers of the threads. This is a result of dividing the shared memory into banks, which allows threads to access the shared memory simultaneously. A problem that arises from the use of banks is bank conflicts, which result either from two or more threads trying to access the same bank, or from accessing elements whose size is not equal to 32 bits.

The first cause of bank conflicts is obvious: a bank can serve only a single thread at a time, so when two or more threads try to gain access, one thread is served and the rest are serialised. As for the second cause, consider the following: an array ELEMENTS is stored in the shared memory, where each element is 8 bits in size, which means that 4 elements are stored in one bank. Assume thread i is accessing ELEMENTS[i], which is stored in bank number j, and thread i+1 is accessing ELEMENTS[i+1]. With 32-bit elements, ELEMENTS[i+1] would typically be stored in bank number j+1. But since the elements are 8 bits in size, ELEMENTS[i+1] is stored in the same bank as ELEMENTS[i], which is bank number j. So in the end both thread i and thread i+1 try to access the same bank, although for different elements.
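The mapping from an array element to its bank can be written down in one line. The sketch below assumes 16 banks that are each 32 bits (4 bytes) wide, with successive 32-bit words assigned to successive banks, and an array that starts at offset 0 of the shared memory; the bank count and the name bank_of are assumptions of this example.

```cpp
#include <cassert>
#include <cstddef>

// Assumptions of this sketch: 16 banks, each 4 bytes (32 bits) wide.
const std::size_t kBankCount = 16;
const std::size_t kBankWidthBytes = 4;

// Bank that holds element `index` of an array whose elements are
// `element_size` bytes each and which starts at offset 0 of shared memory.
std::size_t bank_of(std::size_t index, std::size_t element_size) {
    assert(element_size > 0);
    const std::size_t byte_offset = index * element_size;
    return (byte_offset / kBankWidthBytes) % kBankCount;
}
```

With 4-byte elements, neighbouring indices land in neighbouring banks (bank_of(0, 4) is 0 and bank_of(1, 4) is 1), whereas with 1-byte elements the indices 0 to 3 all map to bank 0, which is exactly the conflict described above.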

CUDA Libraries

CUBLAS: CUDA accelerated Basic Linear  Algebra Subprograms

CUFFT: CUDA Fast Fourier Transform

MAGMA: Matrix Algebra on GPU Multicore  Architectures

CULA: implementation of LAPACK interface

CUDA Tools

CUDA – gdb Debugger

CUDA – Memory Checker

CUDA – Visual Profiler


More than C/C++

In order to allow a wider spectrum of developers to code in CUDA, the range of languages that can be compiled to CUDA has been broadened. There are converters from Fortran to CUDA (NOAA F2C-ACC), Python to CUDA (PyCUDA) and Java to CUDA (jaCUDA). Unfortunately, code generated by these converters is not fully optimised, which means that some manual optimisation is still needed to produce optimised CUDA code.


The slowing pace of commodity microprocessor performance improvements, combined with ever increasing chip power demands, has become of utmost concern to computational scientists. As a result, the high performance computing community is examining alternative architectures that address the limitations of modern cache-based designs. In this work, we examine the potential of using the forthcoming Cell processors as a building block for future high-end computing systems. Our work contains several contributions. First, we give some background on the Cell processor and its implementation history. Next, we give an overview of the architecture of the Cell processors. Additionally, we compare Cell performance to benchmarks run on other modern architectures. Then, we present some examples that apply the Cell Broadband Engine Architecture technology in their designs. We also give a brief overview of software development and engineering under the Cell architecture and of how software can double its efficiency and speed if it is implemented to use the Cell architecture. Finally, we discuss the future of the Cell architecture as one of the most advanced architectures on the market.


Sony Computer Entertainment, Toshiba Corporation and IBM formed an alliance in 2000 to design and manufacture a new generation of processors. The Cell was designed over a period of four years, using enhanced versions of the design tools for the POWER4 processor. Over 400 engineers from the three companies worked together in Austin, with critical support from eleven of IBM’s design centers. Cell combines a general purpose Power Architecture core of modest performance with streamlined co-processing elements which greatly accelerate multimedia and vector processing applications, as well as many other forms of dedicated computation. The Cell architecture includes a novel memory coherence architecture for which IBM received many patents. The architecture emphasizes efficiency per watt, prioritizes bandwidth over latency, and favors peak computational throughput over simplicity of program code. For these reasons, Cell is widely regarded as a challenging environment for software development. In 2005, Sony Computer Entertainment confirmed some specifications of the Cell processor that ships in its famous gaming console, the PlayStation 3. This Cell configuration has one Power processing element (PPE) on the core, with eight physical SPEs in silicon. The PS3’s Cell is the first Cell processor to be on the market. Although the Cell processor of the PS3 is not as advanced as the current Cell designs being developed at IBM plants, it competed with the most advanced processors on the market, proving the architecture’s efficiency.


Cell takes a radical departure from conventional multiprocessor or multicore architectures. Instead of using identical cooperating commodity processors, it uses a conventional high performance PowerPC core that controls eight simple SIMD cores, called synergistic processing elements (SPEs), where each SPE contains a synergistic processing unit (SPU), a local memory, and a memory flow controller. Access to external memory is handled via a 25.6 GB/s XDR memory controller. The cache coherent PowerPC core, the eight SPEs, the DRAM controller, and the I/O controllers are all connected via 4 data rings, collectively known as the EIB. The ring interface within each unit allows 8 bytes/cycle to be read or written. Simultaneous transfers on the same ring are possible. All transfers are orchestrated by the PowerPC core. Each SPE includes four single precision (SP) 6-cycle pipelined FMA datapaths and one double precision (DP) half-pumped (the double precision operations within a SIMD operation must be serialized) 9-cycle pipelined FMA datapath with 4 cycles of overhead for data movement. Cell has a 7-cycle in-order execution pipeline and forwarding network. IBM appears to have solved the problem of inserting a 13 (9+4) cycle DP pipeline into a 7-stage in-order machine by choosing the minimum effort/performance/power solution of simply stalling for 6 cycles after issuing a DP instruction. We now take each element individually, define it and describe it briefly.

2.1 Power Processor Element

The PPE is the Power Architecture based, two-way multithreaded core acting as the controller for the eight SPEs, which handle most of the computational workload. The PPE will work with conventional operating systems due to its similarity to other 64-bit PowerPC processors, while the SPEs are designed for vectorized floating point code execution. The PPE contains a 32 KiB instruction and a 32 KiB data Level 1 cache and a 512 KiB Level 2 cache. Additionally, IBM has included an AltiVec unit which is fully pipelined for single precision floating point. (AltiVec does not support double precision floating point vectors.) Each PPU can complete two double precision operations per clock cycle using a scalar fused multiply-add instruction, which translates to 6.4 GFLOPS at 3.2 GHz, or eight single precision operations per clock cycle with a vector fused multiply-add instruction, which translates to 25.6 GFLOPS at 3.2 GHz.
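The GFLOPS figures follow from a simple multiplication, sketched below in plain C++. The breakdown into lanes and operations per lane is my reading of the numbers (a fused multiply-add counts as 2 floating point operations, and a 128-bit vector holds 4 single precision values); the function name peak_gflops is made up for this example.

```cpp
#include <cassert>

// Peak throughput in GFLOPS:
//   lanes * floating point operations per lane per cycle * clock in GHz.
double peak_gflops(int simd_lanes, int flops_per_lane_per_cycle,
                   double clock_ghz) {
    assert(simd_lanes > 0 && flops_per_lane_per_cycle > 0 && clock_ghz > 0.0);
    return simd_lanes * flops_per_lane_per_cycle * clock_ghz;
}
```

A scalar fused multiply-add (1 lane, 2 operations) at 3.2 GHz gives 1 * 2 * 3.2 = 6.4 GFLOPS, and a 4-lane vector fused multiply-add gives 4 * 2 * 3.2 = 25.6 GFLOPS.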

2.2 Synergistic Processing Elements (SPE)

Each SPE is composed of a “Synergistic Processing Unit” (SPU) and a “Memory Flow Controller” (MFC), which comprises the DMA engine, the MMU, and the bus interface. An SPE is a RISC processor with a 128-bit SIMD organization for single and double precision instructions. In the current generation of the Cell, each SPE contains 256 KiB of embedded SRAM for instructions and data, called “Local Storage” (not to be mistaken for “Local Memory” in Sony’s documents, which refers to the VRAM). The local store is visible to the PPE and can be addressed directly by software. Each SPE can support up to 4 GiB of local store memory. The local store does not operate like a conventional CPU cache, since it is neither transparent to software nor does it contain hardware structures that predict which data to load. Each SPE contains a 128-bit, 128-entry register file and measures 14.5 mm² on a 90 nm process. An SPE can operate on 16 8-bit integers, 8 16-bit integers, 4 32-bit integers, or 4 single precision floating point numbers in a single clock cycle, as well as a memory operation. Note that the SPU cannot directly access system memory; the 64-bit virtual memory addresses formed by the SPU must be passed from the SPU to the SPE memory flow controller (MFC) to set up a DMA operation within the system address space.

In one typical usage scenario, the system loads the SPEs with small programs (similar to threads), chaining the SPEs together to handle each step in a complex operation. For instance, a set-top box might load programs for reading a DVD, video and audio decoding, and display, and the data would be passed from SPE to SPE until finally ending up on the TV. Another possibility is to partition the input data set and have several SPEs perform the same kind of operation in parallel.

At 3.2 GHz, each SPE gives a theoretical 25.6 GFLOPS of single precision performance. Compared to a modern personal computer, the relatively high overall floating point performance of a Cell processor seemingly dwarfs the abilities of the SIMD unit in desktop CPUs like the Pentium 4 and the Athlon 64. However, comparing only the floating point abilities of a system is a one-dimensional and application-specific metric. Unlike a Cell processor, such desktop CPUs are better suited to the general purpose software usually run on personal computers. In addition to executing multiple instructions per clock, processors from Intel and AMD feature branch predictors. The Cell is designed to compensate for the lack of a branch predictor with compiler assistance, in which prepare-to-branch instructions are created. For double precision, as often used in personal computers, Cell performance drops by an order of magnitude, but still reaches 14 GFLOPS. Recent tests by IBM show that the SPEs can reach 98% of their theoretical peak performance using optimized parallel matrix multiplication.

Toshiba has developed a co-processor powered by four SPEs, but no PPE, called the SpursEngine, designed to accelerate 3D and movie effects in consumer electronics.
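The per-cycle element counts and the 25.6 GFLOPS figure quoted above both follow from the 128-bit SIMD width. The Python sketch below shows the arithmetic. It assumes that a fused multiply-add is counted as two floating point operations per lane, which is the usual convention but is not stated explicitly in the text; the helper names are illustrative only.

```python
REGISTER_BITS = 128  # SIMD register width of an SPE, as stated in the text


def simd_lanes(element_bits: int) -> int:
    """Number of elements of the given width that fit in one 128-bit register."""
    return REGISTER_BITS // element_bits


def spe_peak_sp_gflops(clock_ghz: float, flops_per_lane: int = 2) -> float:
    """Single precision peak of one SPE.

    Assumption: each 32-bit lane retires one fused multiply-add per cycle,
    counted as 2 floating point operations.
    """
    return simd_lanes(32) * flops_per_lane * clock_ghz


for bits in (8, 16, 32):
    print(f"{bits}-bit elements per instruction: {simd_lanes(bits)}")

print(f"One SPE at 3.2 GHz: {spe_peak_sp_gflops(3.2):.1f} GFLOPS")
print(f"Eight SPEs at 3.2 GHz: {8 * spe_peak_sp_gflops(3.2):.1f} GFLOPS")
```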

2.3 Element Interconnect Bus (EIB)

The EIB is a communication bus internal to the Cell processor which connects the various on-chip system elements: the PPE processor, the memory controller (MIC), the eight SPE co-processors, and two off-chip I/O interfaces, for a total of 12 participants. The EIB also includes an arbitration unit which functions as a set of traffic lights. In some documents IBM refers to EIB bus participants as “units”.

The EIB is presently implemented as a circular ring consisting of four 16-byte-wide unidirectional channels which counter-rotate in pairs. When traffic patterns permit, each channel can convey up to three transactions concurrently. As the EIB runs at half the system clock rate, the effective channel rate is 16 bytes every two system clocks. At maximum concurrency, with three active transactions on each of the four rings, the peak instantaneous EIB bandwidth is 96 bytes per clock (12 concurrent transactions × 16 bytes wide / 2 system clocks per transfer).

Each participant on the EIB has one 16-byte read port and one 16-byte write port. The limit for a single participant is to read and write at a rate of 16 bytes per EIB clock (for simplicity often regarded as 8 bytes per system clock). Note that each SPU processor contains a dedicated DMA management queue capable of scheduling long sequences of transactions to various endpoints without interfering with the SPU’s ongoing computations; these DMA queues can be managed locally or remotely as well, providing additional flexibility in the control model.

Data flows on an EIB channel stepwise around the ring. Since there are twelve participants, the total number of steps around the channel back to the point of origin is twelve. Six steps is the longest distance between any pair of participants. An EIB channel is not permitted to convey data requiring more than six steps; such data must take the shorter route around the circle in the other direction. The number of steps involved in sending the packet has very little impact on transfer latency: the clock speed driving the steps is very fast relative to other considerations.

However, longer communication distances are detrimental to the overall performance of the EIB, as they reduce available concurrency. Despite IBM’s original desire to implement the EIB as a more powerful crossbar, the circular configuration they adopted to spare resources rarely represents a limiting factor on the performance of the Cell chip as a whole. In the worst case, the programmer must take extra care to schedule communication patterns in which the EIB is able to function at high concurrency levels.
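Two pieces of the description above are easy to check numerically: the 96 bytes per clock peak and the six-step limit on a twelve-participant ring. The Python sketch below does both. Numbering the participants 0 to 11 around the ring is a hypothetical labelling chosen for illustration; the text does not give the actual physical order.

```python
PARTICIPANTS = 12               # PPE, MIC, eight SPEs, two I/O interfaces
RINGS = 4                       # four unidirectional channels
TRANSACTIONS_PER_RING = 3       # maximum concurrent transactions per channel
CHANNEL_BYTES = 16              # channel width in bytes
SYSTEM_CLOCKS_PER_TRANSFER = 2  # the EIB runs at half the system clock


def peak_bytes_per_system_clock() -> float:
    """Peak instantaneous EIB bandwidth in bytes per system clock."""
    concurrent = RINGS * TRANSACTIONS_PER_RING
    return concurrent * CHANNEL_BYTES / SYSTEM_CLOCKS_PER_TRANSFER


def ring_steps(src: int, dst: int) -> int:
    """Steps taken on the shorter route between two ring positions.

    Positions 0..11 are a hypothetical labelling used only for this sketch.
    """
    clockwise = (dst - src) % PARTICIPANTS
    counter_clockwise = (src - dst) % PARTICIPANTS
    return min(clockwise, counter_clockwise)


print(peak_bytes_per_system_clock())
longest = max(ring_steps(a, b)
              for a in range(PARTICIPANTS) for b in range(PARTICIPANTS))
print(longest)
```

Because data always takes the shorter direction, no transfer ever needs more than half the ring, which is where the six-step limit comes from.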

2.4 Memory controller and I/O

Cell contains a dual-channel, next-generation Rambus XIO macro which interfaces to Rambus XDR memory. The memory interface controller (MIC) is separate from the XIO macro and is designed by IBM. The XIO-XDR link runs at 3.2 Gbit/s per pin. Two 32-bit channels can provide a theoretical maximum of 25.6 GB/s.

The system interface used in Cell, also a Rambus design, is known as FlexIO. The FlexIO interface is organized into 12 lanes, each lane being a unidirectional 8-bit-wide point-to-point path. Five lanes are inbound to Cell, while the remaining seven are outbound. This provides a theoretical peak bandwidth of 62.4 GB/s (36.4 GB/s outbound, 26 GB/s inbound) at 2.6 GHz. The FlexIO interface can be clocked independently, typically at 3.2 GHz. Four inbound and four outbound lanes support memory coherency.
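The bandwidth figures in this section can be reproduced with a few lines of Python. The XDR calculation uses only numbers from the text. For FlexIO, the quoted totals work out to 5.2 GB/s per 8-bit lane at 2.6 GHz, which implies two transfers per clock; that factor is inferred here from the totals and is an assumption of this sketch, not something the text states.

```python
def xdr_bandwidth_gb_s(channels: int = 2, bits_per_channel: int = 32,
                       gbit_per_pin: float = 3.2) -> float:
    """Memory bandwidth in GB/s: total pins times per-pin rate, bits to bytes."""
    return channels * bits_per_channel * gbit_per_pin / 8


def flexio_bandwidth_gb_s(lanes: int, clock_ghz: float = 2.6,
                          lane_bytes: int = 1,
                          transfers_per_clock: int = 2) -> float:
    """FlexIO bandwidth in GB/s for a group of 8-bit lanes.

    Assumption: transfers_per_clock=2 is inferred so that the result matches
    the totals quoted in the text.
    """
    return lanes * lane_bytes * transfers_per_clock * clock_ghz


outbound = flexio_bandwidth_gb_s(7)
inbound = flexio_bandwidth_gb_s(5)
print(f"XDR memory:      {xdr_bandwidth_gb_s():.1f} GB/s")
print(f"FlexIO outbound: {outbound:.1f} GB/s")
print(f"FlexIO inbound:  {inbound:.1f} GB/s")
print(f"FlexIO total:    {outbound + inbound:.1f} GB/s")
```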


High performance computing aims at maximizing performance on grand challenge problems such as protein folding and accurate real-time weather prediction. Whereas in the past performance improvements were obtained by aggressive frequency scaling using microarchitecture and manufacturing techniques, technology limits now require that future performance improvements come from exploiting parallelism with a multicore design approach. The Cell Broadband Engine is an exciting new execution platform that answers this design challenge for compute-intensive applications and reflects both the requirements of future computational workloads and manufacturing constraints.

The Cell B.E. is a heterogeneous chip multiprocessor architecture with compute accelerators achieving in excess of 200 Gflop/s per chip. The simplicity of the SPEs and the deterministic behavior of the explicitly controlled memory hierarchy make Cell amenable to performance prediction using a simple analytic model. Using this approach, one can easily explore multiple variations of an algorithm without the effort of programming each variation and running it on either a fully cycle-accurate simulator or hardware. With the newly released cycle-accurate simulator (Mambo), we have successfully validated our performance model for SGEMM, SpMV, and stencil computations, as will be shown in the subsequent sections.

Our modeling approach is broken into two steps, commensurate with the two-phase double-buffered computational model. In the first step, the kernels were segmented into code snippets that operate only on data present in the local store of the SPE. We sketched the code snippets in SPE assembly and performed static timing analysis. The latency of each operation, issue width limitations, and the operand alignment requirements of the SIMD/quadword SPE execution pipeline determined the number of cycles required. The in-order nature and fixed local store memory latency of the SPEs make the analysis deterministic and thus more tractable than on cache-based, out-of-order microprocessors.

In the second step, we construct a model that tabulates the time required for DMA loads and stores of the operands required by the code snippets. The model accurately reflects the constraints imposed by resource conflicts in the memory subsystem. For instance, concurrent DMAs issued by multiple SPEs must be serialized, as there is only a single DRAM controller. The model also presumes a conservative fixed DMA initiation latency of 1000 cycles. The model computes the total time by adding all the per-iteration (outer loop) times, which are themselves computed by taking the maximum of the snippet and DMA transfer times. In some cases the per-iteration times are constant across iterations, but in others they vary between iterations and are input-dependent. For example, in a sparse matrix, the memory access pattern depends on the nonzero structure of the matrix, which varies across iterations. Some algorithms may also require separate stages which have different execution times; for example, the FFT has stages for loading data, loading constants, local computation, transpose, local computation, bit reversal, and storing the results.

For simplicity we chose to model a 3.2 GHz, 8 SPE version of Cell with 25.6 GB/s of memory bandwidth. This version of Cell is likely to be used in the first release of the Sony PlayStation 3. The lower frequency had the simplifying benefit that both the EIB and DRAM controller could deliver two SP words per cycle. The maximum flop rate of such a machine would be 204.8 Gflop/s, with a computational intensity of 32 flops/word.
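The two-step model described above can be sketched in a few lines of Python. This is a simplified illustration, not the original authors' model: it assumes the fixed DMA latency is simply added to the transfer time of each iteration, and it ignores serialization between SPEs. The function names and the sample iteration data are hypothetical.

```python
CLOCK_HZ = 3.2e9            # modeled clock rate
MEMORY_BANDWIDTH = 25.6e9   # bytes per second
DMA_LATENCY_CYCLES = 1000   # conservative fixed DMA initiation latency


def compute_seconds(cycles: float) -> float:
    """Time for a code snippet whose cycle count came from static analysis."""
    return cycles / CLOCK_HZ


def dma_seconds(bytes_moved: float) -> float:
    """Assumed DMA cost: fixed initiation latency plus bytes over bandwidth."""
    return DMA_LATENCY_CYCLES / CLOCK_HZ + bytes_moved / MEMORY_BANDWIDTH


def total_seconds(iterations) -> float:
    """Sum over outer-loop iterations of max(compute time, DMA time).

    With double buffering, computation and data movement overlap, so each
    iteration costs whichever of the two takes longer.
    `iterations` is a sequence of (compute_cycles, bytes_moved) pairs.
    """
    return sum(max(compute_seconds(c), dma_seconds(b)) for c, b in iterations)


def flops_per_word(peak_flops: float = 204.8e9, word_bytes: int = 4) -> float:
    """Flops needed per single precision word loaded to stay compute-bound."""
    return peak_flops / (MEMORY_BANDWIDTH / word_bytes)


# Hypothetical workload: one compute-bound and one memory-bound iteration.
sample = [(10_000, 1_024), (1_000, 256_000)]
print(f"Estimated time: {total_seconds(sample) * 1e6:.3f} microseconds")
print(f"Computational intensity at peak: {flops_per_word():.0f} flops/word")
```

The last function shows where the 32 flops/word figure comes from: 204.8 Gflop/s divided by the 6.4 billion single precision words per second that 25.6 GB/s can deliver.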


Many products are now being built around Cell processors, and these hardware applications are expected to change what users can expect in terms of performance. Relying on the Cell processor as the computational engine can double the performance of these applications, giving users a new experience. This section surveys some of the applications being implemented at companies and research institutes around the world.

4.1 Blade Server

IBM announced the BladeCenter QS21. Generating a measured 1.05 giga floating point operations per second (GigaFLOPS) per watt, with a peak performance of approximately 460 GFLOPS, it is one of the most power-efficient computing platforms to date. A single BladeCenter chassis can achieve 6.4 tera floating point operations per second (TeraFLOPS), and a standard 42U rack over 25.8 TeraFLOPS.

4.2 Console Video Games

Sony’s PlayStation 3 game console contains the first production application of the Cell processor, clocked at 3.2 GHz and containing seven out of eight operational SPEs, which allows Sony to increase the yield of processor manufacturing. Only six of the seven SPEs are accessible to developers, as one is reserved by the OS. Although PS3 games have advanced and demanding graphics, they run smoothly thanks to the Cell processor’s cores.

4.3 Home Cinema

Reportedly, Toshiba is considering producing HDTVs using Cell. They have already presented a system that decodes 48 standard-definition MPEG-2 streams simultaneously on a 1920×1080 screen. This can enable a viewer to choose a channel based on dozens of thumbnail videos displayed simultaneously on the screen.

4.4 Super Computing

IBM’s planned supercomputer, IBM Roadrunner, will be a hybrid of general purpose CISC processors and Cell processors. It is reported that this combination will produce the first computer to run at petaflop speeds. It will use an updated version of the Cell processor, manufactured using 65 nm technology, with enhanced SPUs that can handle double precision calculations in the 128-bit registers, reaching 100 GFLOPS in double precision.

4.5 Cluster Computing

Clusters of PlayStation 3 consoles are an attractive alternative to high-end systems based on Cell blades. The Innovative Computing Laboratory, a group led by Jack Dongarra in the Computer Science Department at the University of Tennessee, investigated such an application in depth. Terra Soft Solutions is selling 8-node and 32-node PS3 clusters with Yellow Dog Linux preinstalled, an implementation of Dongarra’s research. As reported by Wired magazine on October 17, 2007, an interesting application of the PlayStation 3 in a cluster configuration was implemented by astrophysicist Dr. Gaurav Khanna, who replaced time used on supercomputers with a cluster of eight PlayStation 3s. The Computational Biochemistry and Biophysics Lab at the Universitat Pompeu Fabra in Barcelona deployed in 2007 a BOINC system called PS3GRID for collaborative computing based on the CellMD software, the first one designed specifically for the Cell processor.

4.6 Distributed Computing

With the help of the computing power of over half a million PlayStation 3 consoles, the distributed computing project Folding@Home has been recognized by Guinness World Records as the most powerful distributed network in the world. The first record was achieved on September 16, 2007, as the project surpassed one petaFLOPS, which had never been reached before by a distributed computing network. Additionally, the collective efforts enabled the PS3 consoles alone to reach the petaFLOPS mark on September 23, 2007. In comparison, the world’s most powerful supercomputer, IBM’s BlueGene/L, performs around 280.6 teraFLOPS. This means Folding@Home’s computing power is roughly 3.6 times BlueGene/L’s (1,000 / 280.6), although the CPU interconnect in BlueGene/L is more than one million times faster than the mean network speed in Folding@Home.


Software development for the Cell microprocessor involves a mixture of conventional development practices for the POWER architecture-compatible PPU core and novel software development challenges with regard to the functionally reduced SPU co-processors. As described in the previous sections, Cell processors are multicore designs with highly efficient parallelism, so software applications can double their performance if they make use of this architecture. For example, IBM implemented a Linux base that runs on the Cell processor in order to speed up software development for the Cell architecture. Some Linux distributions, such as Ubuntu and Yellow Dog Linux, built on this base to develop a fully functional operating system that runs on the Cell architecture. However, there is so far no reliable general purpose OS for this architecture, although most observers believe that reliable ones will appear soon.

6. CELL FUTURE (Cell inside)

It is widely believed that Cell processors will replace current architectures in personal computers within the next decade, thanks to their performance and efficiency in addition to their low production cost. With IBM already claiming that the Cell processor can run current PowerPC software, it is not hard to imagine Apple adopting it for future CPUs. A single 4.0 GHz Cell processor in an iBook or Mac mini would undoubtedly run circles around today’s 1.25-1.33 GHz entry-level Macs, and a quad-processor Power Mac at 4.0 GHz should handily outperform today’s 2.5 GHz Power Mac G5. Most software and hardware producers would then follow with products compatible with the Cell architecture.


This paper was prepared for academic purposes and submitted as a research report at the German University in Cairo.



ATI Radeon

Nowadays there are many graphics cards on the market. In the past it mattered little which card or GPU one bought, because applications were not demanding. Today’s cards are more advanced and capable of providing a much better graphical experience: they have evolved from the low-color 2D graphics of the past to the highly detailed 3D graphics of the present.

ATI produces a series of cards called Radeon, which succeeded an earlier series called Rage. The Canadian company established itself in the market by providing some of the best-known graphics cards. ATI GPUs can be found in gaming consoles such as the Wii and the Xbox 360, as well as in laptops.

It is therefore necessary to know how to choose a graphics card. The factors by which a graphics card should be judged or selected are:

  • Use: simple tasks such as web surfing and document writing, or more demanding ones such as 3D animation and gaming.
  • Price: is the price of the card within the user’s budget? A high price does not necessarily mean the card is good.
  • Core clock: the speed at which the graphics processor on the card operates.
  • Stream processors: the units responsible for rendering. A card needs a large number of them. Stream processors are also known as shader cores or thread processors.
  • FLOPS: the number of floating point operations per second. There are single precision (32-bit) operations and double precision (64-bit) operations.
  • Memory:
  1. Memory latency: the delay before the processor can access the memory. The problem is that in many cases the processor is faster than the memory.
  2. Bus width: the number of bits that can be transferred to or from memory at once.
  3. Memory clock: the speed at which the memory operates.
  4. Type: DDR, DDR2, GDDR3, GDDR4, or GDDR5.
  5. Memory bandwidth: the rate at which the card can move data to and from memory, calculated as the memory bus width multiplied by the memory clock speed.
  • Power consumption.
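The memory bandwidth rule in the list above (bus width times memory clock) can be written as a small Python helper. The example values are the Radeon HD 5870 memory figures quoted later in this document (256-bit bus, 2400 MHz, 153.6 GB/sec). Note that a plain width-times-clock product gives half the quoted bandwidth, so the sketch exposes a transfers_per_clock parameter; setting it to 2 is an assumption made here so that the quoted numbers agree.

```python
def memory_bandwidth_gb_s(bus_width_bits: int, memory_clock_mhz: float,
                          transfers_per_clock: int = 1) -> float:
    """Memory bandwidth in GB/s: bytes per transfer times transfers per second."""
    bytes_per_transfer = bus_width_bits / 8
    transfers_per_second = memory_clock_mhz * 1e6 * transfers_per_clock
    return bytes_per_transfer * transfers_per_second / 1e9


# Plain rule from the list: bus width multiplied by memory clock.
print(memory_bandwidth_gb_s(256, 2400))

# Assumption: two transfers per quoted clock, which matches 153.6 GB/sec.
print(memory_bandwidth_gb_s(256, 2400, transfers_per_clock=2))
```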

ATI Radeon’s Evergreen series:

The company introduced this series in 2009.

Products          | Code name | Examples
HD 5400           | Cedar     | Radeon HD 5450
HD 5500, HD 5600  | Redwood   | Radeon HD 5670
HD 5700           | Juniper   | Radeon HD 5770
HD 5800           | Cypress   | Radeon HD 5870
HD 5900           | Hemlock   | Radeon HD 5970

The architectures of the cards are closely related to each other, as the following figure shows.

[Figure: The Evergreen series architectures]

Because of this, a description of a component in one architecture largely applies to the others. The Hemlock architecture is closely related to the Cypress: a Hemlock card can be described as containing two Cypress GPUs.

The architecture that will be addressed in the rest of the document is the Cypress architecture.

[Figure: The architecture of the Cypress]

The Cypress consists of:

  • Command processor: issues commands and passes them to the graphics engine, which translates them into simpler forms.
  • Graphics engine: converts polygons and meshes into a simpler form of data, namely pixels.
  • Ultra-threaded dispatch processor: maximizes utilization and efficiency by dividing the workload among the processing engines. For example, if an engine can only handle 4×4 pixel blocks and the frame is 16×16 pixels, the frame is divided into (16×16)/(4×4) = 16 blocks, one for each of 16 engines.
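The dispatch example above amounts to cutting the frame into fixed-size blocks and handing each block to an engine. The Python sketch below illustrates the idea. The round-robin assignment is an assumption chosen for simplicity; the text does not describe the dispatch processor's actual scheduling policy, and the function names are hypothetical.

```python
def split_frame(frame_w: int, frame_h: int, block_w: int, block_h: int):
    """Return the (x, y) origin of every block, scanning row by row."""
    return [(x, y)
            for y in range(0, frame_h, block_h)
            for x in range(0, frame_w, block_w)]


def assign_round_robin(blocks, engine_count: int):
    """Map each engine index to the list of blocks it should process.

    Assumption: simple round-robin, used here only for illustration.
    """
    work = {engine: [] for engine in range(engine_count)}
    for index, block in enumerate(blocks):
        work[index % engine_count].append(block)
    return work


blocks = split_frame(16, 16, 4, 4)
print(len(blocks))

work = assign_round_robin(blocks, engine_count=16)
print(work[0], work[15])
```

With fewer engines than blocks, the same helper simply gives each engine several blocks in turn.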

[Figure: SIMD engines]


SIMD stands for “single instruction, multiple data”. It applies here because many components process multiple data items using the same operation.

Each SIMD engine contains 16 stream units and 4 texturing units. Each stream unit consists of five 32-bit stream processors. There are 20 SIMD engines in the Cypress.

The total number of stream processors is therefore 16 × 5 × 20 = 1600, and the number of texturing units is 4 × 20 = 80.
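The unit counts above are simple products of the per-engine figures. A short Python sketch, with hypothetical constant names, makes the relationship explicit and easy to adapt to other chips in the series.

```python
SIMD_ENGINES = 20                # SIMD engines in the Cypress
STREAM_UNITS_PER_ENGINE = 16     # stream units in each SIMD engine
PROCESSORS_PER_STREAM_UNIT = 5   # 32-bit stream processors per stream unit
TEXTURE_UNITS_PER_ENGINE = 4     # texturing units in each SIMD engine


def total_stream_processors(engines: int = SIMD_ENGINES) -> int:
    """Stream processors on the chip for a given number of SIMD engines."""
    return STREAM_UNITS_PER_ENGINE * PROCESSORS_PER_STREAM_UNIT * engines


def total_texture_units(engines: int = SIMD_ENGINES) -> int:
    """Texturing units on the chip for a given number of SIMD engines."""
    return TEXTURE_UNITS_PER_ENGINE * engines


print(total_stream_processors())
print(total_texture_units())
```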

  • Caches:

Each SIMD engine has an 8 KB L1 cache, for a total L1 cache size of 8 × 20 = 160 KB. Each of the 8 KB caches stores data unique to its SIMD engine.

The L1 cache bandwidth is 1 TB/sec, while the bandwidth between L1 and L2 is 435 GB/sec.

  • Stream Core:

Each stream core has five processors in total: four provide single precision computing, while the fifth is used for special functions such as sin, cos, tan, and exponential.

The stream processors are 32-bit. Together they can perform up to 2.7 teraflops in single precision, but for 64-bit (double precision) operations the rate drops to 544 gigaflops.
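The 2.7 teraflops and 544 gigaflops figures can be related to the processor count with a short calculation. The Python sketch below rests on three assumptions that are not stated in the text: a core clock of 850 MHz (the reference clock of the Radeon HD 5870), two floating point operations per stream processor per cycle (one multiply-add), and a double precision rate of one fifth of the single precision rate, which is inferred from the ratio of the two quoted figures.

```python
STREAM_PROCESSORS = 1600   # stream processor count of the Cypress
CORE_CLOCK_GHZ = 0.85      # assumption: HD 5870 reference core clock
FLOPS_PER_CYCLE = 2        # assumption: one multiply-add counts as 2 flops
DP_RATE_DIVISOR = 5        # assumption: inferred from 2.7 TFLOPS vs 544 GFLOPS


def single_precision_gflops() -> float:
    """Theoretical single precision peak in GFLOPS."""
    return STREAM_PROCESSORS * FLOPS_PER_CYCLE * CORE_CLOCK_GHZ


def double_precision_gflops() -> float:
    """Theoretical double precision peak in GFLOPS under the assumed ratio."""
    return single_precision_gflops() / DP_RATE_DIVISOR


print(f"Single precision: {single_precision_gflops():.0f} GFLOPS")
print(f"Double precision: {double_precision_gflops():.0f} GFLOPS")
```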

[Figure: The stream core]

  • Memory:
  1. 256-bit bus width.
  2. 1 GB GDDR5
  3. Memory clock speed: 2400 MHz
  4. 153.6GB/sec Bandwidth
  • Price: the price of the HD 5870 is $410.

The cost of one teraflop is therefore 410/2.7 ≈ $151.9 for single precision operations and 410/0.544 ≈ $753.7 for double precision operations. As expected, double precision is more expensive.

  • Power Consumption:

ATI aimed to reduce power consumption. The Cypress consumes 27 watts when idle and, in the worst case, a maximum of 188 watts. The maximum occurs when the user pushes the card to its highest performance, for example by overclocking; it does not occur for users who only use the card for simple purposes.

The power needed to provide one teraflop, in the worst case of maximum power, is:

               | Single precision      | Double precision
Maximum power  | 188/2.7 ≈ 69.6 watts  | 188/0.544 ≈ 345.6 watts
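Both the price per teraflop and the power per teraflop are the same kind of ratio, so one small Python helper covers them. The figures are the ones quoted in this section; the function name per_teraflop is only illustrative.

```python
def per_teraflop(total: float, teraflops: float) -> float:
    """Divide a total (dollars or watts) by the throughput in teraflops."""
    return total / teraflops


PRICE_USD = 410          # price of the HD 5870
MAX_POWER_WATTS = 188    # maximum power of the Cypress
SINGLE_TFLOPS = 2.7      # single precision throughput
DOUBLE_TFLOPS = 0.544    # double precision throughput

for label, tflops in (("single", SINGLE_TFLOPS), ("double", DOUBLE_TFLOPS)):
    cost = per_teraflop(PRICE_USD, tflops)
    power = per_teraflop(MAX_POWER_WATTS, tflops)
    print(f"{label} precision: ${cost:.1f} and {power:.1f} watts per teraflop")
```

The same helper can be reused to compare other cards once their price, maximum power, and throughput are known.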

CrossFireX technology:

CrossFireX enables the user to install up to four ATI Radeon cards on the same motherboard. The Evergreen series cards support CrossFireX.

The technology must also be supported by the motherboard. The next figure illustrates the combinations of cards and the motherboards that can provide this technology.

[Figure: Compatibility chart for the combinations of cards]

The technology operates using one of the following modes:

  • Scissors: a single frame is split between two cards. If the cards are identical, each card processes half of the screen. If the cards differ, the frame is divided according to the capabilities of each card: the faster card renders a larger portion than the slower card, so that both cards finish at the same time.
  • SuperTiling: the frame is divided into tiles like a chessboard, for example a 4×4 grid of tiles. One card renders tiles 1, 3, 5, and so on up to 15, and the other renders tiles 2, 4, 6, and so on up to 16.
  • Alternate Frame Rendering: while one card processes the current frame, another card processes the next frame.
  • Super AA: AA stands for anti-aliasing. This mode applies anti-aliasing to increase image quality.
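The first three modes above are different ways of dividing rendering work between cards, and each can be sketched as a small Python function. These are illustrations of the idea only, not ATI's implementation: the relative speed values are hypothetical inputs, and the SuperTiling sketch assigns tiles by the parity of row plus column, which is one way to obtain a chessboard pattern.

```python
def scissors_split(frame_rows: int, speeds: list) -> list:
    """Scissors: give each card a share of rows proportional to its speed."""
    total = sum(speeds)
    rows = [frame_rows * speed // total for speed in speeds]
    # Integer division can leave a few rows over; give them to the fastest card.
    rows[speeds.index(max(speeds))] += frame_rows - sum(rows)
    return rows


def supertiling_owner(row: int, col: int, cards: int = 2) -> int:
    """SuperTiling: chessboard assignment of tiles to cards."""
    return (row + col) % cards


def afr_owner(frame_index: int, cards: int = 2) -> int:
    """Alternate Frame Rendering: consecutive frames go to alternating cards."""
    return frame_index % cards


print(scissors_split(1080, [1, 1]))   # identical cards
print(scissors_split(1080, [2, 1]))   # first card twice as fast

for row in range(4):
    print([supertiling_owner(row, col) for col in range(4)])

print([afr_owner(i) for i in range(6)])
```

In the Scissors case, a proportional split is what lets both cards finish at about the same time, as the text describes.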

CrossFireX is very similar to nVidia’s SLI. The cards produced by the two companies are highly similar in components and architecture, which makes it easy to compare them and decide which card to purchase.

The series after Evergreen is called Northern Islands and is expected to reach the market at the end of 2010 or in 2011. The major difference is that Northern Islands uses a 32 nm fabrication process, while Evergreen uses 40 nm.

Northern Islands will exceed Evergreen in performance. However, with CrossFireX, using more than one Evergreen card together can help a user keep up in performance without upgrading to a new card, though not for long.







