In my earlier blog post I quickly went through how CPUs and GPUs each scale out their performance. I also mentioned how the APU tries to harness the best of both worlds. This time, let me quickly go through a simple example and show how APUs would present an excellent platform for solving it.

Consider the problem of parallel summation across a very large array. How would you solve this problem on a CPU? Here are the steps:

  1. Take an input array.
  2. Block it based on the number of threads (usually one per core – 4 or 8 cores).
  3. Iterate to produce a sum in each block.
  4. Reduce across threads.
  5. Vectorize your execution step through the SIMD ISA.

Have a look at the pseudocode below:

  // Summation over this thread's block (thread_num is this thread's index)
  float4 sum(0, 0, 0, 0);
  for (i = (n / threads_num) * thread_num; i < (n / threads_num) * (thread_num + 1); i++)
      sum += input[i];
  float scalarSum = sum.x + sum.y + sum.z + sum.w;
  // Reduction stage to aggregate the threads' results
  float reductionValue(0);
  for (t = 0; t < threads_num; t++)
      reductionValue += t_sum[t];
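The CPU steps above can be sketched in C++ roughly as follows. This is a minimal sketch, not the author's implementation: the names `sum_block` and `parallel_sum_cpu` are hypothetical, `std::thread` stands in for whatever threading layer you use, and vectorization is left to the compiler instead of an explicit `float4` type.

```cpp
#include <cstddef>
#include <functional>
#include <numeric>
#include <thread>
#include <vector>

// Sum one contiguous block; a simple loop the compiler is free to vectorize.
static void sum_block(const std::vector<float>& input, std::size_t begin,
                      std::size_t end, double& out) {
    double acc = 0.0;
    for (std::size_t i = begin; i < end; ++i) acc += input[i];
    out = acc;
}

// Blocked summation: one block per thread, then a serial reduction.
// (Hypothetical helper name, chosen for this illustration only.)
double parallel_sum_cpu(const std::vector<float>& input, unsigned threads_num) {
    if (threads_num == 0) threads_num = 1;
    const std::size_t n = input.size();
    std::vector<double> partial(threads_num, 0.0);
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < threads_num; ++t) {
        // Block boundaries; integer arithmetic spreads any remainder.
        const std::size_t begin = n * t / threads_num;
        const std::size_t end = n * (t + 1) / threads_num;
        workers.emplace_back(sum_block, std::cref(input), begin, end,
                             std::ref(partial[t]));
    }
    for (auto& w : workers) w.join();
    // Reduction stage to aggregate the per-thread results.
    return std::accumulate(partial.begin(), partial.end(), 0.0);
}
```

Calling `parallel_sum_cpu(data, 4)` on a four-core machine gives one block per core; each thread writes only its own slot of `partial`, so no locking is needed before the final reduction.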

Think now of an efficient implementation on the GPU:

  1. Take the input array.
  2. Block it based on the number of threads (16 per core; it could be up to 64 per core).
  3. Iterate to produce a sum in each block.
  4. Reduce/Sum across threads.
  5. Vectorize through a different kernel call due to the limitations of the current execution models.

In pseudocode:

  // Summation over this thread's block (thread_num is this thread's index)
  float64 sum(0,…,0);
  for (i = (n / threads_num) * thread_num; i < (n / threads_num) * (thread_num + 1); i++)
      sum += input[i];
  // Reduction stage to aggregate the threads' results
  float reductionValue(0);
  for (t = 0; t < threads_num; t++)
      reductionValue += t_sum[t];
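To make the GPU-style steps more concrete, here is a small CPU-side emulation in C++ of what one group of GPU threads might do. It is only a sketch under stated assumptions: `workgroup_sum` and `lanes` are hypothetical names, the lane count is assumed to be a power of two (64 mirrors the `float64` in the pseudocode), and the tree-shaped reduction is a commonly used technique, not necessarily the one the pseudocode has in mind. Real GPU code would be written as a kernel in something like OpenCL.

```cpp
#include <cstddef>
#include <vector>

// CPU emulation of one GPU-style group of `lanes` logical threads.
// Assumption: `lanes` is a power of two and at least 1.
double workgroup_sum(const std::vector<float>& input, std::size_t lanes = 64) {
    std::vector<double> lane_sum(lanes, 0.0);
    // Phase 1: lane l reads elements l, l + lanes, l + 2 * lanes, ...
    // so that neighbouring lanes touch neighbouring memory locations.
    for (std::size_t base = 0; base < input.size(); base += lanes)
        for (std::size_t l = 0; l < lanes && base + l < input.size(); ++l)
            lane_sum[l] += input[base + l];
    // Phase 2: tree reduction; each step halves the number of active lanes.
    for (std::size_t stride = lanes / 2; stride > 0; stride /= 2)
        for (std::size_t l = 0; l < stride; ++l)
            lane_sum[l] += lane_sum[l + stride];
    return lane_sum[0];
}
```

With 64 lanes the second phase finishes in six steps instead of 63 serial additions, which is the kind of fine-grained parallelism a GPU is built for.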

They don’t look so different from each other, right? You basically follow the same steps; the main differences are the number of cores and the number of threads. On GPUs you have far more threads to do the summation, which may complicate your model. These many threads also bring a lot of state management overhead, context switching, and problematic stack management.

On CPUs, data parallelism comes from a limited number of cores and threads. Narrow SIMD units simplify the problem, and high clock rates and caches make serial execution efficient for each thread. The simple mapping of tasks to threads also allows us to create complex task graphs. However, this comes at the cost of many loop iterations per thread. In other words, GPUs support very fine-grained data-parallel execution, while CPUs provide a coarse-grained data-parallel execution model.

APUs combine the two by supporting nested data-parallel code. Basically, CPUs take coarse-grained tasks and hand pieces of them to the on-chip GPU cores for faster execution of the finer-grained work. The close coupling of CPUs and GPUs eliminates the cost of moving data between them when executing this nested data-parallel model. CPUs can also handle conditional data-parallel execution much better than GPUs, and offloading computations becomes more efficient since there is virtually zero data copying involved.

Applications can now combine high and low degrees of threading at almost zero cost. Interesting execution models also become possible: you can have multiple kernels executing simultaneously and communicating through shared buffers with relatively low synchronization overhead. So, back to our example, we can now divide our array among the four CPU cores; each core can then offload the summation to the GPU threads and do the reduction at its level, and finally all the CPU cores can synchronize and do the last reduction with very low overhead.
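The nested scheme described above can be modelled in plain C++ as follows. Treat it as a sketch of the structure only: `fine_grained_sum` is a hypothetical stand-in for a GPU kernel launch (here it just runs on the calling CPU thread), `nested_sum` is an invented name, and the defaults of 4 cores and 64 lanes are illustrative assumptions. Passing a raw pointer to the shared array imitates the zero-copy offload.

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Stand-in for a GPU kernel: fine-grained sum over [begin, end) using
// `lanes` logical threads. On a real APU this would be a kernel launch.
static double fine_grained_sum(const float* data, std::size_t begin,
                               std::size_t end, std::size_t lanes) {
    std::vector<double> lane_sum(lanes, 0.0);
    for (std::size_t i = begin; i < end; ++i)
        lane_sum[(i - begin) % lanes] += data[i];
    double total = 0.0;  // reduction at the level of the owning CPU core
    for (double s : lane_sum) total += s;
    return total;
}

// Coarse-grained split across CPU cores, fine-grained work inside each.
double nested_sum(const std::vector<float>& input, unsigned cpu_cores = 4,
                  std::size_t lanes = 64) {
    if (cpu_cores == 0) cpu_cores = 1;
    if (lanes == 0) lanes = 1;
    const std::size_t n = input.size();
    std::vector<double> per_core(cpu_cores, 0.0);
    std::vector<std::thread> cores;
    for (unsigned c = 0; c < cpu_cores; ++c) {
        cores.emplace_back([&, c] {
            // Shared memory: hand over a pointer, not a copy of the block.
            per_core[c] = fine_grained_sum(input.data(), n * c / cpu_cores,
                                           n * (c + 1) / cpu_cores, lanes);
        });
    }
    for (auto& t : cores) t.join();  // synchronize the CPU cores
    double total = 0.0;              // final, cheap reduction on the CPU
    for (double s : per_core) total += s;
    return total;
}
```

The point of the structure is that the final reduction touches only `cpu_cores` values, while the bulk of the additions happens in the fine-grained inner level.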

So much for the possibilities of the APU architecture.

The question now is: how can we easily use all these capabilities without sacrificing performance? Moving from explicit data movement between CPUs and GPUs to shared memory spaces is tricky. CPUs use an explicit vector ISA and explicit memory access patterns, but GPUs depend on implicit vectors formed by multiple threads scheduled to access adjacent memory locations simultaneously. How can these two models be targeted by an easy, clear programming model with acceptable efficiency and true shared memory, so that we can freely pass pointers between the CPU and GPU cores? This will be the subject of my next blog post. Stay tuned!


As I’m heading home after three exciting days at AMD’s Fusion Developer Summit 2011, I’d like to share the findings, thoughts, and ideas I got out of this event. It had five fascinating tracks, each with around 10 sessions over the four days. The Programming Models track was the most interesting and exciting, at least to me. It is tightly coupled with the new AMD Fusion System Architecture (FSA), which brought with it a lot of new concepts. I can also see a lot of interesting challenges.

Let me take you through a series of posts sharing the excitement of these new innovations from AMD. I’ll start with a quick background on why APUs are a good answer to many computation problems, and then I’ll talk about their programming model.

So, the Fusion architecture is a reality now. It starts the era of heterogeneous computing for the common end user. It combines heavy-lifting x86 cores with super-fast, simpler GPU cores on the same chip. You have probably come across articles or research papers advertising the significant performance improvement that GPUs offer compared to CPUs. Such results often stem from poorly optimized CPU code and the inherently massive parallelism of the algorithms involved.

The APU architecture offers a balance between these worlds. GPU cores are optimized for arithmetic workloads and latency hiding, while CPU cores deal with branchy code, for which branch prediction and out-of-order execution are so valuable. They were built with different design goals in mind:

  • CPU design is based on maximizing the performance of a single thread. CPUs spend their transistor budget (or chip area) on branch prediction, out-of-order execution, extensive caching, and deep pipelines.
  • GPU design aims to maximize throughput at the cost of lower performance for each thread. GPUs spend their area on more cores of simpler design, leaving out branch prediction, out-of-order execution, and large caches.

Hence, these architectures hide memory latency in different ways.

In the CPU world, memory stalls are costly and harder to cover. Because of the multi-level cache hierarchy, it takes many cycles to service a cache miss. That’s why a larger cache is necessary to reduce memory stalls. Out-of-order execution also keeps the pipeline busy doing useful computation while cache misses are being served for other instructions.

GPUs, however, use different techniques to hide memory latency. They issue an instruction over multiple cycles; for example, a large vector executes on a smaller vector unit. This reduces instruction decode overhead and improves throughput. Executing many threads concurrently and interleaving their instructions fills the gaps in the instruction stream. So GPUs depend on the aggregate performance of all executing threads rather than on reducing the latency of a single thread. The GPU’s cache, in turn, is designed to exploit spatial locality rather than temporal locality. That’s why GPUs are very efficient at retrieving large vectors through the many memory banks they offer for SIMD-style data fetching.
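As a quick back-of-the-envelope illustration of issuing one instruction over multiple cycles, the arithmetic can be written as a tiny C++ helper. The function name `issue_cycles` is hypothetical, the model assumes the unit processes one batch of lanes per cycle, and the widths of 64 and 16 are used purely as an example.

```cpp
#include <cstddef>

// Cycles needed to issue one vector instruction when a logical vector of
// `vector_width` lanes runs on a physical unit of `unit_width` lanes,
// assuming the unit handles one batch of lanes per cycle.
constexpr std::size_t issue_cycles(std::size_t vector_width,
                                   std::size_t unit_width) {
    return (vector_width + unit_width - 1) / unit_width;  // ceiling division
}

// A 64-wide vector on a 16-wide unit occupies the unit for 4 cycles,
// so the instruction is decoded once but executed over 4 cycles.
static_assert(issue_cycles(64, 16) == 4, "64 lanes on a 16-wide unit");
```

Under this simplified model, a wider logical vector means fewer decodes for the same amount of arithmetic, which is where the throughput gain comes from.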

So choosing either of these two worlds comes with a cost. For example, the CPU’s large caches (to maximize cache hits) and its support for out-of-order execution consume much of the transistor budget available on the chip. GPUs, however, cannot handle branchy code efficiently; they are most effective on massively parallel algorithms that can be expressed with vectors and many independent threads. So each one suits a specific type of algorithm or problem domain. For a concrete case study, have a look at the comparison below of representatives of the CPU and GPU sides.

AMD Phenom II (x86):
  • 6 cores, 4-way SIMD (ALUs)
  • A single set of registers per core
  • Deep pipeline supporting out-of-order execution

AMD Radeon HD6070:
  • 24 simple cores, 16-way SIMD
  • 64-wide SIMD state (threads count per CU)
  • Multiple register sets shared
  • 8 or 16 SIMD engines per core

And this is when the Eureka! moment came to AMD’s engineers and researchers to reconsider the design of microprocessors and create the Accelerated Processing Units (APUs). Combining both architectures on a single chip may solve many problems efficiently, especially multimedia and gaming workloads. The E350 APU, for example, combines two “Bobcat” cores and two “Cedar”-like cores, which include 2- and 8-wide SIMD engines, on the same chip!

So let me take you through an example in my next post to quickly show you the current and future models on these APUs. I’ll also be writing about the run-time models, the software ecosystem of APUs, and the roadmap of the AMD Fusion System Architecture (FSA).

Posting the slides as they come out!
Speaker: Eric Demers, AMD corporate VP and CTO, Graphics Division

Here are the slides I could capture in this session.

Here are the day 2 keynote slides. To get them out quickly, I just posted them from my iPhone.

I’m out of energy to write and post right now. I decided to post photos of the slides of this panel discussion. Panelists were:

  • Richard Vuduc, Georgia Tech.
  • Wu-Chun Feng, Virginia Tech
  • Charles Moore, AMD

  • Summit Keynote: The Programmer’s Guide to the APU Galaxy
  • Summit Keynote: Compute Power and Energy-Efficiency: Partnerships, Standards and the ARM GPU Perspective
  • Real-Time Processing in OpenCL on GPUs
  • Heterogeneous HPC
  • Leveraging Multicore Systems for Hadoop and HPC Workloads
  • Natural UI – The Second User Interface Revolution
  • Faster Password Recovery with Modern GPUs
  • OpenCL Implementation on Heterogeneous Computing System for Real-time Rendering and Dynamic Updating of Dense 3-d Volumetric Data
  • M-JPEG Decoding Using OpenCL on Fusion
  • Optimizing Video Editing Software with OpenCL
  • Improving Presence with Depth Sensor Technology
  • Automatic Intra-Application Load Balancing for Heterogeneous Systems
  • Cilk Plus: Multi-core Extensions for C and C++
  • Success story – OpenCL
  • Advanced Rendering Techniques Using Fusion and OpenCL
  • A Methodology for Optimizing Data Transfer in OpenCL
  • OpenCL and OpenGL/DirectX Interoperability
  • Streaming video data into 3D applications
  • Accelerated Molecular Docking with OpenCL
  • Video Transcode Acceleration using AML on Heterogeneous Compute
  • OpenCL and the 13 Dwarfs
  • Multi-Media Content Management via Advanced Recognition Algorithms Leveraging Heterogeneous Computing
  • Accelerating Real-world Applications
  • Designing Natural Interfaces: Adventures in Multi-touch, Multi-user and Gestural Experiences
  • GPU Futures on the Mobile Web
  • TotalMedia Content Management & Using OpenCL GPU Acceleration
  • Fusion Enabled Video and Imaging Pipelines
  • AMD Graphics Core Next
  • Panel: Fusion Processors and HPC
  • The Fusion APU Architecture – A Programmers Perspective // Programming Models for Heterogeneous Computing
  • AMD’s x86 Open64 Compiler // GPU JIT (AKA Shader Compiler): from IL to ISA // Migration of Legacy Applications to Heterogeneous Architectures using HMPP
  • Diderot: A Parallel Domain-Specific Language for Image Analysis and Visualization // Pixel Bender // Domain Specific Tools to Expand the Code Synthesis Design Space
  • High Quality and Efficient Post Processing on GPU Compute // Real-time H.264 Video Enhancement Using AMD APP SDK // Using Fusion System Architecture for Broadcast Video
  • High-Performance Database Query Processing on a Hybrid CPU/GPU Architecture
