As I’m heading home after three exciting days at AMD’s Fusion Developer Summit 2011, I’d like to share the findings, thoughts and ideas I got out of this event. It had five fascinating tracks, each with around 10 sessions over the four days. The Programming Models track was the most interesting and exciting, at least to me. It is tightly coupled with the new AMD Fusion System Architecture (FSA), which brought with it a lot of new concepts. I can also see a lot of interesting challenges.

Let me take you through a series of posts sharing the excitement of these new innovations from AMD. I’ll start with a quick background on why APUs are a good answer to many computation problems, and then I’ll talk about their programming model.

So, the Fusion architecture is a reality now. It starts the era of heterogeneous computing for the common end-user. It combines the heavy-lifting x86 cores with simpler, super-fast GPU cores on the same chip. You have probably come across articles or research papers advertising the significant performance improvement that GPUs offer compared to CPUs. Such speedups are often the result of poorly optimized CPU code combined with the inherently massive parallelism of the algorithms.

The APU architecture offers a balance between these two worlds. GPU cores are optimized for arithmetic workloads and latency hiding, while CPU cores deal with the branchy code for which branch prediction and out-of-order execution are so valuable. They are built with different design goals in mind:

  • CPU design is based on maximizing the performance of a single thread. CPUs spend their transistor budget (or chip area) on branch prediction, out-of-order execution, extensive caching, and deep pipelines.
  • GPU design aims to maximize throughput at the cost of lower performance for each thread. GPUs spend the area on more cores of simpler design, without branch prediction, out-of-order execution, or large caches.

Hence, these architectures hide memory latency in different ways.

So, in the CPU world memory stalls are costly and hard to cover. Because of the several levels of the cache hierarchy, it takes many cycles to service a cache miss. That’s why a larger cache is necessary to reduce memory stalls. Also, out-of-order execution keeps the pipeline busy doing useful computations for other instructions while cache misses are being served.
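To see why the miss rate matters so much, here is a minimal sketch of the textbook "average access time" estimate (hit time plus miss rate times miss penalty). The function name and all the cycle counts and miss rates are made-up illustrations, not measurements of any real CPU.

```python
def average_access_cycles(hit_cycles, miss_rate, miss_penalty_cycles):
    """Average cycles per memory access: hit time + miss rate * miss penalty."""
    if not 0.0 <= miss_rate <= 1.0:
        raise ValueError("miss_rate must be between 0 and 1")
    return hit_cycles + miss_rate * miss_penalty_cycles


# Hypothetical numbers: a 4-cycle hit and a 200-cycle miss penalty.
small_cache = average_access_cycles(4, 0.10, 200)  # assume 10% of accesses miss
large_cache = average_access_cycles(4, 0.02, 200)  # assume 2% of accesses miss

print("smaller cache:", small_cache, "cycles per access")
print("larger cache: ", large_cache, "cycles per access")
```

Under these assumed numbers, cutting the miss rate shrinks the average cost of every access, which is the argument for spending chip area on a bigger cache.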

GPUs, however, use different techniques to hide memory latency. They issue an instruction over multiple cycles; for example, a large vector executes on a smaller vector unit. This reduces instruction decode overhead and improves throughput. Executing many threads concurrently and interleaving their instructions fills the gaps in the instruction stream. So, GPUs depend on the aggregate performance of all executing threads rather than on reducing the latency of a single thread. The GPU’s cache, however, is designed to exploit the spatial locality of the executing instructions rather than temporal locality. That’s why GPUs are very efficient at retrieving large vectors through the many memory banks they offer for SIMD-style data fetching.
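Both ideas can be put into rough numbers. The sketch below is a toy model, not a description of real hardware scheduling: `issue_cycles` and `simd_utilization` are hypothetical helper names, the 64-wide vector on a 16-lane unit is just a convenient example, and the 100-cycle memory stall is an assumed figure.

```python
import math


def issue_cycles(vector_width, simd_width):
    """Cycles needed to push one wide vector instruction through a narrower SIMD unit."""
    return math.ceil(vector_width / simd_width)


def simd_utilization(num_groups, busy_cycles, stall_cycles):
    """Toy model: each thread group computes for busy_cycles, then waits
    stall_cycles on memory. Interleaving more groups fills more of the stall."""
    return min(1.0, num_groups * busy_cycles / (busy_cycles + stall_cycles))


busy = issue_cycles(64, 16)  # one 64-wide instruction on a 16-lane unit
print("cycles per instruction:", busy)

for groups in (1, 8, 32):
    # stall_cycles=100 is an assumed memory latency, for illustration only
    print(groups, "groups ->", round(simd_utilization(groups, busy, 100), 3))
```

With a single group the unit sits idle most of the time; with enough groups in flight the stalls are fully covered, even though no individual thread got any faster.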

So choosing either of these two worlds comes with a cost. For example, the large caches that CPUs use to maximize cache hits, together with support for out-of-order execution, consume much of the transistor budget available on the chip. GPUs, however, cannot handle branchy code efficiently; they are most effective on massively parallel algorithms that can be expressed as vectors and many independent threads. So, each one suits a specific type of algorithm or problem domain. For a concrete case study, have a look at the comparison below of representatives of the CPU and GPU sides.
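Here is a small sketch of why branchy code hurts on SIMD hardware, assuming the common lane-mask scheme in which all lanes of a group share one instruction stream. The function name, the 64-lane group and the 10-cycle branch costs are illustrative assumptions only.

```python
def wavefront_cycles(takes_branch, then_cycles, else_cycles):
    """Cycles for one SIMD group to execute an if/else under lane masking.

    Each side of the branch is executed if at least one lane needs it;
    lanes on the other side sit idle while it runs.
    """
    cycles = 0
    if any(takes_branch):
        cycles += then_cycles
    if not all(takes_branch):
        cycles += else_cycles
    return cycles


uniform = [True] * 64                         # every lane takes the same path
divergent = [i % 2 == 0 for i in range(64)]   # neighbouring lanes disagree

print("uniform:  ", wavefront_cycles(uniform, 10, 10), "cycles")
print("divergent:", wavefront_cycles(divergent, 10, 10), "cycles")
```

In this model the divergent group pays for both sides of the branch, while a CPU thread would simply follow whichever path it needs.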

AMD Phenom II (x86 CPU):
  • 6 cores, 4-way SIMD (ALUs)
  • A single set of registers per core
  • Deep pipeline supporting out-of-order execution

AMD Radeon HD6070 (GPU):
  • 24 simple cores, 16-way SIMD
  • 64-wide SIMD state (thread count per CU)
  • Multiple register sets, shared
  • 8 or 16 SIMD engines per core
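Taking only the core counts and SIMD widths listed above at face value, a quick back-of-the-envelope calculation shows the gap in raw parallel lanes. This deliberately ignores clock speed, caches and everything else that decides real performance, and `simd_lanes` is just a hypothetical helper.

```python
def simd_lanes(cores, simd_width):
    """Total SIMD lanes: number of cores times lanes per core."""
    return cores * simd_width


cpu_lanes = simd_lanes(cores=6, simd_width=4)     # 6 cores, 4-way SIMD
gpu_lanes = simd_lanes(cores=24, simd_width=16)   # 24 simple cores, 16-way SIMD

print("CPU lanes:", cpu_lanes)
print("GPU lanes:", gpu_lanes)
print("ratio:", gpu_lanes // cpu_lanes)
```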

And this is when the Eureka! moment came to the AMD engineers and researchers: reconsider the design of microprocessors and build the Accelerated Processing Units (APUs). Combining both architectures on a single chip may solve many problems efficiently, especially multimedia and gaming-related ones. The E350 APU, for example, combines two “Bobcat” cores and two “Cedar”-like cores, which include 2- and 8-wide SIMD engines, on the same chip!

So let me take you through an example in my next post to quickly show you the current and future programming models on these APUs. I’ll also be writing about the run-time models, the software ecosystem of APUs, and the roadmap of the AMD Fusion System Architecture (FSA).