Multi-core programming


In my earlier blog post I quickly went through the approaches CPUs and GPUs take to scale out their performance. I also mentioned how the APU tries to harness the goodness of both worlds. This time, let me quickly go through a simple example and show how APUs present an excellent platform to solve this problem.

Consider the problem of parallel summation across a very large array. How would you solve this problem on a CPU? Here is the pseudocode:

  1. Take an input array.
  2. Block it based on the number of threads (usually one per core – 4 or 8 cores).
  3. Iterate to produce a sum in each block.
  4. Reduce across threads.
  5. Vectorize your execution step through the SIMD ISA.

Have a look at the code below:

  //Summation across all threads
  float4 sum(0, 0, 0, 0);
  for (i = (n/threads_num)*thread_id to (n/threads_num)*(thread_id + 1))
      sum += input[i];
  float scalarSum = sum.x + sum.y + sum.z + sum.w;
  //Reduction stage to aggregate the threads' results
  float reductionValue(0);
  for (t = 0 to threads_num)
      reductionValue += t_sum[t];   // t_sum[t] is thread t's scalarSum
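The pseudocode above maps naturally onto plain C++ threads. Here is a minimal sketch under a few assumptions of mine: the input is a `std::vector<float>`, the `float4` SIMD accumulator is simulated with a 4-element array (a real implementation would use SSE intrinsics or rely on auto-vectorization), and the function name `parallel_sum` is hypothetical:

```cpp
#include <array>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Sum `input` across `threads_num` CPU threads: each thread sums its own
// block with a 4-wide accumulator (standing in for a float4 SIMD register),
// then the per-thread partial sums are reduced on the main thread.
float parallel_sum(const std::vector<float>& input, unsigned threads_num) {
    std::vector<float> t_sum(threads_num, 0.0f);  // per-thread partial sums
    std::vector<std::thread> workers;
    const std::size_t block = input.size() / threads_num;

    for (unsigned t = 0; t < threads_num; ++t) {
        workers.emplace_back([&, t] {
            std::size_t begin = t * block;
            // The last thread picks up the remainder of the array.
            std::size_t end =
                (t + 1 == threads_num) ? input.size() : begin + block;

            std::array<float, 4> sum{};  // "float4" accumulator
            std::size_t i = begin;
            for (; i + 4 <= end; i += 4)
                for (std::size_t lane = 0; lane < 4; ++lane)
                    sum[lane] += input[i + lane];
            float scalarSum = sum[0] + sum[1] + sum[2] + sum[3];
            for (; i < end; ++i)       // leftover elements
                scalarSum += input[i];
            t_sum[t] = scalarSum;
        });
    }
    for (auto& w : workers) w.join();

    // Reduction stage: aggregate the threads' results.
    return std::accumulate(t_sum.begin(), t_sum.end(), 0.0f);
}
```

Note how the reduction across threads is just a short serial loop; with only 4 or 8 partial sums, it is not worth parallelizing.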

Think now of an efficient implementation on the GPU:

  1. Take the input array.
  2. Block it based on the number of threads (16 per core; it could be up to 64 per core).
  3. Iterate to produce a sum in each block.
  4. Reduce/Sum across threads.
  5. Vectorize through a different kernel call due to the limitations of the current execution models.

  //Summation across all threads
  float64 sum(0, …, 0);
  for (i = (n/threads_num)*thread_id to (n/threads_num)*(thread_id + 1))
      sum += input[i];
  //Reduction stage to aggregate the threads' results
  float reductionValue(0);
  for (t = 0 to threads_num)
      reductionValue += t_sum[t];   // t_sum[t] is thread t's partial sum

They don’t look so different from each other, right? You basically follow the same steps; the main differences are the number of cores and the number of threads. On GPUs you have way more threads to do the summation, which may complicate your model, and this many threads brings a lot of state-management overhead, context switching, and problematic stack management. On CPU cores you get data parallelism through a limited number of cores and threads; the narrow SIMD units simplify the problem, and high clock rates and caches make serial execution efficient for each single thread. Also, the simple mapping of tasks to threads allows us to create complex task graphs. However, this comes at the cost of many loop iterations. In other words, GPUs support very fine-grained data-parallel execution, while CPUs provide a coarse-grained data-parallel execution model.

APUs combine these by providing a nested data-parallel model. Basically, the CPU cores take coarse-grained tasks and break them down for the on-chip GPU cores, which execute the finer-grained tasks faster. The close coupling of the CPUs and GPUs eliminates the cost of moving data between them to execute this nested data-parallel model. Also, CPUs can handle conditional data-parallel execution much better than GPUs, and offloading computations becomes more efficient since there is virtually zero data copying in the offloading process.

Applications can now combine high and low degrees of threading at almost zero cost, and interesting execution models become possible. You can have multiple kernels executing simultaneously, communicating through a shared buffer with relatively low synchronization overhead. Back to our example: we can now divide the array among the four CPU cores; each core then offloads its summation to the GPU threads, does the reduction at its own level, and finally all the CPU cores synchronize and do the last reduction with very low overhead.
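To make that nested division concrete, here is a CPU-only simulation of it. This is a sketch under assumptions of mine: 4 coarse "CPU core" tasks and 64 fine "GPU thread" subtasks per block (both counts are illustrative), and the inner level is an ordinary loop where a real APU would launch a kernel; the name `nested_sum` is hypothetical:

```cpp
#include <algorithm>
#include <cstddef>
#include <future>
#include <numeric>
#include <vector>

// Simulation of the nested data-parallel model: 4 coarse-grained "CPU core"
// tasks each split their block across many fine-grained "GPU thread"
// subtasks, reduce at their own level, then a final reduction runs across
// the coarse tasks.
float nested_sum(const std::vector<float>& input) {
    constexpr unsigned coarse = 4;   // CPU cores
    constexpr unsigned fine = 64;    // "GPU threads" per offloaded block

    const std::size_t block = (input.size() + coarse - 1) / coarse;
    std::vector<std::future<float>> cores;

    for (unsigned c = 0; c < coarse; ++c) {
        cores.push_back(std::async(std::launch::async, [&, c] {
            std::size_t begin = std::min<std::size_t>(c * block, input.size());
            std::size_t end = std::min(begin + block, input.size());

            // "Offload": split this block across the fine-grained subtasks.
            std::size_t sub = (end - begin + fine - 1) / fine;
            float blockSum = 0.0f;
            for (unsigned f = 0; f < fine; ++f) {
                std::size_t b = std::min(begin + f * sub, end);
                std::size_t e = std::min(b + sub, end);
                blockSum += std::accumulate(input.begin() + b,
                                            input.begin() + e, 0.0f);
            }
            return blockSum;  // reduction at the "core" level
        }));
    }
    // Final low-overhead reduction across the CPU cores.
    float total = 0.0f;
    for (auto& core : cores) total += core.get();
    return total;
}
```

On a real APU the point is that handing the block to the fine-grained level costs no data copy, since both levels see the same memory.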

So these, in short, are the possibilities the APU architecture opens up.

The question now is: how can we easily use all these capabilities without sacrificing performance? Moving from explicit data movement between CPUs and GPUs to shared memory spaces is tricky. CPUs use an explicit vector ISA and explicit memory access patterns, while GPUs depend on implicit vectorization through multiple threads scheduled to access adjacent memory locations simultaneously. How can these two models be targeted in an easy, clear programming model with acceptable efficiency and true shared memory, one that lets us freely pass pointers between the CPU and GPU cores? This will be my next blog post. Stay tuned!

As I’m heading home after three exciting days at AMD’s Fusion Developer Summit 2011, I’d like to share with you the findings, thoughts, and ideas I got out of this event. It had five fascinating tracks, each with around 10 sessions over the four days. The Programming Models track was the most interesting and exciting, at least to me. It is tightly coupled with the new AMD Fusion System Architecture (FSA) and brought with it a lot of new concepts. I can also see a lot of interesting challenges.

Let me take you through a series of posts sharing the excitement of these new innovations from AMD. I’ll start with a quick background on why APUs are a good answer to many computation problems, and then I’ll talk about their programming model.

So, the Fusion architecture is a reality now. It starts the era of heterogeneous computing for the common end-user, combining heavy-lifting x86 cores with super-fast, simpler GPU cores on the same chip. You have probably come across articles or research papers advertising the significant performance improvements GPUs offer compared to CPUs. Such results are often due to comparing against poor CPU code and to the inherently massive parallelism of the chosen algorithms.

The APU architecture offers a balance between these worlds. GPU cores are optimized for arithmetic workloads and latency hiding, while CPU cores handle the branchy code for which branch prediction and out-of-order execution are so valuable. The two were built with different design goals in mind:

  • CPU design is based on maximizing the performance of a single thread. They spend their transistor budget (or chip area) on branch prediction, out-of-order execution, extensive caching, and deep pipelines.
  • GPU design aims to maximize throughput at the cost of lower performance for each thread. They use the area to provide more cores of simpler design, implementing neither branch prediction, out-of-order execution, nor large caches.

Hence, these architectures hide memory latency in different ways.

So, in the CPU world, memory stalls are costly and harder to cover: with several levels of cache hierarchy, a cache miss takes many cycles to serve. That’s why a larger cache is necessary to reduce memory stalls. Out-of-order execution also keeps the pipeline busy doing useful computation while cache misses are served for other instructions.

GPUs, however, use different techniques to hide memory latency. They issue an instruction over multiple cycles; for example, a large vector executes over several passes of a smaller vector unit. This reduces instruction decode overhead and improves throughput. Executing many threads concurrently by interleaving their instructions fills the gaps in the instruction stream. So GPUs depend on the aggregated performance of all executing threads rather than on reducing the latency of a single thread. The GPU’s cache, meanwhile, is designed to improve spatial locality of execution rather than temporal locality. That’s why GPUs are very efficient at retrieving large vectors through the many banks they offer for SIMD-fashioned data fetching.

So choosing either of these two worlds comes with a cost. For example, the large CPU caches that maximize cache hits, together with the hardware supporting out-of-order execution, consume much of the available transistor budget on the chip. GPUs, however, cannot handle branchy code efficiently; they are most effective on massively parallel algorithms that can be expressed as vectors and many independent threads. So each one suits a specific type of algorithm or problem domain. For a concrete case study, have a look at the table below comparing representatives of the CPU and GPU sides.

AMD Phenom II (x86):

  • 6 cores, each with 4-way SIMD (ALUs)
  • A single set of registers per core
  • Deep pipeline supporting out-of-order execution

AMD Radeon HD6970:

  • 24 simple cores, each with 16-way SIMD
  • 64-wide SIMD state (threads count per CU)
  • Multiple register sets shared
  • 8 or 16 SIMD engines per core

And this is when the Eureka! moment came to the AMD engineers and researchers: reconsider the design of microprocessors and build the Accelerated Processing Units (APUs). Combining both architectures on a single chip may solve many problems efficiently, especially multimedia- and gaming-related ones. The E-350 APU, for example, combines two “Bobcat” cores and a “Cedar”-like GPU with two 8-wide SIMD engines on the same chip!

So let me take you through an example in my next post to quickly show you the current and future programming models on these APUs. I’ll also be writing about the run-time models, the software ecosystem of APUs, and the roadmap of the AMD Fusion System Architecture (FSA).

Here are the slides I could capture in this session:

Here are the day 2 keynote slides. For agility, I just posted them from my iPhone.

More notes from an interesting session about APU performance. You may not find this anywhere else.

PS: if you were at that session and have some extra content/material to post here, please let me know.

[Update: I have some performance figures but the image is not clear. I’ll try to decipher it when I have some rest]

So, what’s really new in AMD’s APUs?

One of the key parts of the system is the data path between the GPU and memory.

  • Provides low-latency access for CPU cores (optimized around caches)
    • Random access, branchy, single-threaded, scalar code
  • Provides high-throughput access for GPU cores (optimized around latency hiding)
    • Streaming, vectorized, massively multithreaded, data-intensive code
  • Llano introduces two new buses for the GPU to access memory:
    • AMD Fusion Compute Link (ONION):
      • This bus is used by the GPU when it needs to snoop the CPU cache, so it is a coherent bus
      • It is used for cacheable system memory
    • Radeon Memory Bus (GARLIC):
      • This bus is directly connected to memory and can saturate memory bandwidth, so it is a non-coherent bus

The GPU in the Llano system

  • On Llano, the GPU core is still exposed as a separate graphics engine
    • The GPU is managed by the OS via drivers
      • Leverages existing driver stacks to support the current ecosystem
    • Memory is split into regular system memory and carved-out “local memory”
    • This allows the GPU memory controller to optimize throughput and the priorities of the GFX clients
  • Existing and familiar APIs can be used to access the GPU core
    • OpenCL, OpenGL, DirectX, and multimedia…

GPU in the system

  • Both the CPU and GPU have their own set of page tables, caches, and TLBs
    • The memory is generally not coherent
    • The GPU can probe the CPU cache…
    • …but the CPU relies on the driver for synchronization (map/unmap, lock/unlock, flush GPU caches)
  • The current programming model is a direct consequence:
    • A CPU access will page fault on a single access, and the OS will page in/out on demand
    • GPU accesses are known upfront, and the driver or OS will page in/out on scheduling (NOT on demand)

What is Zero copy?

  • Many different meanings:
    • A kernel accesses system memory directly for either read or write
    • A DMA transfer accesses system memory directly without copying into USWC
    • The CPU directly writes into local memory without doing any DMA
  • OpenCL offers several mechanisms to effectively reduce extra copying
  • OpenGL has some driver optimizations and some proprietary extensions
    • On Llano, this matters even more than on discrete GPUs because bandwidth is shared

CPU & GPU Memory Move Scenarios

CPU access to local memory

  • CPU writes into the local frame buffer
    • On Llano, this can peak at 8 GB/s
      • On discrete GPUs, this was limited by the PCIe bus to around 6 GB/s (or less)
    • The data first goes through the WC buffers on the CPU, then goes to the GPU core and back through the UNB to memory
  • CPU reads from the local framebuffer
    • Those are still very slow
      • Accesses are uncached
      • Only a single outstanding read is supported
      • Create the buffer with the CL_MEM_USE_PERSISTENT_MEM_AMD flag (OpenCL)

CPU access to USWC memory

  • CPU writes go through the WC buffers
    • This avoids polluting the CPU cache when it is known that there will be no cache hits for reads
    • This also allows further access by the GPU to this memory without snooping the cache
  • CPU reads will first flush the WC buffers, then will be uncached (slow)

CPU access to cacheable memory

  • CPU access to cacheable memory
    • This is the typical case in C++ code
    • Single-threaded performance: 8 GB/s for either read or write
    • Multithreaded performance: 13 GB/s for either read or write
  • The memory can be accessed by the GPU
    • Pages need to be made resident by the OS, and locked to prevent paging
    • Physical pages need to be programmed into the GPU HW virtual memory page tables

GPU access to local memory

  • GPU reads from the local framebuffer
    • This is the optimal path to memory
      • The Radeon Memory Bus (GARLIC) avoids any cache snooping
      • Memory is interleaved to increase throughput efficiency
    • Kernels and shaders can saturate DRAM bandwidth (measured at 17 GB/s)
  • GPU writes to the local framebuffer are similar (i.e. memcopy)
    • Kernels and shaders can saturate DRAM bandwidth (measured at 13 GB/s)

GPU access to USWC memory

  • GPU access to USWC memory uses the Radeon Memory Bus (GARLIC)
    • This memory does not have the same interleaving granularity as local memory
    • So performance is slightly lower than local memory, but faster than cacheable memory
    • Reads can saturate DRAM bandwidth (measured at 12 GB/s)
    • Writes are similarly fast but…
      • …usually avoided, since CPU reads are really slow from uncached space

GPU access to cacheable memory

  • This can be used directly by a kernel or for data upload to the GPU


  • WC: write combine buffers
    • There are 4 WC buffers per core
      • One WC buffer is automatically assigned for a write operation
      • If the writes are contiguous, then it is efficient
      • If there are many noncontiguous writes, then partial WC flushes will lower the efficiency
    • The WC buffers are automatically flushed to memory when the GPU is accessed
  • Cacheable memory
    • This is the traditional L1/L2 architecture on AMD CPUs
    • CPU accesses are fast for both read and write
    • Multithreading (from multiple cores) is often necessary to saturate full bandwidth
    • When the GPU access this type of memory, the caches are snooped to ensure coherency.
  • USWC: Uncached speculative write combined
    • CPU reads are uncached (slow); CPU writes go through the WC buffers
    • GPU access to this type of memory does not need CPU cache probing
  • Local video memory:
    • Memory managed by the graphics driver, not available to the OS for generic CPU processes
  • Memory pinning and locking
    • Operation done by the OS for access of the system pages by the GPU:
      • Make the page resident (no longer in the swap file)
      • Remove this page from regular CPU paging operation
      • Program the GPU virtual memory to map the pages into a contiguous virtual address range
  • TLB: Translation Lookaside buffer
    • A dedicated cache used to store the results of page translations (both CPU and GPU)
  • UNB: unified north bridge
    • Arbitrates memory traffic from the GPU clients and the CPU cores.



Quickly introducing the App Profiler and Kernel Analyzer. I couldn’t catch all the details, but here is what I could write down.


AMD App Profiler

What is AMD App Profiler?

  • A performance analysis tool that gathers data from the OpenCL run-time and AMD APUs and GPUs during the execution of an OpenCL app
  • Integrates into MS Visual Studio 2008 and 2010
  • Command-line utility program for Windows and Linux platforms
  • Supports OpenCL and DirectCompute
  • No code or project modification to the target application necessary

Key features

  • API Trace View: view API input arguments and output results
  • Summary pages: find API hotspots, top ten data transfer and kernel execution operations
  • API trace analyzer: identify failed API calls, resource leaks, and best practices

What can app profiler do for you?

Timeline visualization: visualize OpenCL execution in a timeline chart

  • View the number of OpenCL contexts and command queues created and the relationships between these items
  • View host and device execution
  • Determine proper synchronization

Session profile view: analyze OpenCL kernel execution for AMD Radeon GPUs

  • Collect GPU performance counters
    • The number of ALU, global and local memory instructions executed
    • GPU utilization and memory access characteristics
    • Shader Compiler VLIW packing efficiency
  • Show the kernel resource usage


AMD App Kernel Analyzer

Key Features

  • Compile, analyze, and disassemble an OpenCL kernel for multiple Catalyst driver versions
  • …………


Best Performance Optimization Practices

  • Collect an API trace first
  • Remove all errors and warnings reported by the API trace analyzer if possible
  • Use timeline visualization to determine possible synchronization issues
  • Find the top bottlenecks of the application using the summary pages
  • Drill down into the most expensive kernels using the session profile view



I’ll attend AMD’s Fusion Developer Summit in Bellevue, WA from June 13th to 16th. Stay tuned for my related blog posts and my tweets during the days as well.

You can follow my Twitter account for updates.

Also feel free to send me your questions about it.
