I’m out of energy to write and post right now, so I decided to post photos of the slides from this panel discussion instead. Panelists were:

  • Richard Vuduc, Georgia Tech
  • Wu-Chun Feng, Virginia Tech
  • Charles Moore, AMD


  • Summit Keynote: The Programmer’s Guide to the APU Galaxy
  • Summit Keynote: Compute Power and Energy-Efficiency: Partnerships, Standards and the ARM GPU Perspective
  • Real-Time Processing in OpenCL on GPUs
  • Heterogeneous HPC
  • Leveraging Multicore Systems for Hadoop and HPC Workloads
  • Natural UI – The Second User Interface Revolution
  • Faster Password Recovery with Modern GPUs
  • OpenCL Implementation on Heterogeneous Computing System for Real-time Rendering and Dynamic Updating of Dense 3-d Volumetric Data
  • M-JPEG Decoding Using OpenCL on Fusion
  • Optimizing Video Editing Software with OpenCL
  • Improving Presence with Depth Sensor Technology
  • Automatic Intra-Application Load Balancing for Heterogeneous Systems
  • Cilk Plus: Multi-core Extensions for C and C++
  • Success story – OpenCL
  • Advanced Rendering Techniques Using Fusion and OpenCL
  • A Methodology for Optimizing Data Transfer in OpenCL
  • OpenCL and OpenGL/DirectX Interoperability
  • Streaming video data into 3D applications
  • Accelerated Molecular Docking with OpenCL
  • Video Transcode Acceleration using AML on Heterogeneous Compute
  • OpenCL and the 13 Dwarfs
  • Multi-Media Content Management via Advanced Recognition Algorithms Leveraging Heterogeneous Computing
  • Accelerating Real-world Applications
  • Designing Natural Interfaces: Adventures in Multi-touch, Multi-user and Gestural Experiences
  • GPU Futures on the Mobile Web
  • TotalMedia Content Management & Using OpenCL GPU Acceleration
  • Fusion Enabled Video and Imaging Pipelines
  • AMD Graphics Core Next
  • Panel: Fusion Processors and HPC
  • The Fusion APU Architecture – A Programmers Perspective // Programming Models for Heterogeneous Computing
  • AMD’s x86 Open64 Compiler // GPU JIT (AKA Shader Compiler): from IL to ISA // Migration of Legacy Applications to Heterogeneous Architectures using HMPP
  • Diderot: A Parallel Domain-Specific Language for Image Analysis and Visualization // Pixel Bender // Domain Specific Tools to Expand the Code Synthesis Design Space
  • High Quality and Efficient Post Processing on GPU Compute // Real-time H.264 Video Enhancement Using AMD APP SDK // Using Fusion System Architecture for Broadcast Video
  • High-Performance Database Query Processing on a Hybrid CPU/GPU Architecture


More notes from an interesting session about APU performance. You may not find this anywhere else.

PS: if you were at that session and have some extra content/material to post here, please let me know.

[Update: I have some performance figures but the image is not clear. I’ll try to decipher it once I’ve had some rest]

So, what’s really new in AMD’s APUs?

One of the key parts of the system is the data path between the GPU and memory.

  • Provides low-latency access for CPU cores (optimized around caches)
    • Random access, branchy, single-threaded, scalar code
  • Provides high-throughput access for GPU cores (optimized around latency hiding)
    • Streaming, vectorized, massively multithreaded, data-intensive code
  • Llano introduced two new buses for the GPU to access memory:
    • AMD Fusion Compute Link (ONION):
      • This bus is used by the GPU when it needs to snoop the CPU cache, so it is a coherent bus
      • It is used for cacheable system memory
    • Radeon Memory Bus (GARLIC):
      • This bus is connected directly to memory and can saturate memory bandwidth, so it is a non-coherent bus

The GPU in the Llano system

  • On Llano, the GPU core is still exposed as a separate graphics engine
    • The GPU is managed by the OS via drivers
      • Leverages existing driver stacks to support the current ecosystem
    • Memory is split into regular system memory and carved-out “local memory”
    • This allows the GPU memory controller to optimize throughput and priorities of the GFX clients
  • Existing and familiar APIs can be used to access the GPU core
    • OpenCL, OpenGL, DirectX, and multimedia APIs…

GPU in the system

  • Both CPU and GPU have their own set of page tables, caches and TLBs
    • The memory is generally not coherent
    • The GPU can probe the CPU cache…
    • …but the CPU relies on the driver for synchronization (map/unmap, lock/unlock, flush GPU caches)
  • The current programming model is a direct consequence:
    • CPU accesses will page fault on a single access, and the OS will page in/out on demand
    • GPU accesses are known upfront, and the driver or OS will page in/out on scheduling (NOT on demand)

What is Zero copy?

  • Many different meanings:
    • A kernel accesses system memory directly for either read or write
    • A DMA transfer accesses system memory directly without copying into USWC
    • The CPU directly writes into local memory without doing any DMA
  • OpenCL offers several mechanisms to effectively reduce extra copies
  • OpenGL has some driver optimizations and some proprietary extensions
    • On Llano, this matters even more than on discrete GPUs because bandwidth is shared
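As a concrete illustration of one such mechanism, here is a hedged C fragment showing the common OpenCL zero-copy pattern using `CL_MEM_ALLOC_HOST_PTR` plus map/unmap. This is a sketch of generic OpenCL 1.1 API usage, not code from the session; it assumes an already-created context `ctx`, command queue `queue`, error variable `err`, and buffer size `nbytes`:

```c
/* Ask the runtime to allocate host-visible memory for the buffer, so the
 * device can read it in place instead of through an extra staging copy. */
cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR,
                            nbytes, NULL, &err);

/* Map the buffer and write the input data directly into it. */
float *p = (float *)clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_WRITE,
                                       0, nbytes, 0, NULL, NULL, &err);
for (size_t i = 0; i < nbytes / sizeof(float); ++i)
    p[i] = (float)i;

/* Unmap before a kernel uses the buffer; no clEnqueueWriteBuffer needed. */
clEnqueueUnmapMemObject(queue, buf, p, 0, NULL, NULL);
```

Whether the copy is actually elided depends on the driver; the session’s point is that on Llano, where bandwidth is shared, such paths matter even more than on discrete GPUs.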

CPU & GPU Memory Move Scenarios

CPU access to local memory

  • CPU writes into the local frame buffer
    • On Llano, this can peak at 8 GB/s
      • On discrete GPUs, this was limited by the PCIe bus to around 6 GB/s (or less)
    • The data first goes through the WC buffers on the CPU, then goes to the GPU core and back through the UNB to memory
  • CPU reads from the local framebuffer
    • These are still very slow
      • Accesses are uncached
      • Only a single outstanding read is supported
      • Create the buffer with the CL_MEM_USE_PERSISTENT_MEM_AMD flag (OpenCL)

CPU access to USWC memory

  • CPU writes go through the WC buffers
    • This avoids polluting the CPU cache when it is known that there will be no cache hits for reads
    • This allows further access by the GPU to this memory without snooping the cache
  • CPU reads will first flush the WC buffers, then will be uncached (slow)

CPU access to a cacheable memory

  • CPU access to cacheable memory
    • This is the typical case in C++ code
    • Single-threaded performance: 8 GB/s for either read or write
    • Multithreaded performance: 13 GB/s for either read or write
  • The memory can be accessed by the GPU
    • Pages need to be made resident by the OS, and locked to prevent paging
    • Physical pages need to be programmed into the GPU HW virtual memory page tables

GPU access to local memory

  • GPU reads from the local framebuffer
    • This is the optimal path to memory
      • The Radeon memory bus (GARLIC) avoids any cache snooping
      • Memory is interleaved to increase throughput efficiency
    • Kernels and shaders can saturate DRAM bandwidth (measured at 17 GB/s)
  • GPU writes to the local framebuffer are similar (i.e. memcopy)
    • Kernels and shaders can saturate DRAM bandwidth (measured at 13 GB/s)

GPU access to USWC memory

  • GPU accesses to USWC memory use the Radeon memory bus (GARLIC)
    • This memory does not have the same interleaving granularity as local memory
    • So slightly lower performance than local memory, but faster than cacheable memory
    • Reads can saturate DRAM bandwidth (measured at 12 GB/s)
    • Writes are similarly fast but…
      • …usually avoided, since CPU reads are really slow from uncached space

GPU access to cacheable memory

  • GPU access to cacheable memory
    • this can be used directly by a kernel or for data upload to the GPU


  • WC: write combine buffers
    • There are 4 WC buffers per core
      • One WC buffer is automatically assigned for a write operation
      • If the writes are contiguous, then it is efficient
      • If there are many noncontiguous writes, then partial WC flushes will lower the efficiency
    • The WC buffers are automatically flushed to memory when the GPU is accessed
  • Cacheable memory
    • This is the traditional L1/L2 architecture on AMD CPUs
    • CPU accesses are fast for both read and write
    • Multithreading (from multiple cores) is often necessary to saturate full bandwidth
    • When the GPU accesses this type of memory, the caches are snooped to ensure coherency.
  • USWC: Uncached Speculative Write Combine
    • CPU reads are uncached (slow); CPU writes go through the WC buffers
    • GPU access to this type of memory does not need CPU cache probing
  • Local video memory:
    • Memory managed by the graphics driver, not available to the OS for generic CPU processes
  • Memory pinning and locking
    • Operation done by the OS for access of the system pages by the GPU:
      • Make the page resident (no longer in the swap file)
      • Remove this page from regular CPU paging operation
      • Program the GPU virtual memory to map the pages into a contiguous GPU address range
  • TLB: Translation Lookaside buffer
    • A dedicated cache used to store the results of page translations (both CPU and GPU)
  • UNB: unified north bridge
    • Arbitrates memory traffic from the GPU client, and CPU cores.
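The note about partial WC flushes is really a statement about store patterns: contiguous stores let a 64-byte write-combine buffer fill completely before it drains, while scattered stores force partial flushes. The effect only materializes on USWC or frame-buffer memory, which ordinary cached allocations are not, so the sketch below (my own illustration, not from the slides) merely demonstrates the two access patterns and that they produce identical data:

```c
#include <stddef.h>

/* Contiguous fill: consecutive byte addresses, the pattern that lets each
 * 64-byte write-combine buffer fill completely before it is flushed. */
static void fill_sequential(unsigned char *buf, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        buf[i] = (unsigned char)(i & 0xffu);
}

/* Scattered fill: jump by `stride` bytes between stores, the pattern that
 * would cause partial WC flushes when the target is USWC memory. */
static void fill_strided(unsigned char *buf, size_t n, size_t stride)
{
    for (size_t off = 0; off < stride; ++off)
        for (size_t i = off; i < n; i += stride)
            buf[i] = (unsigned char)(i & 0xffu);
}

/* Both orders store the same values; only the traffic pattern seen by the
 * WC buffers differs. */
static unsigned long byte_sum(const unsigned char *buf, size_t n)
{
    unsigned long s = 0;
    for (size_t i = 0; i < n; ++i)
        s += buf[i];
    return s;
}
```

On real USWC memory the strided variant would be measurably slower; on cached memory the two differ mainly in cache-line locality.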



Quickly introducing the App Profiler and Kernel Analyzer. I couldn’t catch all the details, but here is what I could write down.


AMD App Profiler

What is the AMD App Profiler?

  • A performance analysis tool that gathers data from the OpenCL runtime and from AMD APUs and GPUs during the execution of an OpenCL application
  • Integrates into MS Visual Studio 2008 and 2010
  • Command-line utility program for Windows and Linux platforms
  • Supports OpenCL and DirectCompute
  • No code or project modification to the target application is necessary

Key features

  • API Trace View: view API input arguments and output results
  • Summary pages: find API hotspots, top ten data transfer and kernel execution operations
  • API trace analyzer: identify failed API calls, resource leaks and best practices

What can app profiler do for you?

Timeline visualization: visualize OpenCL execution in a timeline chart

  • View the number of OpenCL contexts and command queues created and the relationships between these items
  • View host and device execution
  • Determine proper synchronization

Session profile view: analyze OpenCL kernel execution for AMD Radeon GPUs

  • Collect GPU performance counters
    • The number of ALU, global and local memory instructions executed
    • GPU utilization and memory access characteristics
    • Shader Compiler VLIW packing efficiency
  • Show the kernel resource usage


AMD App Kernel Analyzer

Key Features

  • Compile, analyze and disassemble an OpenCL kernel for multiple Catalyst driver versions
  • …………


Best Performance Optimization Practices

  • Collect API trace first
  • Remove all errors and warnings reported by the API trace analyzer if possible
  • Use timeline visualization to determine possible synchronization issues
  • Find the top bottleneck of the application using the summary pages
  • Drill down the most expensive kernel using session profile view






Here are quick notes from the OpenCL introductory session at AFDS 2011. I will refine them and add details later today.


Introduction to OpenCL Programming


  • Targeting beginners of OpenCL programming
  • It is a heterogeneous world: many CPUs and many GPUs
  • The multi-million-dollar question
    • How do you avoid developing and maintaining different source code versions?
  • CPUs
    • Lower throughput, lower latency
  • GPU
    • High ALU, high memory bandwidth, higher latency
    • Bandwidth in the order of hundreds of GB/s
    • Transfer over the PCIe
  • Fusion GPU
    • DX11 class, shares system memory with CPU
    • Bandwidth in the order of tens of GB/s
    • Zero copy
  • What’s OpenCL?
    • Open specs for programming on heterogeneous systems
      • Multi-core CPUs
      • Massively parallel GPUs
      • Cell, FPGAs, etc.
    • Industry standard
    • Open specification
    • Cross-platform
      • Windows, Linux, Mac OS
    • Multi-vendor
      • AMD, Apple, Creative, IBM, …
  • Overview
    • How to execute a program on the device (GPU)?
      • Kernel
        • Performs GPU calculations
        • Reads from, and writes to memory
      • Based on C
        • Restrictions
          • No recursion, etc.
        • Additions
          • Vector data types (int4)
          • Synchronization
          • Built-in functions (sin, cos)
    • How to control the device (GPU)
      • Host program
        • C API
      • Steps
  1. Initialize the GPU
  2. Allocate memory buffers on the GPU
  3. Send data to the GPU
  4. Run the kernel on the GPU
  5. Read data back from the GPU
  • Commands are queued.
  • Kernel Anatomy
  • Work items vs work groups
  • Memory spaces
    • Memory is consistent only after work-group barriers
    • Barriers work only within a work-group; you cannot barrier globally.
  • Mapping work-groups on GPUs
    • Work-groups are distributed across SIMDs
    • At minimum, the work-group size should equal the wavefront size
    • Advanced tip: GPUs typically need multiple wavefronts per SIMD for latency hiding
  • Execution on GPUs
    • Work-groups are scheduled to SIMD engines
    • We can synchronize within a work-group
    • We cannot synchronize across work-groups
  • Execution on CPU
    • Each core gets a workgroup
  • Sample program
    • Reduction kernel to get the sum of a large array of elements
  • Why do we need a barrier?
    • To guarantee memory consistency across wavefronts
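To make the barrier’s role concrete, here is a plain-C model (my own sketch, not the session’s code) of the in-work-group tree reduction the sample program describes. The inner loops stand in for work-items executing in parallel, and the comments mark where `barrier(CLK_LOCAL_MEM_FENCE)` would sit in a real OpenCL kernel:

```c
#include <stddef.h>

#define WG_SIZE 4  /* work-group size; a real kernel gets this from the NDRange */

/* Model of one work-group's tree reduction in __local memory. The points
 * where every `lid` iteration must finish before the next loop starts are
 * exactly where the OpenCL kernel needs barrier(CLK_LOCAL_MEM_FENCE). */
static int workgroup_reduce(const int *in)
{
    int local_mem[WG_SIZE];

    for (int lid = 0; lid < WG_SIZE; ++lid)
        local_mem[lid] = in[lid];        /* each work-item loads one element */
    /* barrier: every load must be visible before any work-item reads local_mem */

    for (int stride = WG_SIZE / 2; stride > 0; stride /= 2) {
        for (int lid = 0; lid < stride; ++lid)
            local_mem[lid] += local_mem[lid + stride];
        /* barrier: the next round reads values written in this round, possibly
         * by a different wavefront, so consistency is not guaranteed without it */
    }
    return local_mem[0];                 /* work-item 0 holds the partial sum */
}

/* Host-side (or second-kernel) pass: combine the per-group partial sums,
 * since OpenCL cannot synchronize across work-groups. */
static int reduce_sum(const int *data, size_t n)
{
    int total = 0;
    for (size_t g = 0; g < n / WG_SIZE; ++g)
        total += workgroup_reduce(data + g * WG_SIZE);
    return total;
}
```

The second function also illustrates the earlier note that barriers work only within a work-group: combining partials across groups has to happen on the host or in a follow-up kernel launch.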

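Stepping back to the host side, the five setup steps listed earlier map almost one-to-one onto OpenCL host API calls. A hedged C sketch of the skeleton (generic OpenCL 1.1 usage, not code from the session; error checks omitted, and `src`, `n`, `host_data`, and the kernel name `my_kernel` are assumed placeholders):

```c
#include <CL/cl.h>

cl_int err;

/* 1. Initialize the GPU: platform, device, context, command queue. */
cl_platform_id platform;  clGetPlatformIDs(1, &platform, NULL);
cl_device_id   device;    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU,
                                         1, &device, NULL);
cl_context       ctx   = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, &err);

/* 2. Allocate a memory buffer on the GPU. */
cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, n * sizeof(float),
                            NULL, &err);

/* 3. Send data to the GPU (a command placed on the queue). */
clEnqueueWriteBuffer(queue, buf, CL_TRUE, 0, n * sizeof(float),
                     host_data, 0, NULL, NULL);

/* 4. Build and run the kernel on the GPU. */
cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
cl_kernel kernel = clCreateKernel(prog, "my_kernel", &err);
clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);
size_t global_size = n;
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, NULL,
                       0, NULL, NULL);

/* 5. Read the results back from the GPU. */
clEnqueueReadBuffer(queue, buf, CL_TRUE, 0, n * sizeof(float),
                    host_data, 0, NULL, NULL);
```

As the notes say, every `clEnqueue*` call above is a command placed on the queue; the `CL_TRUE` arguments make the transfers blocking for simplicity.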


As I’m heading to the AMD Fusion Developer Summit 2011, I thought it might be a good idea to share, for now, the sessions of the pre-summit tutorials.

  • Introduction to OpenCL
  • Advanced OpenCL and OpenGL Debugging and Profiling
  • OpenCL Application Analysis and Optimization Made Easy With AMD APP Profiler
  • Memory Model on Fusion APUs and the Benefit of Zero-Copy Approaches
  • DirectCompute Hands-On Tutorial
  • Hands on with System Dependency Analyzer (SDA) for Heterogeneous Computing
  • Performance Optimization Examples on Fusion Platforms


I’ll post my notes later today from the sessions I was able to attend.

I’ll attend AMD’s Fusion Developer Summit in Bellevue, WA from June 13th to 16th. Stay tuned for my related blog posts and my tweets during the day as well.

You can follow my Twitter Account:

Also feel free to send me your questions about it.