More notes from an interesting session about APU performance. You may not find this anywhere else.

PS: if you were at that session and have some extra content/material to post here, please let me know.

[Update: I have some performance figures but the image is not clear. I’ll try to decipher it when I get some rest]

So, what’s really new in AMD’s APUs?

One of the key parts of the system is the data path between the GPU and memory:

  • Provides low-latency access for CPU cores (optimized around caches)
    • Random access, branchy, single threaded, scalar code
  • Provides high throughput access for GPU cores (optimized around latency hiding)
    • Streaming, vectorized, massively multithreaded, data-intensive code
  • Llano introduced two new buses for the GPU to access memory:
    • AMD Fusion Compute Link (ONION):
      • This bus is used by the GPU when it needs to snoop the CPU cache, so it is a coherent bus
      • This is used for cacheable system memory
    • Radeon Memory Bus (GARLIC):
      • This bus is directly connected to memory and can saturate memory bandwidth, so is …
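
As a quick way to keep the two buses straight, here is a minimal sketch of which bus a GPU access takes for each memory type mentioned in these notes. The dictionary, the memory-type names and the function names are my own illustration, not an AMD API.

```python
# Toy lookup: which bus the GPU uses on Llano for each memory type,
# as described in these notes. All names here are illustrative only.
GPU_BUS_BY_MEMORY_TYPE = {
    "cacheable": "ONION",  # coherent path: the CPU caches are snooped
    "uswc": "GARLIC",      # no CPU cache probing needed
    "local": "GARLIC",     # direct path to memory
}


def gpu_bus_for(memory_type: str) -> str:
    """Return the name of the bus a GPU access would take."""
    key = memory_type.lower()
    if key not in GPU_BUS_BY_MEMORY_TYPE:
        raise ValueError(f"unknown memory type: {memory_type!r}")
    return GPU_BUS_BY_MEMORY_TYPE[key]


def is_coherent_with_cpu_cache(memory_type: str) -> bool:
    """Only the ONION path snoops the CPU caches."""
    return gpu_bus_for(memory_type) == "ONION"


if __name__ == "__main__":
    for mem in ("cacheable", "uswc", "local"):
        print(mem, "->", gpu_bus_for(mem))
```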

The GPU in the Llano system

  • On Llano, the GPU core is still exposed as a separate graphics engine
    • The GPU is managed by the OS via drivers
      • Leverage existing driver stacks to support the current ecosystem
    • Memory is split into regular system memory, and carved out “local memory”
    • Allows the GPU memory controller to optimize throughput and priorities of the GFX clients
  • Existing and familiar APIs can be used to access the GPU core
    • OpenCL, OpenGL, DirectX, and multimedia ………….

GPU in the system

  • Both CPU and GPU have their own set of page tables, caches and TLBs
    • The memory is generally not coherent
    • The GPU can probe the CPU cache…..
    • …. But the CPU relies on the driver for synchronization (map/unmap, lock/unlock, flush GPU caches)
  • The current programming model is a direct consequence:
    • CPU access will page fault on a single access, and the OS will page in/out on demand
    • GPU access is known upfront, and driver or OS will page in/out on scheduling. (NOT ON DEMAND)
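
To make the difference between the two paging models concrete, here is a toy model of it. It is only a sketch: the class and method names are invented for illustration and do not correspond to any real OS or driver interface.

```python
class ToyMemory:
    """Tracks which pages are resident; a stand-in for the OS pager."""

    def __init__(self):
        self.resident = set()
        self.cpu_faults = 0

    def cpu_touch(self, page):
        # CPU path: a non-resident page faults on access and is
        # paged in on demand, one page at a time.
        if page not in self.resident:
            self.cpu_faults += 1
            self.resident.add(page)

    def gpu_submit(self, pages):
        # GPU path: the working set is known upfront, so every page is
        # made resident before the work is scheduled. The GPU never faults.
        missing = set(pages) - self.resident
        self.resident |= missing
        return len(missing)  # pages brought in at scheduling time


if __name__ == "__main__":
    mem = ToyMemory()
    mem.cpu_touch(1)
    mem.cpu_touch(1)  # already resident, no second fault
    print("CPU faults:", mem.cpu_faults)
    print("paged in for GPU:", mem.gpu_submit([1, 2, 3]))
```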

What is Zero copy?

  • Many different meanings:
    • A kernel accesses system memory directly for either read or write
    • A DMA transfer accesses system memory directly without copying into USWC
    • The CPU directly writes into local memory without doing any DMA
  • OpenCL offers several mechanisms to effectively reduce extra copying
  • OpenGL has some driver optimization and some proprietary extensions
    • on Llano, this matters even more than on discrete because bandwidth is shared
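
To see why an extra copy hurts when bandwidth is shared, here is a back-of-the-envelope count of memory traffic. This is my own simplification, not something from the session: it assumes every byte moved hits the same shared DRAM and that no cache absorbs any of the traffic.

```python
def dram_traffic_bytes(payload_bytes, staged_copy):
    """Total bytes moved through shared DRAM to get a payload to a kernel.

    Assumed model: the CPU writes the payload once and the kernel reads it
    once. A staged copy adds one full read plus one full write in between.
    """
    cpu_write = payload_bytes
    kernel_read = payload_bytes
    copy = 2 * payload_bytes if staged_copy else 0
    return cpu_write + copy + kernel_read


if __name__ == "__main__":
    n = 64 * 1024 * 1024
    with_copy = dram_traffic_bytes(n, staged_copy=True)
    zero_copy = dram_traffic_bytes(n, staged_copy=False)
    print("traffic ratio (copy / zero copy):", with_copy / zero_copy)
```

Under these assumptions the staged copy doubles the traffic on a bus that the CPU and GPU both have to share.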

CPU & GPU Memory Move Scenarios

CPU access to local memory

  • CPU writes into local frame buffer
    • on Llano, this can peak at 8 GB/s
      • on discrete, this was limited by the PCIe bus to around 6 GB/s (or less)
    • the data first goes through the WC buffers on the CPU, then goes to the GPU core and goes back through the UNB to memory
  • CPU reads from local framebuffer
    • those are still very slow
      • accesses are uncached
      • only a single outstanding read is supported
  • to use this path from OpenCL, create the buffer with the CL_MEM_USE_PERSISTENT_MEM_AMD flag
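
To put the write figures in perspective, here is a small sketch that turns the quoted peak rates into upload times. The buffer size and names are my own, and I assume 1 GB = 10^9 bytes. Since these are peak rates, the results are best-case times.

```python
GB = 1e9  # assumption: 1 GB = 10^9 bytes

# Peak CPU write rates into the local frame buffer, from the notes above.
CPU_WRITE_TO_LOCAL_GBPS = {
    "llano": 8.0,
    "discrete_pcie": 6.0,
}


def upload_time_ms(num_bytes, platform):
    """Best-case time in milliseconds for the CPU to write num_bytes."""
    rate = CPU_WRITE_TO_LOCAL_GBPS[platform] * GB
    return num_bytes / rate * 1e3


if __name__ == "__main__":
    size = 256e6  # a hypothetical 256 MB buffer
    for platform in CPU_WRITE_TO_LOCAL_GBPS:
        print(f"{platform}: {upload_time_ms(size, platform):.1f} ms")
```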

CPU access to USWC memory

  • the CPU writes go through the WC
    • this avoids polluting the CPU cache, when it is known that there will be no cache hit for reads
    • this allows further access by the GPU for this memory without snooping the cache
  • CPU reads will first flush the WC, then will be uncached (slow)

CPU access to cacheable memory

  • CPU access to cacheable memory
    • this is the typical case in C++ code
    • single threaded performance: 8 GB/s for either read or write
    • multithreaded performance: 13 GB/s for either read or write
  • the memory can be accessed by the GPU
    • pages need to be made resident by the OS, and locked to prevent paging
    • physical pages need to be programmed into the GPU HW virtual memory page tables

GPU access to local memory

  • GPU reads from local framebuffer
    • this is the optimal path to memory
      • Radeon memory bus (GARLIC) avoids any cache snooping
      • memory is interleaved to increase throughput efficiency
    • kernels and shaders can saturate DRAM bandwidth (measured at 17 GB/s)
  • GPU writes to local framebuffer are similar (i.e. memcopy)
    • kernels and shaders can saturate DRAM bandwidth (measured at 13 GB/s)

GPU access to USWC memory

  • GPU access to USWC memory uses the Radeon memory bus (GARLIC)
    • memory does not have the same interleaving granularity as local memory
    • so slightly lower performance than local memory, but faster than cacheable memory
    • reads can saturate DRAM bandwidth (measured at 12 GB/s)
    • writes are similarly fast but …
      • usually avoided, however, since CPU reads are really slow from uncached space
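
To quantify "slightly lower performance than local memory", here is a small sketch comparing the measured GPU figures quoted in these notes. Only the three numbers come from the session; the table layout and function name are mine.

```python
# Measured GPU bandwidth figures from the notes, in GB/s.
GPU_MEASURED_GBPS = {
    ("local", "read"): 17.0,
    ("local", "write"): 13.0,
    ("uswc", "read"): 12.0,
}


def fraction_of_best(memory, op):
    """Bandwidth of a path relative to the fastest measured path."""
    best = max(GPU_MEASURED_GBPS.values())
    return GPU_MEASURED_GBPS[(memory, op)] / best


if __name__ == "__main__":
    for (memory, op), gbps in sorted(GPU_MEASURED_GBPS.items(),
                                     key=lambda item: -item[1]):
        pct = 100 * fraction_of_best(memory, op)
        print(f"{memory:5s} {op:5s} {gbps:4.0f} GB/s  ({pct:.0f}% of best)")
```

By these figures, GPU reads from USWC memory run at roughly 70% of the local framebuffer read rate.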

GPU access to cacheable memory

  • GPU access to cacheable memory
    • this can be used directly by a kernel or for data upload to the GPU


  • WC: write combine buffers
    • There are 4 WC buffers per core
      • One WC buffer is automatically assigned for a write operation
      • If the writes are contiguous, then it is efficient
      • If there are many noncontiguous writes, then partial WC flushes will lower the efficiency
    • The WC buffers are automatically flushed to memory when the GPU is accessed
  • Cacheable memory
    • This is the traditional L1/L2 architecture on AMD CPUs
    • CPU accesses are fast for both read and write
    • Multithreading (from multiple cores) is often necessary to saturate full bandwidth
    • When the GPU accesses this type of memory, the caches are snooped to ensure coherency.
  • USWC: Uncached speculative write combined
    • CPU reads are uncached (slow), CPU writes go through the WC buffers
    • GPU access to this type of memory does not need CPU cache probing
  • Local video memory:
    • Memory managed by the graphics driver, not available to the OS for generic CPU processes
  • Memory pinning and locking
    • Operation done by the OS for access of the system pages by the GPU:
      • Make the page resident (no longer in the swap file)
      • Remove this page from regular CPU paging operation
      • Program the GPU virtual memory to map the pages into a contiguous …
  • TLB: Translation lookaside buffer
    • A dedicated cache used to store the result of page translation (both CPU and GPU)
  • UNB: unified north bridge
    • Arbitrates memory traffic from the GPU clients and the CPU cores.
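
To illustrate why contiguous writes are efficient through the WC buffers and scattered ones are not, here is a toy simulation. Only the count of 4 buffers per core comes from the notes. The 64-byte buffer size, the oldest-first eviction policy, and the assumption that writes are aligned and never cross a buffer boundary are all mine, so treat it as a sketch rather than a model of the real hardware.

```python
from collections import OrderedDict

NUM_WC_BUFFERS = 4  # from the notes: 4 WC buffers per core
LINE_BYTES = 64     # assumption: each WC buffer covers 64 bytes


def simulate_wc(addresses, write_size=4):
    """Return (full_flushes, partial_flushes) for a stream of CPU writes."""
    open_buffers = OrderedDict()  # line index -> set of byte offsets written
    full = partial = 0
    for addr in addresses:
        line = addr // LINE_BYTES
        if line not in open_buffers:
            if len(open_buffers) == NUM_WC_BUFFERS:
                # No free buffer: evict the oldest one. Full buffers are
                # flushed as soon as they fill, so this one is partial.
                open_buffers.popitem(last=False)
                partial += 1
            open_buffers[line] = set()
        start = addr % LINE_BYTES
        end = min(start + write_size, LINE_BYTES)
        open_buffers[line].update(range(start, end))
        if len(open_buffers[line]) == LINE_BYTES:
            del open_buffers[line]  # buffer filled: one efficient full flush
            full += 1
    # Whatever is still open is flushed at the end (for example when the
    # GPU is accessed), and those buffers are only partially filled.
    partial += len(open_buffers)
    return full, partial


if __name__ == "__main__":
    print("contiguous:", simulate_wc(range(0, 4096, 4)))   # (64, 0)
    print("scattered: ", simulate_wc(range(0, 4096, 64)))  # (0, 64)
```

In this model, sequential 4-byte writes over 4 KB produce only full flushes, while one 4-byte write per 64-byte line produces only partial flushes.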