Folks,
More notes from an interesting session about APU performance. You may not find this anywhere else.
PS: if you were at that session and have some extra content/material to post here, please let me know.
[Update: I have some performance figures but the image is not clear. I'll try to decipher it when I get some rest]
So, what’s really new in AMD’s APUs?
One of the key parts of the system is the data path between the GPU and memory
-
Provides low latency access for CPU cores (optimized around caches)
- Random access, branchy, single threaded, scalar code
-
Provides high throughput access for GPU cores (optimized around latency hiding)
- Streaming, vectorized, massively multithreaded, data-intensive code
-
Llano introduced two new buses for the GPU to access memory:
-
AMD Fusion Compute Link (ONION):
- This bus is used by the GPU when it needs to snoop the CPU cache, so it is a coherent bus
-
This is used for cacheable system memory
-
Radeon Memory Bus (GARLIC):
- This bus is directly connected to memory and can saturate memory bandwidth, so it is non-coherent
-
-
The GPU in the Llano system
-
On Llano, the GPU core is still exposed as a separate graphics engine
-
The GPU is managed by the OS via drivers
- Leverage existing driver stacks to support the current ecosystem
- Memory is split into regular system memory, and carved out “local memory”
- Allows the GPU memory controller to optimize throughput and priorities of the GFX clients
-
-
Existing and familiar APIs can be used to access the GPU core
- OpenCL, OpenGL, DirectX, and multimedia ………….
GPU in the system
-
Both the CPU and the GPU have their own set of page tables, caches and TLBs
- The memory is generally not coherent
- The GPU can probe the CPU cache…..
- …. But the CPU relies on the driver for synchronization (map/unmap, lock/unlock, flush GPU caches)
-
The current programming model is a direct consequence:
- CPU access will page fault on a single access, and the OS will page in/out on demand
- GPU access is known upfront, and the driver or OS will page in/out at scheduling time (NOT on demand)
What is Zero copy?
-
Many different meanings:
- A kernel accesses system memory directly for either read or write
- A DMA transfer accesses system memory directly without copying into USWC
- The CPU directly writes into local memory without doing any DMA
- OpenCL offers several mechanisms to effectively reduce extra copying
-
OpenGL has some driver optimization and some proprietary extensions
- on Llano, this matters even more than on discrete GPUs because bandwidth is shared
CPU & GPU Memory Move Scenarios
CPU access to local memory
-
CPU writes into local frame buffer
-
on Llano, this can peak at 8 GB/s
- on discrete, this was limited by the PCIe bus to around 6 GB/s (or less)
- the data first goes through the WC buffers on the CPU, then goes to the GPU core, and goes back through the UNB to memory
-
-
CPU reads from local framebuffer
-
those are still very slow
- accesses are uncached
- only a single outstanding read is supported
- create the buffer with CL_MEM_USE_PERSISTENT_MEM_AMD flag (OpenCL)
-
CPU access to USWC memory
-
the CPU writes go through the WC
- this avoids polluting the CPU cache, when it is known that there will be no cache hit for reads
- this allows further access by the GPU for this memory without snooping the cache
- CPU reads will first flush the WC, then will be uncached (slow)
CPU access to cacheable memory
-
CPU access to cacheable memory
- this is the typical case in C++ code
- single-threaded performance: 8 GB/s for either read or write
- multithreaded performance: 13 GB/s for either read or write
-
the memory can be accessed by the GPU
- pages need to be made resident by the OS, and locked to prevent paging
- physical pages need to be programmed into the GPU HW virtual memory page tables
GPU access to local memory
-
GPU reads from local framebuffer
-
this is the optimal path to memory
- Radeon memory bus (GARLIC) avoids any cache snooping
- memory is interleaved to increase throughput efficiency
- kernels and shaders can saturate DRAM bandwidth (measured at 17 GB/s)
-
-
GPU writes to local framebuffer are similar (i.e. memcopy)
- kernels and shaders can saturate DRAM bandwidth (measured at 13 GB/s)
GPU access to USWC memory
-
GPU access to USWC memory uses the Radeon memory bus (GARLIC)
- memory does not have the same interleaving granularity as local memory
- so slightly lower performance than local memory, but faster than cacheable memory
- reads can saturate DRAM bandwidth (measured at 12 GB/s)
-
writes are similarly fast but …
- usually avoided, however, since CPU reads are really slow from uncached space
GPU access to cacheable memory
-
GPU access to cacheable memory
- this can be used directly by a kernel or for data upload to the GPU
Terminology
-
WC: write combine buffers
-
There are 4 WC buffers per core
- One WC buffer is automatically assigned for a write operation
- If the writes are contiguous, then it is efficient
- If there are many noncontiguous writes, then partial WC flushes will lower the efficiency
- The WC buffers are automatically flushed to memory when the GPU is accessed
-
-
Cacheable memory
- This is the traditional L1/L2 architecture on AMD CPUs
- CPU accesses are fast for both read and write
- Multithreading (from multiple cores) is often necessary to saturate full bandwidth
- When the GPU accesses this type of memory, the caches are snooped to ensure coherency.
-
USWC: Uncached speculative write combined
- CPU reads are uncached (slow), CPU writes go through the WC buffers
- GPU access to this type of memory does not need CPU cache probing
-
Local video memory:
- Memory managed by the graphics driver, not available to the OS for generic CPU processes
-
Memory pinning and locking
-
Operation done by the OS for access of the system pages by the GPU:
- Make the page resident (no longer in the swap file)
- Remove this page from regular CPU paging operation
- Program the GPU virtual memory to map the pages into a contiguous GPU virtual address range
-
-
TLB: Translation Lookaside buffer
- A dedicated cache used to store the result of page translation (both CPU and GPU)
-
UNB: unified north bridge
- Arbitrates memory traffic from the GPU client, and CPU cores.
June 17, 2011 at 8:48 pm
Thank you. I missed this session.
June 18, 2011 at 8:41 am
Thanks Colin,
Stay tuned. I’ll post a few blog posts commenting on several sessions and revealing new content.
November 5, 2011 at 6:34 am
e-like.ro…
[...]Memory Model on Fusion APUs and the Benefit of Zero-Copy Approaches « Personal Blog of Mohamed F. Ahmed[...]…