GPGPU



Posting the slides as they come out!
Speaker: Eric Demers, AMD corporate VP and CTO, Graphics Division

[31 slide photos from this session, timestamps 20110616-102938 through 20110616-111832]


Here are the slides I could capture in this session.

[12 slide photos, timestamps 20110616-093636 through 20110616-101046]

[15 slide photos, timestamps 20110615-010317 through 20110615-012409]


Here are the slides.

[32 slide photos, timestamps 20110615-105032 through 20110615-111836]


Here are the day 2 keynote slides. To get them up quickly, I just posted them from my iPhone.

[26 keynote slide photos, timestamps 20110615-084650 through 20110615-092635]


Folks,

I’m out of energy to write and post right now, so I decided to post photos of the slides from this panel discussion. The panelists were:

  • Richard Vuduc, Georgia Tech
  • Wu-Chun Feng, Virginia Tech
  • Charles Moore, AMD


Folks,

More notes from an interesting session about APU performance. You may not find this anywhere else.

PS: if you were at that session and have some extra content/material to post here, please let me know.

[Update: I have some performance figures but the image is not clear. I’ll try to decipher it when I have some rest]

So, what’s really new in AMD’s APUs?

One of the key parts of the system is the data path between the GPU and memory:

  • Provides low-latency access for CPU cores (optimized around caches)
    • Random-access, branchy, single-threaded, scalar code
  • Provides high-throughput access for GPU cores (optimized around latency hiding)
    • Streaming, vectorized, massively multithreaded, data-intensive code
  • Llano introduced two new buses for the GPU to access memory:
    • AMD Fusion Compute Link (ONION):
      • This bus is used by the GPU when it needs to snoop the CPU cache, so it is a coherent bus
      • It is used for cacheable system memory
    • Radeon Memory Bus (GARLIC):
      • This bus is connected directly to memory and can saturate memory bandwidth, so it is a non-coherent bus (no cache snooping)

The GPU in the Llano system

  • On Llano, the GPU core is still exposed as a separate graphics engine
    • The GPU is managed by the OS via drivers
      • Leverages existing driver stacks to support the current ecosystem
    • Memory is split into regular system memory and carved-out “local memory”
    • This allows the GPU memory controller to optimize throughput and priorities of the GFX clients
  • Existing and familiar APIs can be used to access the GPU core
    • OpenCL, OpenGL, DirectX, and multimedia …

GPU in the system

  • Both the CPU and GPU have their own set of page tables, caches and TLBs
    • The memory is generally not coherent
    • The GPU can probe the CPU cache…
    • … but the CPU relies on the driver for synchronization (map/unmap, lock/unlock, flush GPU caches)
  • The current programming model is a direct consequence:
    • CPU access will page fault on a single access, and the OS will page in/out on demand
    • GPU access is known up front, and the driver or OS will page in/out at scheduling time (not on demand)

What is Zero copy?

  • Many different meanings:
    • A kernel accesses system memory directly for either read or write
    • A DMA transfer accesses system memory directly without copying into USWC
    • The CPU writes directly into local memory without doing any DMA
  • OpenCL offers several mechanisms to effectively reduce extra copying (a sketch follows below)
  • OpenGL has some driver optimizations and some proprietary extensions
    • On Llano, this matters even more than on discrete GPUs because bandwidth is shared
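As a rough illustration of what zero copy looks like from the OpenCL host side (a minimal sketch, not from the slides; the context, queue, kernel and function name are all assumptions), a buffer created with CL_MEM_ALLOC_HOST_PTR can be filled by the CPU through a map/unmap pair and then read in place by a kernel:

    #include <CL/cl.h>

    /* Minimal sketch (illustrative names): create a host-visible buffer, fill it on
     * the CPU through a map/unmap pair, and pass it to a kernel so the GPU can read
     * it in place instead of going through an extra staging copy. */
    static cl_mem make_zero_copy_input(cl_context ctx, cl_command_queue queue,
                                       cl_kernel kernel, size_t n_floats)
    {
        cl_int err;
        size_t bytes = n_floats * sizeof(float);

        /* CL_MEM_ALLOC_HOST_PTR asks the runtime for host-accessible memory. */
        cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR,
                                    bytes, NULL, &err);

        /* Map for CPU writes; the returned pointer is written directly by the CPU. */
        float *p = (float *)clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_WRITE,
                                               0, bytes, 0, NULL, NULL, &err);
        for (size_t i = 0; i < n_floats; ++i)
            p[i] = (float)i;

        /* Unmap before the GPU touches the buffer; the driver synchronizes as needed. */
        clEnqueueUnmapMemObject(queue, buf, p, 0, NULL, NULL);
        clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);
        return buf;
    }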

CPU & GPU Memory Move Scenarios

CPU access to local memory

  • CPU writes into the local frame buffer
    • On Llano, this can peak at 8 GB/s
      • On discrete GPUs, this was limited by the PCIe bus to around 6 GB/s (or less)
    • The data first goes through the WC buffers on the CPU, then goes to the GPU core and back through the UNB to memory
  • CPU reads from the local frame buffer
    • These are still very slow
      • Accesses are uncached
      • Only a single outstanding read is supported
      • Create the buffer with the CL_MEM_USE_PERSISTENT_MEM_AMD flag (OpenCL); see the sketch after this list
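For reference, a minimal sketch (not from the slides) of how the CL_MEM_USE_PERSISTENT_MEM_AMD path might be used. The flag comes from AMD's OpenCL extension header; the context, queue and function name are assumptions:

    #include <string.h>
    #include <CL/cl.h>
    #include <CL/cl_ext.h>  /* AMD extension header declaring CL_MEM_USE_PERSISTENT_MEM_AMD */

    /* Minimal sketch (illustrative names): a buffer placed in GPU local memory that
     * the CPU writes through the write-combine path described above. */
    static cl_mem upload_via_persistent_mem(cl_context ctx, cl_command_queue queue,
                                            const void *src, size_t bytes)
    {
        cl_int err;
        cl_mem buf = clCreateBuffer(ctx,
                                    CL_MEM_READ_ONLY | CL_MEM_USE_PERSISTENT_MEM_AMD,
                                    bytes, NULL, &err);

        /* Map and write sequentially: contiguous writes keep the WC buffers efficient.
         * Avoid reading back through this mapping; uncached CPU reads are very slow. */
        void *p = clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_WRITE,
                                     0, bytes, 0, NULL, NULL, &err);
        memcpy(p, src, bytes);
        clEnqueueUnmapMemObject(queue, buf, p, 0, NULL, NULL);
        return buf;
    }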

CPU access to USWC memory

  • CPU writes go through the WC buffers
    • This avoids polluting the CPU cache when it is known that there will be no cache hits for reads
    • This allows later GPU access to this memory without snooping the CPU cache
  • CPU reads will first flush the WC buffers, and are then uncached (slow)

CPU access to cacheable memory

  • CPU access to cacheable memory
    • This is the typical case in C++ code
    • Single-threaded performance: 8 GB/s for either read or write
    • Multithreaded performance: 13 GB/s for either read or write
  • The memory can also be accessed by the GPU (see the sketch after this list)
    • Pages need to be made resident by the OS, and locked to prevent paging
    • The physical pages need to be programmed into the GPU HW virtual memory page tables
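A minimal sketch (not from the slides; names are illustrative) of handing such cacheable memory to the GPU from OpenCL: wrapping an ordinary malloc'd array with CL_MEM_USE_HOST_PTR lets the runtime pin the pages and map them for GPU access, though the exact behavior is implementation-specific:

    #include <CL/cl.h>

    /* Minimal sketch (illustrative names): expose an ordinary, cacheable host array
     * to the GPU. With CL_MEM_USE_HOST_PTR the runtime can make the pages resident,
     * lock them, and map them into the GPU page tables rather than copying
     * (implementations are allowed to cache a copy). */
    static cl_mem wrap_host_array(cl_context ctx, float *host_array, size_t n_floats)
    {
        cl_int err;
        return clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
                              n_floats * sizeof(float), host_array, &err);
    }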

GPU access to local memory

  • GPU reads from the local frame buffer
    • This is the optimal path to memory
      • The Radeon Memory Bus (GARLIC) avoids any cache snooping
      • Memory is interleaved to increase throughput efficiency
    • Kernels and shaders can saturate DRAM bandwidth (measured at 17 GB/s)
  • GPU writes to the local frame buffer are similar (i.e. memcopy)
    • Kernels and shaders can saturate DRAM bandwidth (measured at 13 GB/s)

GPU access to USWC memory

  • GPU access to USWC memory uses the Radeon Memory Bus (GARLIC)
    • This memory does not have the same interleaving granularity as local memory
    • So performance is slightly lower than local memory, but faster than cacheable memory
    • Reads can saturate DRAM bandwidth (measured at 12 GB/s)
    • Writes are similarly fast but …
      • usually avoided, since subsequent CPU reads are really slow from uncached space

GPU access to cacheable memory

  • GPU access to cacheable memory
    • This can be used directly by a kernel or for data upload to the GPU (see the sketch below)
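A minimal sketch of those two uses (not from the slides; all names, including the buffers and kernel, are assumptions): the host-resident buffer is either set directly as a kernel argument, or used as the source of an upload into a local-memory buffer:

    #include <CL/cl.h>

    /* Minimal sketch (illustrative names) of the two uses above: (a) the kernel reads
     * the cacheable host buffer in place (snooping the CPU caches), or (b) the data
     * is uploaded from cacheable system memory into a device-local buffer. */
    static void use_cacheable_memory(cl_command_queue queue, cl_kernel kernel,
                                     cl_mem host_buf, cl_mem local_buf,
                                     const void *host_ptr, size_t bytes)
    {
        /* (a) direct use by a kernel */
        clSetKernelArg(kernel, 0, sizeof(cl_mem), &host_buf);

        /* (b) data upload into local memory (blocking write for simplicity) */
        clEnqueueWriteBuffer(queue, local_buf, CL_TRUE, 0, bytes, host_ptr,
                             0, NULL, NULL);
    }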

Terminology

  • WC: write-combine buffers
    • There are 4 WC buffers per core
      • One WC buffer is automatically assigned for a write operation
      • If the writes are contiguous, then it is efficient
      • If there are many noncontiguous writes, then partial WC flushes will lower the efficiency
    • The WC buffers are automatically flushed to memory when the GPU is accessed
  • Cacheable memory
    • This is the traditional L1/L2 architecture on AMD CPUs
    • CPU accesses are fast for both read and write
    • Multithreading (from multiple cores) is often necessary to saturate the full bandwidth
    • When the GPU accesses this type of memory, the caches are snooped to ensure coherency
  • USWC: uncached speculative write-combined
    • CPU reads are uncached (slow); CPU writes go through the WC buffers
    • GPU access to this type of memory does not need CPU cache probing
  • Local video memory
    • Memory managed by the graphics driver, not available to the OS for generic CPU processes
  • Memory pinning and locking
    • Operations done by the OS so that system pages can be accessed by the GPU:
      • Make the page resident (no longer in the swap file)
      • Remove the page from regular CPU paging operation
      • Program the GPU virtual memory to map the pages into a contiguous GPU virtual address range
  • TLB: translation lookaside buffer
    • A dedicated cache used to store the results of page translations (both CPU and GPU)
  • UNB: unified north bridge
    • Arbitrates memory traffic from the GPU clients and the CPU cores
