Here are some quick notes from the OpenCL introductory session at AFDS 2011. I will refine them and add details later today.


Introduction to OpenCL Programming


  • Targeting beginners of OpenCL programming
  • It is a heterogeneous world: many CPUs and many GPUs
  • The multi-million dollar questions
    • How do we avoid developing and maintaining different source-code versions?
  • CPUs
    • Lower throughput, lower latency
  • GPU
    • High ALU throughput, high memory bandwidth, higher latency
    • Bandwidth in the order of hundreds of GB/s
    • Data must be transferred over PCIe
  • Fusion GPU
    • DX11 class, shares system memory with CPU
    • Bandwidth in the order of tens of GB/s
    • Zero copy
  • What’s OpenCL?
    • Open specs for programming on heterogeneous systems
      • Multi-core CPUs
      • Massively parallel GPUs
      • Cell, FPGAs, etc.
    • Industry standard
    • Open specification
    • Cross-platform
      • Windows, Linux, Mac OS
    • Multi-vendor
      • AMD, Apple, Creative, IBM, ……
  • Overview
    • How to execute a program on the device (GPU)?
      • Kernel
        • Performs GPU calculations
        • Reads from, and writes to memory
      • Based on C
        • Restrictions
          • No recursion, etc.
        • Additions
          • Vector data types (e.g., int4)
          • Synchronization
          • Built-in functions (sin, cos)
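A minimal sketch of what those additions look like in an OpenCL C kernel (this is device code, not host-runnable on its own; the kernel name and arguments are made up for illustration):

```c
// OpenCL C device code (illustrative): vector types and built-in functions.
__kernel void scale_and_rotate(__global float4 *data, float angle)
{
    int i = get_global_id(0);      // each work-item handles one element
    float4 v = data[i];            // float4 vector type (like int4)
    float s = sin(angle);          // built-in math functions
    float c = cos(angle);
    data[i] = (float4)(v.x * c - v.y * s,
                       v.x * s + v.y * c,
                       v.z, v.w);
}
```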
    • How to control the device (GPU)
      • Host program
        • C API
      • Steps
        1. Initialize the GPU
        2. Allocate memory buffers on the GPU
        3. Send data to the GPU
        4. Run the kernel on the GPU
        5. Read the results back from the GPU
  • Commands are queued in a command queue.
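The five host-side steps above map roughly onto the OpenCL C API like this (a sketch only; error checking is omitted, the kernel name "my_kernel" is made up, and it assumes an OpenCL 1.1-era setup with `src` holding the kernel source string):

```c
#include <CL/cl.h>

void run(const char *src, float *in, float *out, size_t n)
{
    /* 1. Initialize the GPU */
    cl_platform_id platform; cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
    cl_command_queue q = clCreateCommandQueue(ctx, device, 0, NULL);

    /* 2. Allocate memory buffers on the GPU */
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE,
                                n * sizeof(float), NULL, NULL);

    /* 3. Send data to the GPU */
    clEnqueueWriteBuffer(q, buf, CL_TRUE, 0, n * sizeof(float),
                         in, 0, NULL, NULL);

    /* Build the kernel from source */
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
    cl_kernel k = clCreateKernel(prog, "my_kernel", NULL);
    clSetKernelArg(k, 0, sizeof(cl_mem), &buf);

    /* 4. Run the kernel on the GPU (the enqueue calls all go
       through the command queue q) */
    clEnqueueNDRangeKernel(q, k, 1, NULL, &n, NULL, 0, NULL, NULL);

    /* 5. Read the results back from the GPU */
    clEnqueueReadBuffer(q, buf, CL_TRUE, 0, n * sizeof(float),
                        out, 0, NULL, NULL);
}
```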
  • Kernel Anatomy
  • Work items vs work groups
  • Memory spaces
    • Memory is consistent only after barriers
    • Barriers work only within a work-group. There is no global barrier across work-groups.
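A quick OpenCL C illustration of the barrier rule (device code; kernel name and logic are made up for the sketch):

```c
// Illustrative: a barrier makes __local memory consistent within one
// work-group.  There is no barrier that spans work-groups.
__kernel void neighbor_diff(__global const float *in, __global float *out,
                            __local float *tile)
{
    int lid = get_local_id(0);
    int gid = get_global_id(0);

    tile[lid] = in[gid];            // each work-item stages one value
    barrier(CLK_LOCAL_MEM_FENCE);   // wait until the whole tile is written

    // Safe to read a neighbor's value now -- but only within this work-group
    if (lid > 0)
        out[gid] = tile[lid] - tile[lid - 1];
}
```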
  • Mapping work-groups on GPUs
    • Work-groups are distributed across SIMDs
    • At minimum, the work-group size should equal the Wavefront size
    • Advanced tip: GPUs typically need multiple Wavefronts per SIMD for latency hiding
  • Execution on GPUs
    • Work groups are scheduled to SIMDs engines
    • We can synchronize within work group
    • Cannot synch across work groups
  • Execution on CPU
    • Each core gets a workgroup
  • Sample program
    • Reduction kernel to get the sum of a large array of elements
  • Why do we need a barrier?
    • To guarantee memory consistency across Wavefronts