Here are quick notes from the OpenCL introductory session at AFDS 2011. I will refine them and add details later today.
Introduction to OpenCL Programming
- Targeted at beginners in OpenCL programming
- It is a heterogeneous world: many CPUs and many GPUs
The multi-million dollar question
- How do we avoid developing and maintaining different source code versions?
CPU
- Lower throughput, lower latency
Discrete GPU
- High ALU count, high memory bandwidth, higher latency
- Bandwidth on the order of hundreds of GB/s
- Data must be transferred over PCIe
Fusion APU (integrated GPU)
- DX11 class, shares system memory with the CPU
- Bandwidth on the order of tens of GB/s
- Zero copy (no PCIe transfer needed)
What is OpenCL?
An open specification for programming on heterogeneous systems
- Multi-core CPUs
- Massively parallel GPUs
- Cell, FPGAs, etc.
- Industry standard
- Open specification
- Windows, Linux, Mac OS X
- AMD, Apple, Creative, IBM, …
How to execute a program on the device (GPU)?
Write a kernel, which:
- Performs the GPU calculations
- Reads from, and writes to, memory
Based on C
- No recursion, etc.
- Vector data types (e.g., int4)
- Built-in functions (sin, cos, …)
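A minimal sketch of what such a kernel looks like (my own illustration, not code from the session; this is OpenCL C device code, compiled at run time by the OpenCL driver, so it does not build as ordinary host C):

```
// OpenCL C device code (sketch; needs an OpenCL runtime to build).
// __kernel marks a device entry point; __global marks device memory.
__kernel void demo(__global const float *in, __global float *out)
{
    size_t i = get_global_id(0);      // index of this work-item
    int4 v = (int4)(0, 1, 2, 3);      // a vector data type, as in the notes
    out[i] = sin(in[i]) + (float)v.x; // sin() is an OpenCL built-in function
}
```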
How to control the device (GPU)?
- C API
- Initialize the GPU
- Allocate memory buffers on GPU
- Send data to GPU
- Launch the kernel on the GPU
- Read data from GPU
- Commands are queued.
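The queued-command flow above can be sketched with the OpenCL C host API (my own minimal sketch, assuming an OpenCL runtime is installed; error checking is omitted, and `kernel_src` / `"my_kernel"` are placeholder names, not from the session):

```
// Minimal OpenCL host-side flow (error checks omitted for brevity).
#include <CL/cl.h>

void run(const char *kernel_src, float *host_in, float *host_out, size_t n)
{
    cl_platform_id platform;
    cl_device_id device;
    cl_int err;

    // 1. Initialize the GPU
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
    cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, &err);

    // Build the kernel from source at run time
    cl_program prog = clCreateProgramWithSource(ctx, 1, &kernel_src, NULL, &err);
    clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
    cl_kernel kernel = clCreateKernel(prog, "my_kernel", &err);

    // 2. Allocate memory buffers on the GPU
    cl_mem in  = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  n * sizeof(float), NULL, &err);
    cl_mem out = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, n * sizeof(float), NULL, &err);

    // 3. Send data to the GPU (commands are queued on the command queue)
    clEnqueueWriteBuffer(queue, in, CL_TRUE, 0, n * sizeof(float), host_in, 0, NULL, NULL);

    // 4. Launch the kernel
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &in);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &out);
    size_t global = n;
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL, 0, NULL, NULL);

    // 5. Read data back from the GPU (a blocking read waits for the queue)
    clEnqueueReadBuffer(queue, out, CL_TRUE, 0, n * sizeof(float), host_out, 0, NULL, NULL);
}
```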
Kernel anatomy
- Work-items vs. work-groups
- Memory is consistent only after barriers
- Barriers work only within a work-group; there is no global barrier
Mapping work-groups on GPUs
- Work-groups are distributed across SIMDs
- At minimum, the work-group size should equal the wavefront size
- Advanced tip: GPUs typically need multiple Wavefronts per SIMD for latency hiding
Execution on GPUs
- Work-groups are scheduled to SIMD engines
- We can synchronize within a work-group
- Cannot synchronize across work-groups
Execution on CPU
- Each core gets a workgroup
- Example: a reduction kernel to compute the sum of a large array of elements
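The work-group part of that reduction can be sketched in OpenCL C (my own illustration, not code from the session; it assumes a power-of-two work-group size, and each work-group writes one partial sum that the host then adds up):

```
// OpenCL C sketch: per-work-group tree reduction in local memory.
// scratch is a __local buffer sized to the work-group (set from the host).
__kernel void reduce_sum(__global const float *in,
                         __global float *partial,
                         __local float *scratch)
{
    size_t lid = get_local_id(0);
    scratch[lid] = in[get_global_id(0)];
    barrier(CLK_LOCAL_MEM_FENCE);            // make all loads visible

    for (size_t s = get_local_size(0) / 2; s > 0; s >>= 1) {
        if (lid < s)
            scratch[lid] += scratch[lid + s];
        barrier(CLK_LOCAL_MEM_FENCE);        // finish one tree level
    }                                        // before starting the next
    if (lid == 0)
        partial[get_group_id(0)] = scratch[0];
}
```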
Why do we need a barrier?
- To guarantee memory consistency across wavefronts within a work-group