OpenCL was initially developed by Apple, which holds the trademark rights, in collaboration with technical teams at AMD, IBM, Intel, Nvidia, Ericsson, Nokia, Texas Instruments, and Motorola. Apple submitted this initial proposal to the Khronos Group. On June 16, 2008, the Khronos Compute Working Group was formed with representatives from CPU, GPU, embedded-processor, and software companies. This group worked for five months to finish the technical details. On December 8, 2008, OpenCL was released.


If you’ve never heard of OpenCL, you need to stop whatever you’re doing and read ahead. We all know that multi-threaded applications have not been as abundant as we had hoped, and of the precious few applications that are multi-core aware, few leverage the full potential of even two cores. That is why OpenCL was developed: to standardize parallel programming and execution.

OpenCL shares a range of computational interfaces with its two competitors, Nvidia’s Compute Unified Device Architecture (CUDA) and Microsoft’s DirectCompute.

What is OpenCL?

OpenCL is a framework for writing programs that execute across heterogeneous platforms consisting of CPUs, GPUs, Cell processors, DSPs, and other processors. It includes a language for writing kernels (functions that run on OpenCL devices, not operating-system kernels), plus APIs that are used to define and then control the platforms.

The Khronos Group hopes that OpenCL will do for multi-core what OpenGL did for graphics and what OpenAL is beginning to do for audio, and that’s exactly what OpenCL achieved. OpenCL improved speed for a wide spectrum of applications, from gaming and entertainment to scientific and medical software.

The following video shows how much OpenCL can speed up the execution of an application.

OpenCL Demo

How does OpenCL work?

OpenCL includes a language for writing compute kernels and APIs for creating and managing these kernels. The kernels are compiled on the fly by a runtime compiler, during host application execution, for the targeted device. This enables the host application to take advantage of all the compute devices in the system.

Platform Model

One of OpenCL’s strengths is that its platform model does not specify exactly what hardware constitutes a compute device. Thus, a compute device may be a GPU or a CPU.

OpenCL sees today’s heterogeneous world through the lens of an abstract, hierarchical platform model. In this model, a host coordinates execution, transferring data to and from an array of Compute Devices. Each Compute Device is composed of an array of Compute Units, and each Compute Unit is composed of an array of Processing Elements.
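To make the hierarchy concrete, here is a minimal sketch in plain C that models a host with its compute devices, compute units, and processing elements. The type and function names (compute_unit, compute_device, host_platform, total_processing_elements) are made up for illustration and are not part of the OpenCL API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of the platform hierarchy, not OpenCL API types. */
typedef struct {
    size_t processing_elements;   /* processing elements in this compute unit */
} compute_unit;

typedef struct {
    const compute_unit *units;    /* array of compute units */
    size_t unit_count;
} compute_device;

typedef struct {
    const compute_device *devices; /* array of compute devices seen by the host */
    size_t device_count;
} host_platform;

/* Walks host -> devices -> compute units and sums the processing elements. */
size_t total_processing_elements(const host_platform *platform)
{
    assert(platform != NULL);
    size_t total = 0;
    for (size_t d = 0; d < platform->device_count; ++d) {
        const compute_device *dev = &platform->devices[d];
        for (size_t u = 0; u < dev->unit_count; ++u)
            total += dev->units[u].processing_elements;
    }
    return total;
}
```

Nothing in the model says whether a device is a GPU or a CPU, which is the point made above: a GPU might appear as a few compute units with many processing elements each, while a multi-core CPU might appear as several compute units with one each.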

OpenCL Anatomy

The platform layer API gives the developer access to routines that query for the number and types of devices in the system. The developer can then select and initialize the compute devices needed to run their workload properly. It is at this layer that compute contexts and command queues for job submission and data transfer requests are created.

The runtime API allows the developer to queue up compute kernels for execution and is responsible for managing the compute and memory resources in the OpenCL system.

OpenCL Memory Model
OpenCL defines four memory spaces: private, local, constant, and global.

Private memory is memory that can only be used by a single work-item. This is similar to the registers of a single compute unit or a single CPU core.

Local memory is memory that can be used by the work-items in a work-group. This is similar to the local data share that is available on the current generation of AMD GPUs.

Constant memory is memory that can be used to store constant data for read-only access by all of the compute units in the device during the execution of a kernel. The host processor is responsible for allocating and initializing the memory objects that reside in this memory space. This is similar to the constant caches that are available on AMD GPUs.

Global memory is memory that can be used by all the compute units on the device. This is similar to the off-chip GPU memory that is available on AMD GPUs.


The Execution Model

There are three basic components of executable code in OpenCL: kernels, programs, and the kernel execution instances that applications queue.

A compute kernel is the basic unit of executable code and can be thought of as similar to a C function. Each running instance of a kernel is called a work-item, and each work-item has a unique ID.

Execution of such kernels can proceed either in-order or out-of-order depending on the parameters passed to the system when queuing up the kernel for execution. Events are provided so that the developer can check on the status of outstanding kernel execution requests and other runtime requests.

In terms of organization, the execution domain of a kernel is defined by an N-dimensional computation domain. This lets the system know how large of a problem the user would like the kernel to be applied to.

Each element in the execution domain is a work-item and OpenCL provides the ability to group together work-items into work-groups for synchronization and communication purposes.
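The relationship between a work-item and its work-group can be sketched in a few lines of plain C. This is only an illustration for one dimension, assuming a zero global offset and a global size that is an exact multiple of the work-group size; the names work_item_pos, locate_work_item, and num_work_groups are hypothetical and not OpenCL functions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper type: where a work-item sits inside the domain. */
typedef struct {
    size_t group_id;   /* which work-group the work-item belongs to */
    size_t local_id;   /* position of the work-item inside that group */
} work_item_pos;

/* Splits a global ID so that
   global_id == group_id * local_size + local_id. */
work_item_pos locate_work_item(size_t global_id, size_t local_size)
{
    assert(local_size > 0);
    work_item_pos pos;
    pos.group_id = global_id / local_size;
    pos.local_id = global_id % local_size;
    return pos;
}

/* Number of work-groups needed to cover the whole domain. */
size_t num_work_groups(size_t global_size, size_t local_size)
{
    assert(local_size > 0 && global_size % local_size == 0);
    return global_size / local_size;
}
```

For example, with 16 work-items split into groups of 4, the work-item with global ID 10 lands in work-group 2 at local position 2.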

Executing Kernels, Work-Groups and Work-Items

A program is a collection of kernels and other functions.

Applications queue kernel execution instances: the instances are queued in order and then executed either in order or out of order.
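The idea of queuing kernel execution instances can be modelled in plain C. The following is only a toy sketch of an in-order queue, not the OpenCL API: the types command and command_queue, the functions enqueue_kernel and finish_in_order, the fixed capacity, and the two example kernels are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define QUEUE_CAPACITY 8   /* hypothetical fixed capacity for this sketch */

/* A "kernel" here is just a C function called once per work-item. */
typedef void (*kernel_fn)(float *data, size_t global_id);

typedef struct {
    kernel_fn kernel;      /* code to run */
    float *data;           /* memory the kernel works on */
    size_t global_size;    /* number of work-items to launch */
} command;

typedef struct {
    command commands[QUEUE_CAPACITY];
    size_t count;
} command_queue;

/* Queues one kernel execution instance; returns 0 on success, -1 if full. */
int enqueue_kernel(command_queue *q, kernel_fn k, float *data, size_t global_size)
{
    assert(q != NULL && k != NULL);
    if (q->count == QUEUE_CAPACITY)
        return -1;
    q->commands[q->count].kernel = k;
    q->commands[q->count].data = data;
    q->commands[q->count].global_size = global_size;
    q->count++;
    return 0;
}

/* In-order execution: commands run in exactly the order they were queued. */
void finish_in_order(command_queue *q)
{
    assert(q != NULL);
    for (size_t c = 0; c < q->count; ++c)
        for (size_t gid = 0; gid < q->commands[c].global_size; ++gid)
            q->commands[c].kernel(q->commands[c].data, gid);
    q->count = 0;
}

/* Two tiny example kernels. */
void add_one(float *data, size_t gid)   { data[gid] += 1.0f; }
void times_two(float *data, size_t gid) { data[gid] *= 2.0f; }
```

Queuing add_one and then times_two on the same data gives (x + 1) * 2 for every element. An out-of-order queue would be free to run independent commands in a different order, which is why events are needed to express dependencies between them.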

Since OpenCL is meant to target not only GPUs but also other accelerators, such as multi-core CPUs, flexibility is given in the type of compute kernel that is specified. Compute kernels can be thought of either as data-parallel, which is well-matched to the architecture of GPUs, or task-parallel, which is well-matched to the architecture of CPUs.

Data parallelism:

Focuses on distributing the data across different parallel computing nodes.

To achieve data parallelism in OpenCL:

1. Define an N-dimensional computation domain

  • Each independent element of execution in the N-D domain is called a work-item
  • The N-D domain defines the total number of work-items that execute in parallel, known as the global work size

2. Group work-items together into work-groups

  • Work-items in a group can communicate with each other
  • Execution can be synchronized among the work-items in a group to coordinate memory access

3. Execute multiple work-groups in parallel
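The three steps above can be sketched in plain C. This sketch runs everything sequentially on the host, so it only shows how a 2-D domain is carved into work-groups and work-items, not real parallel execution. The function names scale_item and run_2d_domain are hypothetical, and the domain size is assumed to be an exact multiple of the work-group size.

```c
#include <assert.h>
#include <stddef.h>

/* One work-item: doubles the element at its own (gx, gy) position. */
void scale_item(float *data, size_t width, size_t gx, size_t gy)
{
    data[gy * width + gx] *= 2.0f;
}

/* Emulates a width x height domain split into group_w x group_h work-groups.
   Returns the number of work-items executed (the global work size). */
size_t run_2d_domain(float *data, size_t width, size_t height,
                     size_t group_w, size_t group_h)
{
    assert(group_w > 0 && group_h > 0);
    assert(width % group_w == 0 && height % group_h == 0);

    size_t executed = 0;
    /* Step 3: iterate over work-groups (these could run in parallel). */
    for (size_t wy = 0; wy < height / group_h; ++wy) {
        for (size_t wx = 0; wx < width / group_w; ++wx) {
            /* Step 2: iterate over the work-items inside one work-group. */
            for (size_t ly = 0; ly < group_h; ++ly) {
                for (size_t lx = 0; lx < group_w; ++lx) {
                    /* Step 1: each point of the N-D domain is a work-item. */
                    scale_item(data, width,
                               wx * group_w + lx, wy * group_h + ly);
                    ++executed;
                }
            }
        }
    }
    return executed;
}
```

Because every work-item touches only its own element, the loops could run in any order, or all at once, and still give the same result. That independence is what makes the problem data-parallel.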

Example of data parallelism in OpenCL:
[Figure: Data parallelism]

Task parallelism:

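Focuses on distributing execution processes (threads) across different parallel computing nodes.

In OpenCL, this can be achieved by queuing independent kernels as separate tasks, each of which may run as a single work-item, so that different devices or cores work on different jobs at the same time.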

OpenCL Objects

  • Setup objects:
  1. Devices: GPU, CPU, Cell.
  2. Context: a collection of devices.
  3. Queues: submit work to the devices.
  • Memory objects:
  1. Buffers: blocks of memory.
  2. Image objects: 2D or 3D images.
  • Execution objects:
  1. Programs.
  2. Kernels.

How to submit work to the computing devices in the system?

There are three basic steps to do this:
  1. Compile the programs you wrote.
  2. Create the memory objects and buffers, and set the arguments and parameters of each kernel to the desired values.
  3. Use command queues to enqueue those kernels and send the code for execution.
Before any of these steps can run, we must know the number and types of devices and hardware we have, and set them up:
First, query for the devices in the system using clGetDeviceIDs.
Then create a context to put the devices in so that they can share data and communicate; this is done using clCreateContext.
Finally, create a command queue, which allows us to talk to these devices.
NB: a multi-core device is considered one device.

Simple Example – Vector Addition Kernel

The following is a simple vector addition kernel written in OpenCL. You can see that the kernel specifies three memory objects: two for input, a and b, and a single output, c. These are arrays of data that reside in the global memory space. In this example, each work-item executing this kernel gets its unique work-item ID and uses that to complete its part of the vector addition by reading the appropriate value from a and b and storing the sum into c.
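A minimal sketch of such a kernel is shown below, written so that it also compiles as ordinary C. The kernel name vector_add is an assumption. The two #define lines, the get_global_id stand-in, and the run_vector_add helper exist only to emulate on the host what OpenCL provides for you: in real OpenCL C, kernel and global (also spelled __kernel and __global) are language qualifiers and get_global_id is built in.

```c
#include <assert.h>
#include <stddef.h>

/* Host-side stand-ins so that this sketch compiles as plain C.
   They are NOT needed (and must be removed) in real OpenCL C. */
#define kernel
#define global
static size_t current_global_id;
size_t get_global_id(unsigned dim)
{
    assert(dim == 0);              /* this sketch is one-dimensional */
    return current_global_id;
}

/* The kernel itself: each work-item adds one pair of elements. */
kernel void vector_add(global const float *a,
                       global const float *b,
                       global float *c)
{
    size_t gid = get_global_id(0); /* unique ID of this work-item */
    c[gid] = a[gid] + b[gid];
}

/* Hypothetical helper: emulates launching n work-items one after another. */
void run_vector_add(const float *a, const float *b, float *c, size_t n)
{
    for (current_global_id = 0; current_global_id < n; ++current_global_id)
        vector_add(a, b, c);
}
```

Note that the kernel contains no loop over the array. The loop lives in the launch: on a real device the OpenCL runtime creates one work-item per element, and here run_vector_add plays that role sequentially.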

Since, in this example, we will be using online compilation, the above code will be stored in a character array named program_source.
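As a rough sketch, a program_source array holding a vector addition kernel could be declared as below. The kernel name vector_add and the helper program_source_length are assumptions made for illustration; only the name program_source comes from the text.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* OpenCL C source kept as a string, so that the OpenCL runtime can
   compile it for whichever device is selected at run time. */
const char *program_source =
    "__kernel void vector_add(__global const float *a,\n"
    "                         __global const float *b,\n"
    "                         __global float *c)\n"
    "{\n"
    "    int gid = get_global_id(0);\n"
    "    c[gid] = a[gid] + b[gid];\n"
    "}\n";

/* Hypothetical helper: length of the source text in bytes. */
size_t program_source_length(void)
{
    size_t length = strlen(program_source);
    assert(length > 0);
    return length;
}
```

Keeping the kernel as text is what makes online compilation possible: the host program hands the string to the OpenCL runtime, which compiles it for the device it finds.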

To complement the compute kernel code, the following is the code run on the host processor to:

  • Open an OpenCL context,
  • Get and select the devices to execute on,
  • Create a command queue to accept the execution and memory requests,
  • Allocate OpenCL memory objects to hold the inputs and outputs for the compute kernel,
  • Online compile and build the compute kernel code,
  • Set up the arguments and execution domain,
  • Kick off compute kernel execution, and
  • Collect the results.


It is really hard to say whether OpenCL will last, but I think that the future lies with OpenCL, as it is an open standard that is not restricted to one vendor or to specific hardware. Another reason is that AMD is going to release a new processor called Fusion, AMD’s forthcoming CPU + GPU product on one hybrid silicon chip.

This processor would be perfect for OpenCL, as OpenCL doesn’t care what type of processor is available, as long as it can be used.