Hey,

In my earlier blog post I quickly went through how CPUs and GPUs scale out their performance, and I mentioned how the APU tries to harness the best of both worlds. This time, let me quickly walk through a simple example and show how APUs present an excellent platform for solving it.

Consider the problem of parallel summation across a very large array. How would you solve this problem on a CPU? Here is the pseudo code:

  1. Take an input array.
  2. Block it based on the number of threads (usually one per core – 4 or 8 cores).
  3. Iterate to produce a sum in each block.
  4. Reduce across threads.
  5. Vectorize your execution step through the SIMD ISA.

Have a look at the pseudo code below:

  // Summation phase: each thread sums its own contiguous block
  float4 sum(0, 0, 0, 0);
  int begin = (n / threads_num) * thread_id;
  int end = (n / threads_num) * (thread_id + 1);
  for (int i = begin; i < end; i++)
      sum += input[i];              // input is packed as float4 (SIMD)
  float scalarSum = sum.x + sum.y + sum.z + sum.w;

  // Reduction stage to aggregate the per-thread results
  float reductionValue(0);
  for (int t = 0; t < threads_num; t++)
      reductionValue += t_sum[t];   // t_sum[t] holds thread t's scalarSum
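To make the CPU version concrete, here is a minimal runnable C++ sketch of the same steps using std::thread. Plain floats stand in for the float4 lanes, and names like partial_sums are my own illustration rather than any particular API:

  #include <cstddef>
  #include <iostream>
  #include <numeric>
  #include <thread>
  #include <vector>

  int main() {
      const std::size_t n = 1 << 20;
      const unsigned threads_num = 4;        // e.g. one thread per core
      std::vector<float> input(n, 1.0f);
      std::vector<float> partial_sums(threads_num, 0.0f);

      // Steps 2-3: block the array and sum each block in its own thread
      std::vector<std::thread> workers;
      const std::size_t block = n / threads_num;
      for (unsigned t = 0; t < threads_num; ++t) {
          workers.emplace_back([&, t] {
              const std::size_t begin = t * block;
              const std::size_t end = (t == threads_num - 1) ? n : begin + block;
              float sum = 0.0f;              // the compiler may vectorize this loop
              for (std::size_t i = begin; i < end; ++i)
                  sum += input[i];
              partial_sums[t] = sum;
          });
      }
      for (auto& w : workers) w.join();

      // Step 4: reduce across threads
      float total = std::accumulate(partial_sums.begin(), partial_sums.end(), 0.0f);
      std::cout << "sum = " << total << "\n"; // expect 1048576
  }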

Think now of an efficient implementation on the GPU:

  1. Take the input array.
  2. Block it based on the number of threads (16 per core; it could be up to 64 per core).
  3. Iterate to produce a sum in each block.
  4. Reduce/Sum across threads.
  5. Vectorize through a different kernel call, due to the limitations of current execution models.
The corresponding pseudo code looks like this:

  // Summation phase: each thread sums its own contiguous block
  float64 sum(0, ..., 0);           // a wide vector, one lane per work-item
  int begin = (n / threads_num) * thread_id;
  int end = (n / threads_num) * (thread_id + 1);
  for (int i = begin; i < end; i++)
      sum += input[i];

  // Reduction stage (a separate kernel) to aggregate the per-thread results
  float reductionValue(0);
  for (int t = 0; t < threads_num; t++)
      reductionValue += t_sum[t];

They don’t look so different from each other, right? Basically you do the same steps; the main differences are the number of cores and the number of threads. On GPUs you have way more threads doing the summation, which may complicate your model, and all these threads bring with them a lot of state management overhead, context switching, and problematic stack management. On CPU cores you get data parallelism through a limited number of cores and threads. Narrow SIMD units simplify the problem, high clock rates and caches make serial execution efficient for each single thread, and the simple mapping of tasks to threads allows us to create complex task graphs. However, this comes at the cost of many loop iterations. In other words, GPUs support a very fine-grained data parallel execution model, while CPUs provide a coarse-grained one.

APUs combine the two by enabling nested data parallel code: the CPU cores take coarse-grained tasks and break them down into finer-grained tasks that the on-chip GPU executes faster. The close coupling of the CPU and GPU eliminates the cost of moving data between them to execute this nested data parallel model. Also, CPUs can handle conditional data parallel execution much better than GPUs, and offloading computations becomes more efficient since there is virtually zero data copying in the offload process.

Applications can now combine high and low degrees of threading at almost zero cost, and interesting execution models become possible. You can have multiple kernels executing simultaneously, communicating through shared buffers with relatively low synchronization overhead. So, back to our example: we can divide our array across the four CPU cores; each core then offloads its summation to the GPU threads and does the reduction at its level, and finally all the CPU cores synchronize and do the last reduction step with very low overhead.
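Here is a rough C++ sketch of that control flow. The gpu_partial_sum() below is a hypothetical stand-in for launching a reduction kernel on the on-chip GPU, not a real API; on an APU the kernel would read the shared buffer in place through a pointer, with no copy:

  #include <cstddef>
  #include <numeric>
  #include <thread>
  #include <vector>

  // Hypothetical stand-in for a GPU kernel launch. On an APU the GPU would
  // read 'data' in place through shared memory, with no copy across a bus.
  float gpu_partial_sum(const float* data, std::size_t count) {
      return std::accumulate(data, data + count, 0.0f);
  }

  float apu_sum(const std::vector<float>& input, unsigned cpu_cores = 4) {
      std::vector<float> core_sums(cpu_cores, 0.0f);
      std::vector<std::thread> cores;
      const std::size_t block = input.size() / cpu_cores;

      // Each CPU core offloads its block to the on-chip GPU threads...
      for (unsigned c = 0; c < cpu_cores; ++c) {
          cores.emplace_back([&, c] {
              const std::size_t begin = c * block;
              const std::size_t count =
                  (c == cpu_cores - 1) ? input.size() - begin : block;
              core_sums[c] = gpu_partial_sum(input.data() + begin, count);
          });
      }
      for (auto& c : cores) c.join();

      // ...then the CPU cores synchronize and do the final reduction.
      return std::accumulate(core_sums.begin(), core_sums.end(), 0.0f);
  }

The point is the shape of the control flow: the coarse blocking and the final reduction stay on the CPU cores, while the fine-grained inner summation is handed to the GPU through a pointer into shared memory rather than a copied buffer.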

These, in short, are the possibilities the APU architecture opens up.

The question now is: how can we easily use all these capabilities without sacrificing performance? Moving from explicit data movement between CPUs and GPUs to shared memory spaces is tricky. CPUs use an explicit vector ISA and explicit memory access patterns, while GPUs depend on implicit vectorization, with multiple threads scheduled to access adjacent memory locations simultaneously. How can these two models be targeted by a clear, easy programming model with acceptable efficiency and true shared memory that lets us freely pass pointers between the CPU and GPU cores? This will be my next blog post. Stay tuned!
