Most probably, your knowledge of the Cell processor and some curiosity led you to this blog post.

Well, I had the same curiosity after working with the Cell processor for a while. I asked myself this question: is the DMA latency the same for all SPEs? In other words, if each SPE issues a DMA request for exactly the same size of data chunk, will the data be delivered to its local store in the same time as on the other SPEs?

The short and proven answer is: NO

Each SPE has a different DMA latency because of its physical location, that is, its distance from the memory controller. There is only one memory controller in the initial Cell implementation, and the physical distance from it makes a considerable difference in memory latency from one SPE to another. This latency difference gets even bigger as the DMA chunk gets bigger. For example, the nearest SPE to the memory controller retrieves 4 KB from main memory in around 2000 nanoseconds, while the farthest SPE receives the same chunk in around 420 nanoseconds.

So, what does this mean? And should I care about it?

Well, this simply means that double buffering or multi-buffering does not hide memory latency with the same efficiency on all SPEs. SPEs located physically near the memory controller can have almost all of the memory latency hidden, but the far ones may still suffer from some latency. You can download the code from here and test it yourself if you have a Cell machine. It will not work on the simulator, since the simulator does not model the DMA latency, even in cycle mode.

If you are using double buffering, you are still getting better performance than with a single buffer. However, you are not getting the best possible performance; there is still room for improvement.
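To make the buffering argument concrete, here is a rough back-of-the-envelope model, not the benchmark code linked above. It assumes a fixed DMA latency per chunk, perfect overlap between one in-flight DMA and computation, and a hypothetical compute time of 1000 ns per 4 KB chunk (my own illustrative number). The two latencies are the 4 KB figures quoted in this post; the function names are made up for this sketch.

```python
def single_buffer_time(chunks, dma_ns, compute_ns):
    """Single buffer: fetch a chunk, then compute on it, strictly in sequence."""
    return chunks * (dma_ns + compute_ns)


def double_buffer_time(chunks, dma_ns, compute_ns):
    """Double buffer: the first fetch is fully exposed, then each step costs
    the slower of (fetch next chunk) and (compute current chunk), and the
    last chunk is computed with nothing left to fetch."""
    if chunks == 0:
        return 0
    return dma_ns + (chunks - 1) * max(dma_ns, compute_ns) + compute_ns


def hidden_fraction(chunks, dma_ns, compute_ns):
    """Share of the total DMA time that double buffering manages to hide."""
    saved = (single_buffer_time(chunks, dma_ns, compute_ns)
             - double_buffer_time(chunks, dma_ns, compute_ns))
    return saved / (chunks * dma_ns)


if __name__ == "__main__":
    CHUNKS = 100        # assumption: number of 4 KB chunks processed
    COMPUTE_NS = 1000   # assumption: hypothetical compute time per chunk
    for label, dma_ns in (("low-latency SPE", 420), ("high-latency SPE", 2000)):
        frac = hidden_fraction(CHUNKS, dma_ns, COMPUTE_NS)
        print(f"{label}: {dma_ns} ns per DMA, {frac:.1%} of DMA time hidden")
```

Under these assumptions, the SPE with the 420 ns latency has about 99% of its DMA time hidden, while the SPE with the 2000 ns latency has only about half of it hidden. Double buffering still beats a single buffer in both cases, but the slower SPE stalls on every chunk because the transfer takes longer than the computation it is supposed to overlap with.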

If you have the new PowerXCell 8i, the picture might be different, since there are two memory controllers on that Cell implementation. Please share your numbers with us if you have one. You can find my measurements here.