I was teaching a microprocessors’ design internals course last fall semester. I planned at the beginning to give my students the opportunity to design a toy microprocessor and optimize important performance factors, such as the pipelining, branch prediction, instructions issuance, etc. However, I decided to link them to the industry and give them a project to implement a simple parallel algorithm on the Cell Broadband Engine and monitor critical performance factors. At the end my objective was to teach them through a real processor possible design tradeoffs and their effect on performance and general effectiveness of the microprocessor. So I had 21 teams of 4 or 5 students working on different discrete algorithms, such as sorting, prime checking, matrix multiplication, and Fibonacci calculations. I asked them to submit a report and write at end of it possible architectural improvements to boost the Cell processor’s performance from their project’s experiences. I found out some interesting conclusions that worth sharing here with you. I reworded some of these suggestions and added some details since they are extracted from a different context.
- Instructions fetching unit inside the SPEs may suffer from starvation if we have a lot of DMA requests that must be served. This can take place because the high priority assigned to the DMA requests inside the SPE. IBM suggests balancing your source code and including IFETCH instruction to give instructions fetching unit time to fetch more instructions from the cache. Some students suggested including a separate instructions cache; this would make instructions fetching independent from the DMA requests or registers load/store instructions. This should solve this problem and avoid some of the coding complexity while programming the Cell. Also given most of the applications written on the Cell, the text size is relatively very small. So if 64KB cache for code is built inside the next generation of the Cell processor may boost performance and guarantee most of the time smooth instructions execution.
- A lot of the vector intrinsics were for vectors of floats. Many operations were required for vectors of integers. Students had to type cast to floats before using many of the vector operations, which of course may provide inaccurate answers and consumes more time.
- Of course the most commonly asked improvement is increasing the LS size inside each SPE. The main reason for some students is to include more buffers and utilize more the multi-buffering technique and better performance at the end.
- Other students went wild and suggested to change the priorities of the DMA, IFETCH, and MEM operations within the SPEs. Instead of having them DMA>MEM>IFETCH, they suggested to invert them to avoid starvation of instructions fetch unit.
- Another worth mentioning suggestion is to create memory allocation function that would guarantee allocation of large data structures to different memory banks, which would reduce the DMA latency. For example, if we need a large array and each range will be accessed by a different SPE, we can allocate this array into different memory banks to avoid hotspots inside memory while the SPEs are executing. It is already done by the IBM’s FFT implementation on the Cell processors.
Of course I filtered out some suggests that are of a common sense to any programmer, such as avoiding the 16 bytes memory alignment. I was impressed by their ability to understand and pinpoint some serious problems inside the CBE in less than 6 weeks period.
CSEN 702 Class: Thanks!