The Cell Broadband Engine



ABSTRACT

The slowing pace of commodity microprocessor performance improvements, combined with ever-increasing chip power demands, has become of utmost concern to computational scientists. As a result, the high performance computing community is examining alternative architectures that address the limitations of modern cache-based designs. In this work, we examine the potential of the forthcoming Cell processor as a building block for future high-end computing systems. Our work contains several novel contributions. First, we give background on the Cell processor and its implementation history. Next, we give an overview of the architecture of the Cell processor. Additionally, we compare Cell performance to benchmarks run on other modern architectures. Then, we present some example products that apply the Cell Broadband Engine Architecture in their designs. We also give a brief account of software development and engineering under the Cell architecture, and of how software can multiply its efficiency and speed if implemented to use this architecture. Finally, we discuss the future of the Cell architecture as one of the most advanced architectures on the market.

1. INTRODUCTION

Sony Computer Entertainment, Toshiba Corporation, and IBM formed an alliance in 2000 to design and manufacture a new generation of processors. The Cell was designed over a period of four years, using enhanced versions of the design tools for the POWER4 processor. Over 400 engineers from the three companies worked together in Austin, with critical support from eleven of IBM's design centers. Cell combines a general-purpose Power Architecture core of modest performance with streamlined co-processing elements which greatly accelerate multimedia and vector processing applications, as well as many other forms of dedicated computation. The Cell architecture includes a novel memory-coherence architecture for which IBM received many patents. The architecture emphasizes efficiency per watt, prioritizes bandwidth over latency, and favors peak computational throughput over simplicity of program code. For these reasons, Cell is widely regarded as a challenging environment for software development. In 2005, Sony Computer Entertainment confirmed some specifications of the Cell processor shipped in its PlayStation 3 gaming console. This Cell configuration has one Power Processing Element (PPE) on the core, with eight physical SPEs in silicon. The PS3's Cell was the first Cell processor on the market. Although it is not as advanced as the Cell revisions currently being developed at IBM, it competed with the most advanced processors on the market, demonstrating the architecture's efficiency.

2. CELL ARCHITECTURE

Cell takes a radical departure from conventional multiprocessor or multicore architectures. Instead of using identical cooperating commodity processors, it uses a conventional high-performance PowerPC core that controls eight simple SIMD cores, called synergistic processing elements (SPEs), where each SPE contains a synergistic processing unit (SPU), a local memory, and a memory flow controller. Access to external memory is handled via a 25.6 GB/s XDR memory controller. The cache-coherent PowerPC core, the eight SPEs, the DRAM controller, and the I/O controllers are all connected via four data rings, collectively known as the EIB. The ring interface within each unit allows 8 bytes/cycle to be read or written, and simultaneous transfers on the same ring are possible. All transfers are orchestrated by the PowerPC core. Each SPE includes four single precision (SP) 6-cycle pipelined FMA datapaths and one double precision (DP) half-pumped (the double precision operations within a SIMD operation must be serialized) 9-cycle pipelined FMA datapath with 4 cycles of overhead for data movement. Cell has a 7-cycle in-order execution pipeline and forwarding network. IBM appears to have solved the problem of inserting a 13-cycle (9+4) DP pipeline into a 7-stage in-order machine by choosing the minimum effort/performance/power solution of simply stalling for 6 cycles after issuing a DP instruction. We now take each element individually, define it, and give a brief description of it.

2.1 Power Processor Element

The PPE is the Power Architecture based, two-way multi-threaded core acting as the controller for the eight SPEs, which handle most of the computational workload. The PPE will work with conventional operating systems due to its similarity to other 64-bit PowerPC processors, while the SPEs are designed for vectorized floating point code execution. The PPE contains a 32 KiB instruction and a 32 KiB data Level 1 cache and a 512 KiB Level 2 cache. Additionally, IBM has included an AltiVec unit which is fully pipelined for single precision floating point. (AltiVec does not support double precision floating point vectors.) Each PPE can complete two double precision operations per clock cycle using a scalar fused multiply-add instruction, which translates to 6.4 GFLOPS at 3.2 GHz, or eight single precision operations per clock cycle with a vector fused multiply-add instruction, which translates to 25.6 GFLOPS at 3.2 GHz.
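
The quoted PPE rates follow directly from the clock and the flops issued per cycle. A minimal sketch of that arithmetic, using the figures stated above (3.2 GHz; one scalar DP FMA = 2 flops/cycle; a 4-wide SP vector FMA = 8 flops/cycle):

```python
# Peak-throughput arithmetic for the PPE figures quoted in the text.
def peak_gflops(clock_ghz, flops_per_cycle):
    """Peak rate in GFLOPS = clock (GHz) * flops completed per cycle."""
    return clock_ghz * flops_per_cycle

ppe_dp = peak_gflops(3.2, 2)   # scalar fused multiply-add: 1 FMA = 2 flops
ppe_sp = peak_gflops(3.2, 8)   # 4-wide SP vector FMA: 4 lanes * 2 flops
```

At 3.2 GHz this reproduces the 6.4 GFLOPS (DP) and 25.6 GFLOPS (SP) figures in the text.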

2.2 Synergistic Processing Elements (SPE)

Each SPE is composed of a "Synergistic Processing Unit" (SPU) and a "Memory Flow Controller" (MFC), comprising the DMA engine, MMU, and bus interface. An SPE is a RISC processor with a 128-bit SIMD organization for single and double precision instructions. With the current generation of the Cell, each SPE contains a 256 KiB embedded SRAM for instructions and data, called "Local Storage" (not to be mistaken for the "Local Memory" in Sony's documents, which refers to the VRAM), which is visible to the PPE and can be addressed directly by software. Each SPE can support up to 4 GiB of local store memory. The local store does not operate like a conventional CPU cache, since it is neither transparent to software nor does it contain hardware structures that predict which data to load. Each SPE contains a 128-entry file of 128-bit registers and measures 14.5 mm² on a 90 nm process. An SPE can operate on 16 8-bit integers, 8 16-bit integers, 4 32-bit integers, or 4 single precision floating-point numbers in a single clock cycle, as well as perform a memory operation. Note that the SPU cannot directly access system memory; the 64-bit virtual memory addresses formed by the SPU must be passed to the SPE's memory flow controller (MFC) to set up a DMA operation within the system address space.

In one typical usage scenario, the system will load the SPEs with small programs (similar to threads), chaining the SPEs together to handle each step in a complex operation. For instance, a set-top box might load programs for reading a DVD, video and audio decoding, and display, and the data would be passed off from SPE to SPE until finally ending up on the TV. Another possibility is to partition the input data set and have several SPEs performing the same kind of operation in parallel. At 3.2 GHz, each SPE gives a theoretical 25.6 GFLOPS of single precision performance. Compared to a modern personal computer, the relatively high overall floating point performance of a Cell processor seemingly dwarfs the abilities of the SIMD unit in desktop CPUs like the Pentium 4 and the Athlon 64. However, comparing only the floating point abilities of a system is a one-dimensional and application-specific metric. Unlike a Cell processor, such desktop CPUs are better suited to the general-purpose software usually run on personal computers. In addition to executing multiple instructions per clock, processors from Intel and AMD feature branch predictors; the Cell is designed to compensate for this with compiler assistance, in which prepare-to-branch instructions are created. For double precision, as often used in personal computers, Cell performance drops by an order of magnitude, but still reaches 14 GFLOPS. Recent tests by IBM show that the SPEs can reach 98% of their theoretical peak performance running optimized parallel matrix multiplication. Toshiba has developed a processor powered by four SPEs, but no PPE, called the SpursEngine, designed to accelerate 3D and movie effects in consumer electronics.
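
The SPE lane counts and peak rate above all fall out of the 128-bit register width. A small sketch (the `vec_fma` emulation is illustrative, not the SPU intrinsic):

```python
# The SPE's 128-bit SIMD datapath: lane count per element size, the SP peak
# rate (4 lanes * 2 flops per FMA * 3.2 GHz), and an emulated 4-wide FMA.
REGISTER_BITS = 128

def simd_lanes(element_bits):
    """How many elements of a given width fit in one 128-bit register."""
    return REGISTER_BITS // element_bits

spe_sp_gflops = 3.2 * simd_lanes(32) * 2   # 25.6 GFLOPS per SPE at 3.2 GHz

def vec_fma(a, b, c):
    """Emulate one SIMD fused multiply-add over four SP floats: a*b + c."""
    return [ai * bi + ci for ai, bi, ci in zip(a, b, c)]
```

This reproduces the 16/8/4-element figures and the 25.6 GFLOPS per-SPE number quoted in the text.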

2.3 Element Interconnect Bus (EIB)

The EIB is a communication bus internal to the Cell processor which connects the various on-chip system elements: the PPE, the memory controller (MIC), the eight SPE co-processors, and two off-chip I/O interfaces, for a total of 12 participants. The EIB also includes an arbitration unit which functions as a set of traffic lights. In some documents IBM refers to EIB participants as 'units'. The EIB is presently implemented as a circular ring composed of four 16-byte-wide unidirectional channels which counter-rotate in pairs. When traffic patterns permit, each channel can convey up to three transactions concurrently. As the EIB runs at half the system clock rate, the effective channel rate is 16 bytes every two system clocks. At maximum concurrency, with three active transactions on each of the four rings, the peak instantaneous EIB bandwidth is 96 bytes per clock (12 concurrent transactions × 16 bytes / 2 system clocks per transfer). Each participant on the EIB has one 16-byte read port and one 16-byte write port, so the limit for a single participant is to read and write at a rate of 16 bytes per EIB clock (for simplicity, often regarded as 8 bytes per system clock). Note that each SPU contains a dedicated DMA management queue capable of scheduling long sequences of transactions to various endpoints without interfering with the SPU's ongoing computations; these DMA queues can be managed locally or remotely, providing additional flexibility in the control model.

Data flows on an EIB channel stepwise around the ring. Since there are twelve participants, the total number of steps around the channel back to the point of origin is twelve, and six steps is the longest distance between any pair of participants. An EIB channel is not permitted to convey data requiring more than six steps; such data must take the shorter route around the circle in the other direction. The number of steps involved in sending a packet has very little impact on transfer latency: the clock speed driving the steps is very fast relative to other considerations. However, longer communication distances are detrimental to the overall performance of the EIB, as they reduce available concurrency. Despite IBM's original desire to implement the EIB as a more powerful crossbar, the circular configuration they adopted to spare resources rarely represents a limiting factor on the performance of the Cell chip as a whole. In the worst case, the programmer must take extra care to schedule communication patterns where the EIB is able to function at high concurrency levels.
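
The two quantitative rules above, the 96-byte-per-clock peak and the six-step routing limit, can be sketched directly:

```python
# EIB peak bandwidth and shortest-path routing, using the figures in the text:
# 4 rings, 3 concurrent transactions per ring, 16-byte channels, one transfer
# every two system clocks, 12 ring participants.
RINGS = 4
TRANSACTIONS_PER_RING = 3
CHANNEL_BYTES = 16
CLOCKS_PER_TRANSFER = 2   # the EIB runs at half the system clock

peak_bytes_per_clock = (RINGS * TRANSACTIONS_PER_RING * CHANNEL_BYTES
                        / CLOCKS_PER_TRANSFER)   # 96 B per system clock

def eib_hops(src, dst, participants=12):
    """Steps a packet travels: always the shorter way around, so at most 6."""
    d = (dst - src) % participants
    return min(d, participants - d)
```

At a 3.2 GHz system clock, 96 B/clock would correspond to roughly 307 GB/s of instantaneous peak; that derived figure is my own arithmetic, not a number from the text.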

2.4 Memory controller and I/O

Cell contains a dual-channel next-generation Rambus XIO macro which interfaces to Rambus XDR memory. The memory interface controller (MIC) is separate from the XIO macro and is designed by IBM. The XIO-XDR link runs at 3.2 Gbit/s per pin; two 32-bit channels can provide a theoretical maximum of 25.6 GB/s. The system interface used in Cell, also a Rambus design, is known as FlexIO. The FlexIO interface is organized into 12 lanes, each lane being a unidirectional 8-bit-wide point-to-point path. Five of these lanes are inbound to Cell, while the remaining seven are outbound. This provides a theoretical peak bandwidth of 62.4 GB/s (36.4 GB/s outbound, 26 GB/s inbound) at 2.6 GHz. The FlexIO interface can be clocked independently, typically at 3.2 GHz. Four inbound and four outbound lanes support memory coherency.
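
The XDR figure above follows straight from the stated pin rate and channel width. For FlexIO, a per-lane rate of 5.2 GB/s (a byte-wide lane at 2.6 GHz with two transfers per clock) is an assumption of mine that reproduces the quoted totals; the text only gives the aggregate numbers.

```python
# Bandwidth arithmetic for the memory and I/O interfaces described above.
xdr_gb_s = 2 * 32 * 3.2 / 8    # two 32-bit channels at 3.2 Gbit/s per pin

LANE_GB_S = 2.6 * 2            # assumed: 1 byte * 2.6 GHz * 2 transfers/clock
flexio_out = 7 * LANE_GB_S     # seven outbound lanes
flexio_in = 5 * LANE_GB_S      # five inbound lanes
```

This gives 25.6 GB/s for XDR and 36.4 + 26 = 62.4 GB/s for FlexIO, matching the text.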

3. PERFORMANCE

High performance computing aims at maximizing the performance of grand challenge problems such as protein folding and accurate real-time weather prediction. Whereas past performance improvements were obtained by aggressive frequency scaling using microarchitecture and manufacturing techniques, technology limits require future performance improvements to come from exploiting parallelism with a multi-core design approach. The Cell Broadband Engine is an exciting new execution platform answering this design challenge for compute-intensive applications, one that reflects both the requirements of future computational workloads and manufacturing constraints.

The Cell B.E. is a heterogeneous chip multiprocessor architecture with compute accelerators achieving in excess of 200 GFLOPS per chip. The simplicity of the SPEs and the deterministic behavior of the explicitly controlled memory hierarchy make Cell amenable to performance prediction using a simple analytic model. Using this approach, one can easily explore multiple variations of an algorithm without the effort of programming each variation and running it on either a fully cycle-accurate simulator or hardware. With the newly released cycle-accurate simulator (Mambo), we have successfully validated our performance model for SGEMM, SpMV, and stencil computations, as will be shown in the subsequent sections.

Our modeling approach is broken into two steps, commensurate with the two-phase double-buffered computational model. The kernels were first segmented into codes that operate only on data present in the local store of the SPE. We sketched the code snippets in SPE assembly and performed static timing analysis. The latency of each operation, issue width limitations, and the operand alignment requirements of the SIMD/quadword SPE execution pipeline determined the number of cycles required. The in-order nature and fixed local store memory latency of the SPEs make the analysis deterministic and thus more tractable than on cache-based, out-of-order microprocessors. In the second step, we construct a model that tabulates the time required for DMA loads and stores of the operands required by the code snippets. The model accurately reflects the constraints imposed by resource conflicts in the memory subsystem. For instance, concurrent DMAs issued by multiple SPEs must be serialized, as there is only a single DRAM controller. The model also presumes a conservative fixed DMA initiation latency of 1000 cycles. The model computes the total time by adding all the (outer loop) times, which are themselves computed by taking the maximum of the snippet and DMA transfer times. In some cases, the per-iteration times are constant across iterations, but in others they vary between iterations and are input-dependent. For example, in a sparse matrix, the memory access pattern depends on the nonzero structure of the matrix, which varies across iterations. Some algorithms may also require separate stages with different execution times; e.g., the FFT has stages for loading data, loading constants, local computation, transpose, local computation, bit reversal, and storing the results.

For simplicity we chose to model a 3.2 GHz, 8-SPE version of Cell with 25.6 GB/s of memory bandwidth. This version of Cell is likely to be used in the first release of the Sony PlayStation 3. The lower frequency had the simplifying benefit that both the EIB and the DRAM controller could deliver two SP words per cycle. The maximum flop rate of such a machine would be 204.8 GFLOP/s, with a computational intensity of 32 FLOPs/word.
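
The analytic model described above can be sketched in a few lines. This is my own minimal reading of it, not the authors' code: each outer-loop iteration costs the maximum of its compute-snippet time and its DMA time (since double buffering overlaps them), and the fixed 1000-cycle DMA initiation latency is applied once up front for the first, non-overlapped load, a simplification of mine.

```python
# Two-phase double-buffered performance model: total = startup DMA latency
# plus, per iteration, max(static snippet cycles, DMA transfer cycles).
DMA_INIT_CYCLES = 1000   # conservative fixed initiation latency from the text

def model_total_cycles(snippet_cycles, dma_cycles):
    """snippet_cycles[i], dma_cycles[i]: per-iteration static estimates."""
    total = DMA_INIT_CYCLES   # the first load cannot be overlapped
    for compute, dma in zip(snippet_cycles, dma_cycles):
        total += max(compute, dma)
    return total
```

For an input-dependent kernel like SpMV, the per-iteration entries would simply differ from one iteration to the next, as the text notes.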

4. IMPLEMENTATIONS

Many products are being implemented right now using Cell processors, and these new hardware applications will change expectations of performance worldwide. Using the Cell processor as the brainpower of these applications multiplies their performance, giving users a new experience. Below, we describe some of these applications being implemented at advanced technological institutes around the world.

4.1 Blade Server

IBM announced the BladeCenter QS21. Generating a measured 1.05 giga floating point operations per second (GigaFLOPS) per watt, with peak performance of approximately 460 GFLOPS, it is one of the most power-efficient computing platforms to date. A single BladeCenter chassis can achieve 6.4 tera floating point operations per second (TeraFLOPS), and over 25.8 TeraFLOPS in a standard 42U rack.

4.2 Console Video Games

Sony’s PlayStation 3 game console contains the first production application of the Cell processor, clocked at 3.2 GHz and containing seven of the eight SPEs operational, to allow Sony to increase the yield of processor manufacturing. Only six of the seven SPEs are accessible to developers, as one is reserved by the OS. Although PS3 games’ graphics are advanced and demanding, they run smoothly thanks to the Cell processor’s cores.

4.3 Home Cinema

Reportedly, Toshiba is considering producing HDTVs using Cell. They have already presented a system to decode 48 standard definition MPEG2 streams simultaneously on a 1920×1080 screen. This can enable a viewer to choose a channel based on dozens of thumbnail videos displayed simultaneously on the screen.

4.4 Super Computing

IBM’s new planned supercomputer, IBM Roadrunner, will be a hybrid of general-purpose CISC processors and Cell processors. It is reported that this combination will produce the first computer to run at petaflop speeds. It will use an updated version of the Cell processor, manufactured in 65 nm technology, with enhanced SPUs that can handle double precision calculations in the 128-bit registers, reaching 100 GFLOPS in double precision.

4.5 Cluster Computing


Clusters of PlayStation 3 consoles are an attractive alternative to high-end systems based on Cell blades. The Innovative Computing Laboratory, a group led by Jack Dongarra in the Computer Science Department at the University of Tennessee, investigated such an application in depth. Terrasoft Solutions is selling 8-node and 32-node PS3 clusters with Yellow Dog Linux pre-installed, an implementation of Dongarra’s research. As reported by Wired Magazine on October 17, 2007, an interesting application of the PlayStation 3 in a cluster configuration was implemented by astrophysicist Dr. Gaurav Khanna, who replaced time used on supercomputers with a cluster of eight PlayStation 3s. The Computational Biochemistry and Biophysics Lab at Universitat Pompeu Fabra, in Barcelona, deployed in 2007 a BOINC system called PS3GRID for collaborative computing, based on the CellMD software, the first designed specifically for the Cell processor.

4.6 Distributed Computing

With the help of the computing power of over half a million PlayStation 3 consoles, the distributed computing project Folding@home has been recognized by Guinness World Records as the most powerful distributed network in the world. The first record was achieved on September 16, 2007, as the project surpassed one petaFLOPS, which had never before been reached by a distributed computing network. Additionally, the collective efforts enabled the PS3s alone to reach the petaFLOPS mark on September 23, 2007. In comparison, the world’s most powerful supercomputer, IBM’s BlueGene/L, performs around 280.6 teraFLOPS. This means Folding@home’s computing power is approximately four times BlueGene/L’s (although the CPU interconnect in BlueGene/L is more than one million times faster than the mean network speed in Folding@home).

5. SOFTWARE ENGINEERING

Software development for the Cell microprocessor involves a mixture of conventional development practices for the Power Architecture-compatible PPE core and novel software development challenges with regard to the functionally reduced SPE co-processors. As the previous sections showed, the Cell processor is multi-cored with highly efficient parallelism, so software applications can multiply their performance if they make use of this architecture. For example, IBM implemented a Linux base running on the Cell processor in order to speed up software development under the Cell architecture. Some Linux distributions built on this base and developed a fully functional operating system running on the Cell architecture, such as Yellow Dog Linux. However, there is no reliable general-purpose OS for this architecture so far, though most observers believe reliable ones will arrive soon.

6. CELL FUTURE (Cell inside)

It is widely believed that Cell processors will replace current architectures in personal computers in the next decade, thanks to their performance and efficiency in addition to their low production cost. With IBM already claiming the Cell processor can run current PowerPC software, it is not hard to imagine Apple adopting it for future CPUs. A single 4.0 GHz Cell processor in an iBook or Mac mini would undoubtedly run circles around today’s 1.25-1.33 GHz entry-level Macs, and a quad-processor Power Mac at 4.0 GHz should handily outperform today’s 2.5 GHz Power Mac G5. Most software and hardware producers would then follow with Cell-compatible products.

ACKNOWLEDGMENTS

This paper is dedicated to academic purposes, submitted as a research report at the German University in Cairo.



I was teaching a microprocessor design internals course last fall semester. I planned at the beginning to give my students the opportunity to design a toy microprocessor and optimize important performance factors, such as pipelining, branch prediction, instruction issue, etc. However, I decided to link them to industry and give them a project to implement a simple parallel algorithm on the Cell Broadband Engine and monitor critical performance factors. My objective was to teach them, through a real processor, possible design tradeoffs and their effect on the performance and general effectiveness of a microprocessor. So I had 21 teams of 4 or 5 students working on different discrete algorithms, such as sorting, prime checking, matrix multiplication, and Fibonacci calculations. I asked them to submit a report and write at the end of it possible architectural improvements to boost the Cell processor’s performance based on their project experience. I found some interesting conclusions that are worth sharing here with you. I reworded some of these suggestions and added some details, since they are extracted from a different context.

  • The instruction fetch unit inside the SPEs may suffer from starvation if there are a lot of DMA requests that must be served. This can take place because of the high priority assigned to DMA requests inside the SPE. IBM suggests balancing your source code and including the IFETCH instruction to give the instruction fetch unit time to fetch more instructions. Some students suggested including a separate instruction cache; this would make instruction fetching independent of DMA requests and register load/store instructions, which should solve this problem and avoid some of the coding complexity of programming the Cell. Also, for most of the applications written on the Cell the text size is relatively small, so a 64 KB code cache built into the next generation of the Cell processor might boost performance and guarantee smooth instruction execution most of the time.
  • A lot of the vector intrinsics were for vectors of floats, while many operations were required for vectors of integers. Students had to type-cast to floats before using many of the vector operations, which of course may give inaccurate answers and consumes more time.
  • Of course, the most commonly requested improvement is increasing the LS size inside each SPE. The main reason, for some students, is to include more buffers, make more use of the multi-buffering technique, and get better performance in the end.
  • Other students went wild and suggested changing the priorities of the DMA, IFETCH, and MEM operations within the SPEs. Instead of DMA > MEM > IFETCH, they suggested inverting them to avoid starvation of the instruction fetch unit.
  • Another suggestion worth mentioning is to create a memory allocation function that would guarantee allocation of large data structures across different memory banks, which would reduce DMA latency. For example, if we need a large array and each range will be accessed by a different SPE, we can allocate this array across different memory banks to avoid hotspots inside memory while the SPEs are executing. This is already done by IBM’s FFT implementation on the Cell processor.
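
The bank-aware allocation idea in the last suggestion can be sketched as a simple round-robin mapping of array ranges to banks, so that SPEs streaming disjoint ranges do not all hit the same bank. The bank count here is an illustrative assumption, not a Cell specification:

```python
# Round-robin assignment of large-array ranges to memory banks, so that each
# SPE streaming its own range hits a different bank and hotspots are avoided.
NUM_BANKS = 8   # assumed bank count for illustration

def bank_for_range(range_index):
    """Map the i-th contiguous range of a large array to a memory bank."""
    return range_index % NUM_BANKS
```

With eight SPEs each owning one range, every SPE lands on a distinct bank; a real allocator would of course have to control physical placement, which this sketch only models.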

Of course, I filtered out some suggestions that are common sense to any programmer, such as working around the 16-byte memory alignment requirement. I was impressed by their ability to understand and pinpoint some serious problems inside the CBE in less than six weeks.

CSEN 702 Class: Thanks!


Most probably your knowledge about the Cell processor and curiosity led you to this blog post.

Well, I had the same curiosity after working for a while with the Cell processor. I asked myself this question: Is the DMA latency the same for all SPEs? In other words, if each SPE makes a DMA request for exactly the same size data chunk, would it be delivered to its local storage in the same time as for the other SPEs?

The short and proven answer is: NO

Each SPE has a different DMA latency due to its physical location, i.e., its distance from the memory controller. There is only one memory controller inside the initial Cell implementation. The physical distance from the memory controller makes a considerable difference in memory latency from one SPE to another. This latency difference gets even bigger as the DMA chunk gets bigger. For example, the nearest SPE to the memory controller retrieves 4 KB from main memory in around 420 nanoseconds, while the farthest SPE receives the same chunk in around 2000 nanoseconds.
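
A toy model of this effect: latency grows with an SPE's hop distance from the memory controller. The base latency and per-hop cost below are made-up constants chosen only so that a 4 KB transfer spans roughly 420 ns on the nearest SPE to roughly 2000 ns on the farthest (five hops away); they are not measured values.

```python
# Illustrative linear model of per-SPE DMA latency vs. distance from the
# memory controller (constants are assumptions, not measurements).
BASE_NS = 420      # nearest SPE, 4 KB chunk
PER_HOP_NS = 316   # assumed extra cost per hop: 5 extra hops -> ~2000 ns

def dma_latency_ns(hops_from_mic):
    """Modeled 4 KB DMA latency for an SPE a given number of hops away."""
    return BASE_NS + PER_HOP_NS * hops_from_mic
```

Whether the real growth is linear is exactly the kind of question the measurement code mentioned below is meant to answer.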

So, what does this mean? Or should I care about this?

Well, this means simply that double or multi-buffering does not hide memory latency inside all SPEs with the same efficiency. SPEs located physically near the memory controller can have almost all of the memory latency hidden, but far ones may still suffer from some latency. You can download the code from here and test it yourself if you have a Cell machine. It will not work on the simulator, since the simulator does not model DMA latency, even in cycle mode.

If you are using double buffering, you are still getting a better performance compared to a single buffer. However, you are not getting the best possible performance. There is still more room for improvement.

If you have the new PowerXCell 8i, this result might be different, since there are two memory controllers on that Cell implementation. Please share your numbers with us if you have one. You can find my measurements here.


Yesterday InternetNews.com released a piece of news about the end of the Cell Broadband Engine. David Turek, the VP of deep computing at IBM, said during an interview with the German site Heise Online that the PowerXCell 8i will be the last of the Cell line. IBM will be focusing on the POWER7 processor, which is due mid-2010.

In this news article it is mentioned, by Jon Peddie, that the Cell processor had many shortcomings that became apparent, such as the lack of direct access to global memory by its compute engines (the SPEs); it is also wrongly stated that everything has to go through its PowerPC core, which would create a bottleneck. This is technically not true: the PowerPC core does not handle any of the requests initiated by the other compute engines (the SPEs). Also, some researchers have noted that its cache should be bigger, but its performance is still considered by many to be the best among multi-core processors in its category. In addition, the Cell processor taught a lot of developers and researchers the best parallel programming practices for multi-core processors. The fact that everything is controlled by the developer forced all its programmers to think harder about the best ways to optimize their algorithm’s execution time.

Although I can somewhat believe that IBM may make changes to the Cell processor, it is very difficult to believe that IBM will end its Cell processor line that soon. IBM invested a lot of money and time, and many of its customers also invested heavily in adopting the Cell processor.

I think IBM is trying to produce its own line away from Sony and Toshiba without giving away $500 million worth of investment and five years of engineering. It is about business. The Cell processor is one of the masterpieces among multi-core processors. And as mentioned by David Turek, the future is for hybrid multi-core processors, for a very simple reason: they provide a great ratio of processing speed to consumed power.

I think IBM will reuse the SPE instruction set along with their traditional PowerPC architecture, but the change might be in how the cache is organized and managed. Also, I think IBM is rethinking the cores’ interconnect network. They may use either dynamic networks or a mix of on-chip network and shared cache architecture.


In my previous writing I tried to characterize multi-core processors and quickly pinpoint their major distinguishing features. In this blog post I would like to briefly share with you some of the untapped aspects inside the Cell Broadband Engine and the GeForce GTX 295.

The Cell Broadband Engine (CBE)
The Cell microprocessor is an innovative heterogeneous multi-core processor built by IBM, Sony, and Toshiba. 400 designers worked closely together for 5 years to invent the new heart of the PlayStation 3 (PS3). The PS3 was only a starting point for the CBE. IBM made the first stab at re-introducing the CBE as a highly capable microprocessor for compute-intensive jobs.
I won’t repeat the explanation of the CBE architecture; you can find it on Wikipedia, IBM’s website, and many other places if you Google it. I implemented several discrete algorithms, such as graph traversals and integer sorting, and scientific algorithms such as the FFT. I was also able to get into its architectural details and build a proprietary threading model, called micro-threading. This framework hides memory latency in most workloads without engaging the developer in the lowest architectural details of the Cell processor. I also ran a series of experiments to characterize the effect of its architectural properties on memory latency patterns in different workloads.
The beauty of the CBE, from my point of view, is the great extent of control it gives you to reach the best execution time. All the other architectural properties, regarding its compute capabilities, multiple levels of parallelism, and its on-chip network, are discussed in many publications, and I experienced all these great features during my PhD journey. However, I would like to note the following from some of my experiences.
Although the Element Interconnect Bus (EIB) offers high performance and very low latency, the effect of this topology on core performance is overlooked by most research efforts. For example, from my memory latency measurement experiments I found that memory latency differs from one core to another depending on how far that core is from the memory controller. This does not reflect a flaw in the design, but it is a property that changes the memory latency measured at each core. As this topology is used more, and more cores are connected to the same ring, the physical location effect will become more important. The question is: how does this affect my performance as long as I can use techniques such as multi-buffering and data prefetching? The answer is very simple: as long as the memory latency differs from one core to another, you need to hide it according to each core’s latency measurements. For example, in cores with relatively very high memory latency you can use more buffers to prefetch your data compared to other cores with lower memory latency. I already discussed this in my optimum micro-threading scheduling paper; please have a quick look at it to understand this issue better.
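
The per-core tuning idea above can be sketched as picking each core's prefetch depth from its own measured latency: cores far from the memory controller prefetch deeper. The ceil-plus-one heuristic below is my own simple assumption, not the exact formula from my scheduling paper:

```python
import math

# Choose the number of prefetch buffers per core from that core's measured
# memory latency, so computation never waits on an in-flight transfer.
def buffers_needed(latency_ns, compute_ns_per_buffer):
    """Heuristic: enough in-flight buffers to cover the latency, minimum 2
    (plain double buffering)."""
    return max(2, math.ceil(latency_ns / compute_ns_per_buffer) + 1)
```

With, say, 1000 ns of computation per buffer, a near core at 420 ns latency gets by with plain double buffering, while a far core at 2000 ns latency needs a deeper, three-buffer pipeline.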
Also, the new implementation of the CBE can now work with larger RAM, based on DDR2 technology. The initial implementation, which used XDR RAM, was limited to a maximum of 1 GB of memory. Although the new CBE implementation has a faster double-precision floating-point unit and two memory controllers to sustain high processor-memory bandwidth, memory latency is getting higher because the DDR2 RAM has higher latency than the XDR RAM of the original implementation.



NVIDIA GeForce GPGPU
I also worked on the NVIDIA GeForce GTX 295, implementing some information retrieval algorithms. It is still work in progress and yet to be published. However, with my insights from the Cell Broadband Engine, I figured out some architectural properties that are worth sharing as well.
NVIDIA’s GPU programming framework abstracts the architectural details of the microprocessor to a great extent. From a productivity point of view, this is a great feature. However, it leaves few options for researchers and experienced developers to explore. For example, I couldn’t find a clear way to control the scheduling of threads onto the processor’s cores. It is done, as far as I can tell, by the processor’s hardware scheduler, following the same policy used by Intel’s hyper-threading and Sun’s multi-threading architectures. I’m now measuring whether memory latency differs from one core to another. My initial measurements show that memory latency is almost the same across all cores. However, I’m a little concerned about the relatively many hierarchies built into the processor. I realize that the shared memory model mandates such a hierarchy to keep synchronization overhead reasonable. However, adding hierarchies to run away from this problem is not the best solution. NVIDIA is still investing in this hierarchy through their new GPU processor, Fermi.

Although multi-core processors provide a smart escape from the physical limitations of uni-core processors, they need thorough architectural analysis to best utilize their resources. I think this is attainable by monitoring different execution patterns and pinpointing bottlenecks. This should make it easier to build efficient programming models, run-time systems, and algorithms for multi- and many-core microprocessors. For example, inside the Cell Broadband Engine, manual cache management and the Element Interconnect Bus (EIB) created the opportunity to build run-time systems that get the best performance while simplifying the programming model, such as micro-threading, the MPI microtask model, and data prefetching. I think multi- and many-core processors will evolve through a closed feedback loop, as follows.

Whenever a new architecture is introduced, developers and researchers start implementing different algorithms and applications to get the best out of it. However, performance bottlenecks pop to the surface very soon. Several efforts try to solve these bottlenecks either by tweaking the implemented algorithms or by building general frameworks that identify performance-degrading parameters at run time and adjust them. The loop is properly closed, on the other hand, when microprocessor designers listen to developers’ notes and hide these bottlenecks through architectural enhancements. This, I think, was properly handled in NVIDIA’s new Fermi GPGPU architecture. For example, Fermi has multiple memory controllers, each handling the requests of a different group. This should reduce the effect of memory request serialization, which is a serious performance bottleneck.

I believe research teams are now moving away from the naive ways of speeding up multi- and many-core processors, the old tricks of algorithmic enhancement, toward digging deeper into the processor’s architectural features and proposing better programming models backed by run-time libraries and frameworks. This trend is blending compilers, operating systems, parallel programming, and microprocessor architecture together.