The slowing pace of commodity microprocessor performance improvements combined with ever increasing chip power demands has become of utmost concern to computational scientists. As a result, the high performance computing community is examining alternative architectures that address the limitations of modern cache based designs. In this work, we examine the potential of using the forthcoming Cell Processors as a building block for future high end computing systems. Our work contains several novel contributions. First, we give a background about the Cell Processor & it’s implementation history. Next, we give an overview about the architecture of the Cell Processors. Additionally, we compare Cell performance to benchmarks run on modern other architectures. Then, we derive some examples apply the Cell Broadband Engine Architecture technology to their designs. Also we give a brief about Cell software development & engineering under cell architecture & how software can doubles it’s efficiency & speed if implemented to use Cell Architecture. Finally we discuss the future of the Cell Architecture as one of the most advanced architectures in the market.
Sony Computer Entertainment, Toshiba Corporation & IBM had formed an alliance in year 2000 to design and manufacture a new generation of processors. The Cell was designed over a period of four years, using enhanced versions of the design tools for the POWER4 processor. Over 400 engineers from the three companies worked together in Austin, with critical support from eleven of IBM’s design centers. Cell combines a general purpose Power Architecture core of modest performance with streamlined co-processing elements which greatly accelerate multimedia and vector processing applications, as well as many other forms of dedicated computation. The Cell architecture includes a novel memorycoherence architecture for which IBM received many patents. The architecture emphasizes efficiency/watt, prioritizes bandwidth over latency, and favors peak computational throughput over simplicity of program code. For these reasons, Cell is widely regarded as a challenging environment for software development. In 2005, Sony Computer Entertainment had confirmed some specifications of the Cell processor that is being shipped in it’s famous gaming console Play Station 3 console. This Cell configuration have one Power processing element (PPE) on the core, with eight physical SPE3 in silicon. This PS3’s Cell is the first Cell Architecture to be in the market. Although the Cell processor of the PS3 is not that advanced compared to current cell architectures being developed in IBM plants, it competed the most advanced processors in the market proving the architecture’s efficiency.
2. CELL ARCHITECTURE
Cell takes a radical departure from conventional multiprocessor or multicore
architectures. In stead of using identical cooperating commodity processors, it uses a
conventional high performance PowerPC core that controls eight simple SIMD cores, called
synergistic processing elements (SPEs), where each SPE contains a synergistic processing unit (SPU), a local memory, and a memory flow controller. Access to external memory is handled via a 25.6GB/s XDR memory controller. The cache coherent PowerPC core, the eight SPEs, the DRAM controller, and I/O controllers are all connected via 4 data rings, collectively known as the EIB. The ring interface within each unit allows 8 bytes/cycle to be read or written. Simultaneous transfers on the same ring are possible. All transfers are orchestrated by the PowerPC core. Each SPE includes four single precision (SP) 6 cycle pipelined FMA datapaths and one double precision (DP) halfpumped (the double precision operations within a SIMD operation must be serialized) 9 cycle pipelined FMA datapath with
4 cycles of overhead for data movement. Cell has a 7 cycle in order execution pipeline and forwarding network. IBM appears to have solved the problem of inserting a 13 (9+4) cycle DP pipeline into a 7 stage in order machine by choosing the minimum effort/performance/power solution of simply stalling for 6 cycles after issuing a DP instruciton. And now we have to take each element individually to define it and give a brief about it.
2.1 Power Processor Element
The PPE is the Power Architecture based, two way multi threaded core acting as the controller for the eight SPEs, which handle most of the computational workload. The PPE will work with conventional operating systems due to its similarity to other 64bit PowerPC processors, while the SPEs are designed for vectorized floating point code execution. The PPE contains a 32 KiB instruction and a 32 KiB data Level 1 cache and a 512 KiB Level 2 cache. Additionally, IBM has included an AltiVec unit which is fully pipelined for single precision floating point. (Altivec does not support double precision floating point vectors.) Each PPU can complete two double precision operations per clock cycle using a scalar fused multiply add instruction, which translates to 6.4 GFLOPS at 3.2 GHz; or eight single precision operations per clock cycle with a vector fusedmultiplyadd instruction, which translates to 25.6 GFLOPS at 3.2 GHz.
2.2 Synergistic Processing Elements (SPE)
Each SPE is composed of a “Synergistic Processing Unit”, SPU, and a “Memory Flow Controller”, MFC (DMA, MMU, and bus interface). An SPE is a RISC processor with 128bit SIMD organization for single and double precision instructions. With the current generation of the Cell, each SPE contains a 256 KiB
embedded SRAM for instruction and data, called “Local Storage” (not to be mistaken for “Local Memory” in Sony’s documents that refer to the VRAM) which is visible to the PPE and can be
addressed directly by software. Each SPE can support up to 4 GiB of local store memory. The local store does not operate like a conventional CPU cache since it is neither transparent to software nor does it contain hardware structures that predict which data to load. The SPEs contain a 128bit, 128 entry register file and measures 14.5 mm² on a 90 nm process. An SPE can operate on 16 8bit integers, 8 16bit integers, 4 32bit integers, or 4 single precision floatingpoint numbers in a single clock cycle, as well as a memory operation. Note that the SPU cannot directly access system memory; the 64bit virtual
memory addresses formed by the SPU must be passed from the SPU to the SPE memory flow controller (MFC) to set up a DMA operation within the system address space. In one typical usage scenario, the system will load the SPEs with small programs (similar to threads), chaining the SPEs together to handle each step in a complex operation. For instance, a settop box might load programs for reading a DVD, video and audio decoding, and display, and the data would be passed off from SPE to SPE until finally ending up on the TV. Another possibility is to partition the input data set and have several SPEs performing the same kind of operation in parallel. At 3.2 GHz, each SPE gives a theoretical 25.6 GFLOPS of single precision performance. Compared to a modern personal computer,
the relatively high overall floating point performance of a Cell processor seemingly dwarfs
the abilities of the SIMD unit in desktop CPUs like the Pentium 4 and the Athlon 64. However, comparing only floating point abilities of a system is a one dimensional and application specific metric. Unlike a Cell processor, such desktop CPUs are more suited to the general purpose software usually run on personal computers. In addition to executing multiple instructions per clock, processors from Intel and AMD feature branch predictors. The Cell is designed to compensate for this with compiler assistance,
in which prepare to branch instructions are created. For double precision, as often used in personal computers, Cell performance drops by an order of magnitude, but still reaches 14 GFLOPS. Recent tests by IBM show that the SPEs can reach 98% of their theoretical peak performance using optimized parallel Matrix Multiplication. Toshiba has developed a powered by four SPEs, but no PPE, called the Spurs Engine designed to accelerate 3D and movie effects in consumer electronics.
2.3 Element Interconnect Bus (EIB)
The EIB is a communication bus internal to the Cell processor which connects the various on chip
system elements: the PPE processor, the memory controller (MIC), the eight SPE co-processors, and two off chip I/O interfaces, for a total of 12 participants. The EIB also includes an arbitration unit which functions as a set of traffic lights. In some documents IBM refers to EIB bus participants as ‘units’. The EIB is presently implemented as a circular ring comprised of four 16Bwide unidirectional channels which counter rotate in pairs. When traffic patterns permit, each channel can convey up to three transactions concurrently. As the EIB runs at half the system clock rate the effective channel rate is 16 bytes every two system clocks. At maximum concurrency, with three active transactions on each of the four rings, the peak instantaneous EIB bandwidth is 96B per clock (12 concurrent transactions * 16 bytes wide / 2 system clocks per transfer). Each participant on the EIB has one 16B read port and one 16B write port. The limit for a single participant is to read and write at a rate of 16B per EIB clock (for simplicity often regarded 8B per system clock). Note that each SPU processor contains a dedicated DMA management queue capable of scheduling long sequences of transactions to various endpoints without interfering with the SPU’s ongoing computations; these DMA queues can be managed locally or remotely as well, providing additional flexibility in the control model. Data flows on an EIB channel stepwise around the ring. Since there are twelve participants, the total number of steps around the channel back to the point of origin is twelve. Six steps is the longest distance between any pair of participants. An EIB channel is not permitted to convey data requiring more than six steps; such data must take the shorter route around the circle in the other direction. The number of steps involved in sending the packet has very little impact on transfer latency: the clock speed driving the steps is very fast relative to other considerations. However, longer communication distances are detrimental to the
overall performance of the EIB as they reduce available concurrency. Despite IBM’s original desire to implement the EIB as a more powerful crossbar, the circular configuration they adopted to spare resources rarely represents a limiting factor on the performance of the Cell chip as a whole. In the
worst case, the programmer must take extra care to schedule communication patterns where the EIB is
able to function at high concurrency levels.
2.4 Memory controller and I/O
Cell contains a dual channel next generation Rambus XIO macro which interfaces to Rambus XDR memory. The memory interface controller (MIC) is separate from the XIO macro and is designed by IBM. The XIOXDR link runs at 3.2 Gbit/s per pin. Two 32bit channels can provide a theoretical maximum of 25.6 GB/s. The system interface used in Cell, also a Rambus design, is known as FlexIO. The FlexIO interface is organized into 12 lanes, each lane being a unidirectional 8bit wide point to point path. Five 8bit wide point to point paths are inbound lanes to Cell, while the remaining seven are outbound. This provides a theoretical peak bandwidth of 62.4 GB/s (36.4 GB/s outbound, 26 GB/s inbound) at 2.6 GHz. The FlexIO interface can be clocked independently, typ. at 3.2 GHz. 4 inbound 4
outbound lanes are supporting memory coherency.
High performance computing aims at maximizing the performance of grand challenge problems such as protein folding and accurate real time weather prediction. Where in the past, performance improvements were obtained by aggressive frequency scaling using micro architecture and manufacturing techniques, technology limits require future performance improvements be obtained from exploiting parallelism with a multi core design approach. The Cell Broadband Engine is an
exciting new execution platform answering this design challenge for compute intensive applications that reflects both the requirements of future computational workloads and manufacturing constraints.
The Cell B.E. is a heterogeneous chip multiprocessor architecture with compute accelerators achieving in excess of 200 Gflops per chip. The simplicity of the SPEs and the deterministic behavior of the explicitly controlled memory hierarchy make Cell amenable to performance prediction using a
simple analytic model. Using this approach, one can easily explore multiple variations of
an algorithm without the effort of programming each variation and running on either a fully cycle accurate simulator or hardware. With the newly released cycle accurate simulator (Mambo), we have successfully validated our performance model for SGEMM, SpMV, and Stencil Computations, as will be shown in the subsequent sections. Our modeling approach is broken into two steps commensurate with the two phase double buffered computational model. The kernels were first segmented into codes that operate only on data present in the local store of the SPE. We sketched the code snippets in SPE assembly and performed static timing analysis. The latency of each operation, issue width limitations, and the operand alignment requirements of the SIMD/quadword SPE execution pipeline determined the number of cycles required. The inorder nature and fixed local store memory latency of the SPEs makes the analysis deterministic and thus more tractable than on cache based, out of order microprocessors. In the second step, we construct a model that tabulates the time required for DMA loads and stores of the
operands required by the code snippets. The model accurately reflects the constraints imposed by
resource conflicts in the memory subsystem. For instance, concurrent DMAs issued by multiple SPEs must be serialized, as there is only a single DRAM controller. The model also presumes a conservative fixed DMA initiation latency of 1000 cycles. The model computes the total time by adding all the (outer loop) times, which are themselves computed by taking the maximum of the snippet and DMA transfer times. In some cases, the periteration times are constant across iterations, but in others it varies between iterations and is inputdependent. For example, in a sparse matrix, the memory access pattern depends on the nonzero structure of the matrix, which varies across iterations. Some algorithms may also require separate stages which have different execution times; e.g., the FFT has stages for loading data, loading constants, local computation, transpose, local computation, bit reversal, and storing the
results. For simplicity we chose to model a 3.2GHz, 8 SPE version of Cell with 25.6GB/s of memory bandwidth. This version of Cell is likely to be used in the first release of the Sony PlayStation3. The lower frequency had the simplifying benefit that both the EIB and DRAM controller could deliver two SP words per cycle. The maximum flop rate of such a machine would be 204.8 Gflop/s, with a computational intensity of 32 FLOPs/ word.
Many products are being implemented right now using Cell Processors those new hardware applications will change the aspect of performance in the world. Depending on the Cell Processors as
the brain power of those applications double the performance giving new experience to users. Now
we will derive some of those applications being implemented in many advanced technological
institutes in the world.
- Blade Server
IBM announced the BladeCenter QS21. Generating a measured 1.05 Giga Floating Point Operations Per Second (GigaFLOPS) per watt, with peak performance of approximately 460 GFLOPS it is one of the most power efficient computing platforms to date. A single BladeCenter chassis can achieve 6.4 Tera Floating Point Operations Per Second (TeraFLOPS) and over 25.8 TeraFLOPS in a standard 42U rack.
4.2 Console Video Games
Sony’s Play Station 3 game console contains the first production application of the Cell processor, clocked at 3.2 GHz and containing seven out of eight operational SPEs, to allow Sony to increase the
yield on the processor manufacture. Only six of the seven SPEs are accessible to developers as one is reserved by the OS. Although PS3’s games graphics are so advanced and heavy it runs so smoothly thanks to the Cell processor cores.
4.3 Home Cinema
Reportedly, Toshiba is considering producing HDTVs using Cell. They have already presented a system to decode 48 standard definition MPEG2 streams simultaneously on a 1920×1080 screen. This can enable a viewer to choose a channel based on dozens of thumbnail videos displayed simultaneously on the screen.
4.4 Super Computing
IBM’s new planned supercomputer, IBM Roadrunner, will be a hybrid of General Purpose CISC as well as Cell processors. It is reported that this combination will produce the first computer to run at petaflop speeds. It will use an updated version of the Cell processor, manufactured using 65 nm technology and enhanced SPUs that can handle double precision calculations in the 128bit registers, reaching double precision 100 GFLOPs.
4.5 Cluster Computing
Clusters of PlayStation 3 consoles are an attractive alternative to highend systems based on Cell blades. Innovative Computing Laboratory, a group led by Jack Dongarra, in the Computer Science Department at the University of Tennessee, investigated such an application in depth. Terrasoft Solutions is selling 8 node and 32 node PS3 clusters with Yellow Dog Linux preinstalled, an implementation of Dongarra’s research. As reported by Wired Magazine on October, 17, 2007, an interesting application of using
PlayStation 3 in a cluster configuration was implemented by Astrophysicist Dr. Gaurav Khanna who replaced time used on supercomputers with a cluster of eight PlayStation 3s. The computational
Biochemistry and Biophysics lab at the Universitat Pompeu Fabra, in Barcelona, deployed in 2007 a BOINC system called PS3GRID for collaborative computing based on the CellMD software, the first one designed specifically for the Cell processor.
4.6 Distributed Computing
With the help of the computing power of over half a million PlayStation 3 consoles, the distributed computing project Folding@Home has been recognized by Guinness World Records as the most powerful distributed network in the world. The first record was achieved on September 16,
2007, as the project surpassed one petaFLOPS, which had never been reached before by a
distributed computing network. Additionally, the collective efforts enabled PS3 alone to reach the
petaFLOPS mark on September 23, 2007. In comparison, the world’s most powerful supercomputer, IBM’s BlueGene/L, performs around 280.6 teraFLOPS. This means Folding@Home’s computing power is approximately four times BlueGene/L’s (although the CPU interconnect in BlueGene/L is more than one million times faster than the mean network speed in Folding@Home.)
5. SOFTWARE ENGINEERING
Software development for the cell microprocessor involve a mixture of conventional development practices for the POWER architecturecompatible PPU core, and novel software development challenges with regards to the functionally reduced SPU co processors. As we knew from previous sections that Cell processors are multicored with very high efficient parallelism, Software applications can double their performance if they made use of this architecture. For example IBM implemented a Linux base running under Cell Processor in order to fasten the software developing under the cell
architecture. Some Linux distributions made use of this base and developed a fully functional operating system running under cell architecture like Ubuntu yellow dog. However we have no reliable multiuse
OS so far using this architecture, most viewers believe that we are going to have reliable ones soon.
6. CELL FUTURE (Cell inside)
It’s well believed that Cell Processors will replace current processors in the next decade to replace current architectures in personal computers thanks to it’s performance and efficiency in addition to it’s low production cost. As with IBM already claiming the Cell processor can run current PowerPC software, it’s not hard to imagine Apple adopting it for future CPUs. A single 4.0 GHz Cell processor in an iBook or Mac mini would undoubtedly run circles around today’s 1.251.33 GHz entry level
Macs, and a quad processors Power Mac at 4.0 GHz should handily outperform today’s 2.5 GHz Power Mac G5. Then having most of Software and Hardware producers producing compatible Cell architecture products.
This paper is dedicated for Academic purposes submitted as a research report, German University in Cairo.
 Cactus Home Page.
 Cell Broadband Engine Architecture and its first implementation, IBM.
 A streaming processing unit for a cell processor.