March 2010

Performance auto-tuning is gaining attention as multi-core processors become more complex. Current petascale machines contain hundreds of thousands of cores, and it is very difficult to reach the best performance on them by optimizing algorithm execution manually. Performance auto-tuning is therefore becoming a very important area of research. Efforts to design and build exascale machines are actively under way. These machines will run billions of threads concurrently on hundreds of millions of cores. Performance monitoring and optimization will become a more challenging, and at the same time more interesting, problem.

Current auto-tuning efforts focus on optimizing the execution of algorithms at the micro level, on the expectation that the gains will aggregate into better performance across thousands of CPUs with tens of thousands of cores. Samuel Williams, for example, tested several in-core and out-of-core automated source code optimizations on stencil algorithms. In this research he, among other researchers, built auto-tuners for leading HPC architectures such as the Cell processor, GPGPUs, Sun Niagara, Power6, and Xeon processors. I’m impressed by the relatively large number of architectures he and his team tested this algorithm on.

However, after reading his and other related papers, I had two questions. Does auto-tuning at the level of each core or microprocessor guarantee the best performance for the whole system? And aren’t there run-time parameters that should be considered in auto-tuning, instead of focusing only on compile-time auto-tuning? For example, memory latency varies at run-time with the resource scheduling policies and with changes in workload.

Auto-tuning should be done collaboratively across all layers of the system, including operating systems, programming models and frameworks, run-time libraries, and applications. This is relatively simple today, since most multi-/many-core architectures are managed by run-time libraries and operating systems are not yet seriously involved in managing multi-core processors. For example, an NVIDIA GPGPU is managed by the CUDA run-time environment transparently to the operating system. It might be better to keep it this way, since GPGPUs do not have direct access to system-wide resources such as the host system’s memory and I/O devices. However, as these architectures evolve, they will need access to the system’s resources, and operating systems will play a bigger role in managing hundreds of cores. Have a look at this posting to understand more about the concerns of performance auto-tuning.

Auto-tuning should also focus on run-time parameters that affect the performance of these automatically tuned applications. It is becoming very difficult to predict exact system behavior and, consequently, to estimate accurately the different latencies that affect performance. For example, memory latency and bandwidth are not determined by compile-time parameters alone. They are also affected by thread affinity, thread scheduling, and other run-time system parameters such as page size and TLB behavior.
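As a rough illustration of run-time tuning, the sketch below times a kernel under a few candidate values of one run-time parameter and keeps the fastest. The `blocked_sum` kernel, the `tune` helper, and the candidate block sizes are hypothetical names chosen for this example only; a real tuner would vary parameters such as thread affinity, page size, or stream count.

```python
import time

def blocked_sum(data, block):
    """Toy kernel (hypothetical workload): sum a list in chunks of `block` elements."""
    total = 0
    for start in range(0, len(data), block):
        total += sum(data[start:start + block])
    return total

def tune(kernel, data, candidates, repeats=3):
    """Time `kernel` for each candidate value; return (fastest_value, timings)."""
    timings = {}
    for value in candidates:
        best = float("inf")
        for _ in range(repeats):
            t0 = time.perf_counter()
            kernel(data, value)
            # Keep the best of several runs to reduce timing noise.
            best = min(best, time.perf_counter() - t0)
        timings[value] = best
    return min(timings, key=timings.get), timings

data = list(range(100_000))
best_block, timings = tune(blocked_sum, data, [64, 1024, 16384])
print("selected block size:", best_block)
```

The selected value depends on the machine and its current load, which is exactly why this kind of search has to happen at run-time rather than at compile-time.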

I think run-time performance auto-tuning deserves more attention for large systems. At first it may look as if the limited control given to developers on some microprocessors makes the best run-time parameterization very difficult or impossible to achieve. However, I see some leading architectures giving control back to developers, sometimes indirectly. For example, the streaming features of GPGPUs open up room to optimize the size, timing, and number of streams based on run-time memory performance. The zero-copy feature introduced in the NVIDIA GTX-295 GPUs also makes run-time performance optimization possible. I will post more details about the auto-tuning possibilities on these architectures.


The slowing pace of commodity microprocessor performance improvements combined with ever increasing chip power demands has become of utmost concern to computational scientists. As a result, the high performance computing community is examining alternative architectures that address the limitations of modern cache based designs. In this work, we examine the potential of using the forthcoming Cell processor as a building block for future high end computing systems. Our work contains several contributions. First, we give background on the Cell processor and its implementation history. Next, we give an overview of the architecture of the Cell processor. We then compare Cell performance to benchmarks run on other modern architectures, and present some example products that apply the Cell Broadband Engine Architecture in their designs. We also briefly discuss software development for the Cell architecture and how software can double its efficiency and speed if it is implemented to use the Cell architecture. Finally, we discuss the future of the Cell architecture as one of the most advanced architectures in the market.


Sony Computer Entertainment, Toshiba Corporation, and IBM formed an alliance in 2000 to design and manufacture a new generation of processors. The Cell was designed over a period of four years, using enhanced versions of the design tools for the POWER4 processor. Over 400 engineers from the three companies worked together in Austin, with critical support from eleven of IBM’s design centers. Cell combines a general purpose Power Architecture core of modest performance with streamlined co-processing elements which greatly accelerate multimedia and vector processing applications, as well as many other forms of dedicated computation. The Cell architecture includes a novel memory coherence architecture for which IBM received many patents. The architecture emphasizes efficiency per watt, prioritizes bandwidth over latency, and favors peak computational throughput over simplicity of program code. For these reasons, Cell is widely regarded as a challenging environment for software development. In 2005, Sony Computer Entertainment confirmed some specifications of the Cell processor that ships in its PlayStation 3 gaming console. This Cell configuration has one Power Processing Element (PPE) on the core, with eight physical SPEs in silicon. The PS3’s Cell is the first Cell implementation to reach the market. Although the Cell processor of the PS3 is not as advanced as the Cell designs currently being developed at IBM, it has competed with the most advanced processors in the market, demonstrating the architecture’s efficiency.


Cell takes a radical departure from conventional multiprocessor or multicore architectures. Instead of using identical cooperating commodity processors, it uses a conventional high performance PowerPC core that controls eight simple SIMD cores, called synergistic processing elements (SPEs), where each SPE contains a synergistic processing unit (SPU), a local memory, and a memory flow controller. Access to external memory is handled via a 25.6 GB/s XDR memory controller. The cache coherent PowerPC core, the eight SPEs, the DRAM controller, and the I/O controllers are all connected via four data rings, collectively known as the EIB. The ring interface within each unit allows 8 bytes/cycle to be read or written. Simultaneous transfers on the same ring are possible. All transfers are orchestrated by the PowerPC core. Each SPE includes four single precision (SP) 6-cycle pipelined FMA datapaths and one double precision (DP) half-pumped 9-cycle pipelined FMA datapath (the double precision operations within a SIMD operation must be serialized) with 4 cycles of overhead for data movement. Cell has a 7-cycle in-order execution pipeline and forwarding network. IBM appears to have solved the problem of inserting a 13-cycle (9+4) DP pipeline into a 7-stage in-order machine by choosing the minimum effort/performance/power solution of simply stalling for 6 cycles after issuing a DP instruction. We now describe each element individually.

2.1 Power Processor Element

The PPE is the Power Architecture based, two-way multithreaded core acting as the controller for the eight SPEs, which handle most of the computational workload. The PPE will work with conventional operating systems due to its similarity to other 64-bit PowerPC processors, while the SPEs are designed for vectorized floating point code execution. The PPE contains a 32 KiB instruction and a 32 KiB data Level 1 cache and a 512 KiB Level 2 cache. Additionally, IBM has included an AltiVec unit which is fully pipelined for single precision floating point. (AltiVec does not support double precision floating point vectors.) Each PPU can complete two double precision operations per clock cycle using a scalar fused multiply-add instruction, which translates to 6.4 GFLOPS at 3.2 GHz, or eight single precision operations per clock cycle with a vector fused multiply-add instruction, which translates to 25.6 GFLOPS at 3.2 GHz.

2.2 Synergistic Processing Elements (SPE)

Each SPE is composed of a “Synergistic Processing Unit” (SPU) and a “Memory Flow Controller” (MFC: DMA, MMU, and bus interface). An SPE is a RISC processor with 128-bit SIMD organization for single and double precision instructions. With the current generation of the Cell, each SPE contains a 256 KiB embedded SRAM for instructions and data, called “Local Storage” (not to be mistaken for “Local Memory” in Sony’s documents, which refers to the VRAM), which is visible to the PPE and can be addressed directly by software. Each SPE can support up to 4 GiB of local store memory. The local store does not operate like a conventional CPU cache, since it is neither transparent to software nor does it contain hardware structures that predict which data to load. Each SPE contains a 128-bit, 128-entry register file and measures 14.5 mm² on a 90 nm process. An SPE can operate on 16 8-bit integers, 8 16-bit integers, 4 32-bit integers, or 4 single precision floating point numbers in a single clock cycle, as well as a memory operation. Note that the SPU cannot directly access system memory; the 64-bit virtual memory addresses formed by the SPU must be passed from the SPU to the SPE memory flow controller (MFC) to set up a DMA operation within the system address space.

In one typical usage scenario, the system will load the SPEs with small programs (similar to threads), chaining the SPEs together to handle each step in a complex operation. For instance, a set-top box might load programs for reading a DVD, video and audio decoding, and display, and the data would be passed off from SPE to SPE until finally ending up on the TV. Another possibility is to partition the input data set and have several SPEs performing the same kind of operation in parallel. At 3.2 GHz, each SPE gives a theoretical 25.6 GFLOPS of single precision performance.

Compared to a modern personal computer, the relatively high overall floating point performance of a Cell processor seemingly dwarfs the abilities of the SIMD unit in desktop CPUs like the Pentium 4 and the Athlon 64. However, comparing only the floating point abilities of a system is a one-dimensional and application specific metric. Unlike a Cell processor, such desktop CPUs are more suited to the general purpose software usually run on personal computers. In addition to executing multiple instructions per clock, processors from Intel and AMD feature branch predictors. The Cell is designed to compensate for the lack of one with compiler assistance, in which prepare-to-branch instructions are created. For double precision, as often used in personal computers, Cell performance drops by an order of magnitude, but still reaches 14 GFLOPS. Recent tests by IBM show that the SPEs can reach 98% of their theoretical peak performance using optimized parallel matrix multiplication. Toshiba has developed a co-processor powered by four SPEs, but no PPE, called the SpursEngine, designed to accelerate 3D and movie effects in consumer electronics.
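The SIMD lane counts and the 25.6 GFLOPS single precision figure quoted above can be reproduced with a few lines of arithmetic. This is only a sketch: the constant and function names are made up for the example, and it assumes the usual convention that one fused multiply-add counts as two floating point operations.

```python
REGISTER_BITS = 128   # SPE SIMD register width, from the text
CLOCK_GHZ = 3.2       # SPE clock rate, from the text
FLOPS_PER_FMA = 2     # assumption: one fused multiply-add = 2 flops

def simd_lanes(element_bits):
    """Number of elements of the given width that fit in one SPE register."""
    return REGISTER_BITS // element_bits

def spe_peak_sp_gflops():
    """Peak single precision rate: one FMA per 32-bit lane per cycle."""
    return simd_lanes(32) * FLOPS_PER_FMA * CLOCK_GHZ

for bits in (8, 16, 32):
    print(f"{bits}-bit elements per register: {simd_lanes(bits)}")
print(f"peak SP GFLOPS per SPE: {spe_peak_sp_gflops():.1f}")
```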

2.3 Element Interconnect Bus (EIB)

The EIB is a communication bus internal to the Cell processor which connects the various on-chip system elements: the PPE processor, the memory controller (MIC), the eight SPE co-processors, and two off-chip I/O interfaces, for a total of 12 participants. The EIB also includes an arbitration unit which functions as a set of traffic lights. In some documents IBM refers to EIB bus participants as “units”. The EIB is presently implemented as a circular ring comprising four 16-byte-wide unidirectional channels which counter-rotate in pairs. When traffic patterns permit, each channel can convey up to three transactions concurrently. As the EIB runs at half the system clock rate, the effective channel rate is 16 bytes every two system clocks. At maximum concurrency, with three active transactions on each of the four rings, the peak instantaneous EIB bandwidth is 96 bytes per clock (12 concurrent transactions × 16 bytes wide / 2 system clocks per transfer). Each participant on the EIB has one 16-byte read port and one 16-byte write port. The limit for a single participant is to read and write at a rate of 16 bytes per EIB clock (for simplicity often regarded as 8 bytes per system clock). Note that each SPU processor contains a dedicated DMA management queue capable of scheduling long sequences of transactions to various endpoints without interfering with the SPU’s ongoing computations; these DMA queues can be managed locally or remotely as well, providing additional flexibility in the control model.

Data flows on an EIB channel stepwise around the ring. Since there are twelve participants, the total number of steps around the channel back to the point of origin is twelve. Six steps is the longest distance between any pair of participants. An EIB channel is not permitted to convey data requiring more than six steps; such data must take the shorter route around the circle in the other direction. The number of steps involved in sending a packet has very little impact on transfer latency, because the clock speed driving the steps is very fast relative to other considerations. However, longer communication distances are detrimental to the overall performance of the EIB, as they reduce the available concurrency. Despite IBM’s original desire to implement the EIB as a more powerful crossbar, the circular configuration it adopted to spare resources rarely represents a limiting factor on the performance of the Cell chip as a whole. In the worst case, the programmer must take extra care to schedule communication patterns so that the EIB is able to function at high concurrency levels.
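The peak EIB figure and the six-step limit described above follow directly from the numbers in the text. The sketch below recomputes both; the function names are illustrative and not part of any IBM tool, and participants are simply numbered 0 to 11 around the ring as an assumption.

```python
RINGS = 4                        # unidirectional channels, from the text
TRANSFERS_PER_RING = 3           # concurrent transactions per channel
BYTES_PER_TRANSFER = 16          # channel width in bytes
SYSTEM_CLOCKS_PER_EIB_CLOCK = 2  # EIB runs at half the system clock

def eib_peak_bytes_per_system_clock():
    """Peak instantaneous EIB bandwidth in bytes per system clock."""
    concurrent = RINGS * TRANSFERS_PER_RING
    return concurrent * BYTES_PER_TRANSFER / SYSTEM_CLOCKS_PER_EIB_CLOCK

def ring_steps(src, dst, participants=12):
    """Steps along the shorter direction between two ring positions."""
    d = abs(src - dst) % participants
    return min(d, participants - d)

print(eib_peak_bytes_per_system_clock())               # bytes per system clock
print(max(ring_steps(0, dst) for dst in range(12)))    # longest shortest route
```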

2.4 Memory controller and I/O

Cell contains a dual channel next generation Rambus XIO macro which interfaces to Rambus XDR memory. The memory interface controller (MIC) is separate from the XIO macro and is designed by IBM. The XIO-XDR link runs at 3.2 Gbit/s per pin. Two 32-bit channels can provide a theoretical maximum of 25.6 GB/s. The system interface used in Cell, also a Rambus design, is known as FlexIO. The FlexIO interface is organized into 12 lanes, each lane being a unidirectional 8-bit-wide point-to-point path. Five 8-bit-wide point-to-point paths are inbound lanes to Cell, while the remaining seven are outbound. This provides a theoretical peak bandwidth of 62.4 GB/s (36.4 GB/s outbound, 26 GB/s inbound) at 2.6 GHz. The FlexIO interface can be clocked independently, typically at 3.2 GHz. Four inbound and four outbound lanes support memory coherency.
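The memory and I/O bandwidth figures above can be cross-checked with simple arithmetic. The sketch below uses only numbers quoted in the text; the function and variable names are hypothetical.

```python
def xdr_bandwidth_gb_s(channels=2, bits_per_channel=32, gbit_per_pin=3.2):
    """Theoretical XDR bandwidth: total pins x per-pin rate, converted to bytes."""
    return channels * bits_per_channel * gbit_per_pin / 8

# FlexIO figures quoted in the text (GB/s) and the lane split.
OUTBOUND_GB_S, OUTBOUND_LANES = 36.4, 7
INBOUND_GB_S, INBOUND_LANES = 26.0, 5

per_lane_out = OUTBOUND_GB_S / OUTBOUND_LANES
per_lane_in = INBOUND_GB_S / INBOUND_LANES

print(f"XDR: {xdr_bandwidth_gb_s():.1f} GB/s")
print(f"FlexIO total: {OUTBOUND_GB_S + INBOUND_GB_S:.1f} GB/s")
print(f"FlexIO per lane: {per_lane_out:.1f} out, {per_lane_in:.1f} in (GB/s)")
```

Both directions work out to the same per-lane rate, which is a useful consistency check on the quoted 62.4 GB/s total.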


High performance computing aims at maximizing the performance of grand challenge problems such as protein folding and accurate real time weather prediction. Where in the past performance improvements were obtained by aggressive frequency scaling using microarchitecture and manufacturing techniques, technology limits now require future performance improvements to be obtained by exploiting parallelism with a multi-core design approach. The Cell Broadband Engine is an exciting new execution platform answering this design challenge for compute intensive applications, one that reflects both the requirements of future computational workloads and manufacturing constraints.

The Cell B.E. is a heterogeneous chip multiprocessor architecture with compute accelerators achieving in excess of 200 Gflops per chip. The simplicity of the SPEs and the deterministic behavior of the explicitly controlled memory hierarchy make Cell amenable to performance prediction using a simple analytic model. Using this approach, one can easily explore multiple variations of an algorithm without the effort of programming each variation and running it on either a fully cycle accurate simulator or hardware. With the newly released cycle accurate simulator (Mambo), we have successfully validated our performance model for SGEMM, SpMV, and stencil computations, as will be shown in the subsequent sections.

Our modeling approach is broken into two steps commensurate with the two-phase double buffered computational model. The kernels were first segmented into code snippets that operate only on data present in the local store of the SPE. We sketched the code snippets in SPE assembly and performed static timing analysis. The latency of each operation, issue width limitations, and the operand alignment requirements of the SIMD/quadword SPE execution pipeline determined the number of cycles required. The in-order nature and fixed local store memory latency of the SPEs make the analysis deterministic and thus more tractable than on cache based, out-of-order microprocessors. In the second step, we construct a model that tabulates the time required for DMA loads and stores of the operands required by the code snippets. The model accurately reflects the constraints imposed by resource conflicts in the memory subsystem. For instance, concurrent DMAs issued by multiple SPEs must be serialized, as there is only a single DRAM controller. The model also presumes a conservative fixed DMA initiation latency of 1000 cycles. The model computes the total time by adding all the per-iteration (outer loop) times, which are themselves computed by taking the maximum of the snippet and DMA transfer times. In some cases the per-iteration times are constant across iterations, but in others they vary between iterations and are input-dependent. For example, in a sparse matrix, the memory access pattern depends on the nonzero structure of the matrix, which varies across iterations. Some algorithms may also require separate stages which have different execution times; e.g., the FFT has stages for loading data, loading constants, local computation, transpose, local computation, bit reversal, and storing the results.

For simplicity we chose to model a 3.2 GHz, 8 SPE version of Cell with 25.6 GB/s of memory bandwidth. This version of Cell is likely to be used in the first release of the Sony PlayStation 3. The lower frequency had the simplifying benefit that both the EIB and DRAM controller could deliver two SP words per cycle. The maximum flop rate of such a machine would be 204.8 Gflop/s, with a computational intensity of 32 FLOPs/word.
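The double buffered timing model described above can be sketched in a few lines. This is an illustrative reconstruction from the prose, not the authors' actual tool: the function names and the example cycle counts are invented, and the DMA cost is assumed to be the fixed 1000-cycle initiation latency plus the transfer time at 8 bytes (two SP words) per cycle.

```python
DMA_LATENCY_CYCLES = 1000   # fixed initiation latency, from the text
BYTES_PER_CYCLE = 8         # two 4-byte SP words per cycle, from the text

def dma_cycles(num_bytes):
    """Assumed DMA cost: fixed latency plus bandwidth-limited transfer time."""
    return DMA_LATENCY_CYCLES + num_bytes / BYTES_PER_CYCLE

def total_cycles(iterations):
    """Sum over outer-loop iterations of max(compute, DMA), since double
    buffering overlaps each iteration's computation with its transfers."""
    return sum(max(compute, dma) for compute, dma in iterations)

# Hypothetical kernel: two iterations, each moving 16 KiB.
transfer = dma_cycles(16 * 1024)
iterations = [(4000, transfer), (2000, transfer)]
print("total cycles:", total_cycles(iterations))

# Peak rate and computational intensity of the modelled machine.
peak_gflops = 8 * 25.6                # 8 SPEs x 25.6 Gflop/s each
gwords_per_s = 25.6 / 4               # 25.6 GB/s of 4-byte SP words
print("peak Gflop/s:", peak_gflops)
print("flops per word:", peak_gflops / gwords_per_s)
```

In this example the first iteration is compute-bound and the second is DMA-bound, which is the kind of per-iteration variation the model is meant to capture.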


Many products are now being built around Cell processors, and these new hardware applications will change what users expect of performance. Using Cell processors as the brains of these products can double their performance and give users a new experience. We now describe some of the applications being implemented at advanced technology institutions around the world.

4.1 Blade Server

IBM announced the BladeCenter QS21. Generating a measured 1.05 Giga Floating Point Operations Per Second (GigaFLOPS) per watt, with peak performance of approximately 460 GFLOPS it is one of the most power efficient computing platforms to date. A single BladeCenter chassis can achieve 6.4 Tera Floating Point Operations Per Second (TeraFLOPS) and over 25.8 TeraFLOPS in a standard 42U rack.

4.2 Console Video Games

Sony’s PlayStation 3 game console contains the first production application of the Cell processor, clocked at 3.2 GHz and containing seven out of eight operational SPEs, which allows Sony to increase the yield in processor manufacturing. Only six of the seven SPEs are accessible to developers, as one is reserved by the OS. Although PS3 games have advanced and demanding graphics, they run smoothly thanks to the Cell processor cores.

4.3 Home Cinema

Reportedly, Toshiba is considering producing HDTVs using Cell. They have already presented a system to decode 48 standard definition MPEG2 streams simultaneously on a 1920×1080 screen. This can enable a viewer to choose a channel based on dozens of thumbnail videos displayed simultaneously on the screen.

4.4 Super Computing

IBM’s new planned supercomputer, IBM Roadrunner, will be a hybrid of General Purpose CISC as well as Cell processors. It is reported that this combination will produce the first computer to run at petaflop speeds. It will use an updated version of the Cell processor, manufactured using 65 nm technology and enhanced SPUs that can handle double precision calculations in the 128bit registers, reaching double precision 100 GFLOPs.

4.5 Cluster Computing

Clusters of PlayStation 3 consoles are an attractive alternative to high-end systems based on Cell blades. The Innovative Computing Laboratory, a group led by Jack Dongarra in the Computer Science Department at the University of Tennessee, investigated such an application in depth. Terrasoft Solutions is selling 8-node and 32-node PS3 clusters with Yellow Dog Linux preinstalled, an implementation of Dongarra’s research. As reported by Wired Magazine on October 17, 2007, an interesting application of PlayStation 3 consoles in a cluster configuration was implemented by astrophysicist Dr. Gaurav Khanna, who replaced time used on supercomputers with a cluster of eight PlayStation 3s. The Computational Biochemistry and Biophysics lab at the Universitat Pompeu Fabra, in Barcelona, deployed in 2007 a BOINC system called PS3GRID for collaborative computing based on the CellMD software, the first one designed specifically for the Cell processor.

4.6 Distributed Computing

With the help of the computing power of over half a million PlayStation 3 consoles, the distributed computing project Folding@Home has been recognized by Guinness World Records as the most powerful distributed network in the world. The first record was achieved on September 16, 2007, as the project surpassed one petaFLOPS, which had never been reached before by a distributed computing network. Additionally, the collective efforts enabled the PS3 alone to reach the petaFLOPS mark on September 23, 2007. In comparison, the world’s most powerful supercomputer, IBM’s BlueGene/L, performs around 280.6 teraFLOPS. This means Folding@Home’s computing power is approximately four times BlueGene/L’s (although the CPU interconnect in BlueGene/L is more than one million times faster than the mean network speed in Folding@Home).


Software development for the Cell microprocessor involves a mixture of conventional development practices for the POWER architecture-compatible PPU core and novel software development challenges with regard to the functionally reduced SPU co-processors. As we saw in previous sections, Cell processors are multi-core with highly efficient parallelism, so software applications can double their performance if they make use of this architecture. For example, IBM implemented a Linux base running on the Cell processor in order to speed up software development for the Cell architecture. Some Linux distributions, such as Ubuntu and Yellow Dog Linux, built on this base to develop a fully functional operating system running on the Cell architecture. However, there is no reliable multi-use OS for this architecture so far, although most observers believe that reliable ones will arrive soon.

6. CELL FUTURE (Cell inside)

It is widely believed that Cell processors will replace current architectures in personal computers within the next decade, thanks to their performance and efficiency in addition to their low production cost. With IBM already claiming that the Cell processor can run current PowerPC software, it’s not hard to imagine Apple adopting it for future CPUs. A single 4.0 GHz Cell processor in an iBook or Mac mini would undoubtedly run circles around today’s 1.25 to 1.33 GHz entry level Macs, and a quad-processor Power Mac at 4.0 GHz should handily outperform today’s 2.5 GHz Power Mac G5. Most software and hardware producers would then follow with Cell-compatible products.


This paper was written for academic purposes and submitted as a research report at the German University in Cairo.




Nvidia Fermi is the codename of Nvidia’s new GPU architecture. Nvidia announced it in the second half of 2009 as a game changing architecture.

Competition & Long Term Strategy

Nvidia is facing tough competition from its two main rivals, Intel and AMD. Both produce their own CPUs and GPUs, while Nvidia produces only GPUs. Nvidia tried to ease itself into a new market, the chipset market, by releasing custom Nvidia chipsets that incorporated a low end Nvidia GPU as an alternative to Intel’s Media Accelerator. These chipsets showed superior graphics performance compared to Intel’s solution. Several companies included these chipsets in their laptops to provide consumers with a better GPU experience in the low end graphics market. Several companies also built this chipset into what is called the Hybrid SLI architecture. Basically, the Hybrid SLI architecture allows a laptop to have two GPUs on board: a weak low end one which drains very little battery power, and a strong high end one. The user can switch dynamically between the two according to preference. Nvidia also released a chipset for the new Atom processor, which is widely used in current netbooks. Intel did not like this competition and felt threatened by Nvidia. Intel therefore did not give Nvidia the right to release chipsets for its new Core i architecture, and also sold the Atom processor bundled with its own chipset for less than the processor alone, thus driving Nvidia out of the market entirely.

With Nvidia locked out of the CPU market and its chipset market, it had only the GPU market to compete in. Five main markets (seismology, supercomputing, university research workstations, defense, and finance) could together represent about 1 billion dollars in turnover, so Nvidia had to find a way to compete better. Nvidia saw a great opportunity in using the GPU’s large number of processor cores for general computing applications. It saw this as a new, untapped, and very promising market that could allow Nvidia to widen its market share and revenues.

Nvidia started to research the use of GPUs for high performance computing applications such as protein folding, stock options pricing, SQL queries, and MRI reconstruction. Nvidia released its G80 based architecture cards in 2006 to address these applications. This was followed by the GT200 architecture in 2008/2009, which was built on G80’s architecture but provided better performance. While these architectures targeted what is called GPGPU, or general purpose GPU, they were limited in the sense that they targeted only specific applications and not all applications. The GPGPU model had several drawbacks. It required the programmer to possess intimate knowledge of graphics APIs and GPU architecture. Problems had to be expressed in terms of vertex coordinates, textures, and shader programs, which greatly increased program complexity. Basic programming features such as random reads and writes to memory were not supported, which greatly restricted the programming model. Finally, the lack of double precision support meant that some scientific applications could not be run on the GPU.

Nvidia got around this by introducing two new technologies: the G80 unified graphics and compute architecture, and CUDA, a combined software and hardware architecture that allows the GPU to be programmed with a variety of high level programming languages such as C and C++. Instead of using graphics APIs to model and program problems, the programmer can write C programs with CUDA extensions and target a general purpose massively parallel processor. This way of GPU programming is commonly known as “GPU Computing”. It allowed for broader application support and programming language support.

Nvidia took what it had learned from its experience with the G80 and GT200 architectures to build a GPU with a strong emphasis on a better GPU Computing experience while at the same time giving a superior graphics experience for normal GPU use. Nvidia based its Fermi architecture on these two goals and regarded them as its long term strategy.

The Fermi architecture

Things needed to be changed

To allow Fermi to truly support “GPU Computing”, some changes to the architecture had to be made. These changes can be summarized as follows:

  • Improve double precision performance: many high performance computing applications make use of double precision operations. Nvidia had to increase the performance of DP operations in order to attract these markets.
  • ECC (Error Correcting Code): ECC protects users who run data sensitive computations on GPUs, such as medical imaging and financial options pricing, against memory errors.
  • True Cache Hierarchy: GPU architectures developed before Fermi did not contain an L1 cache; instead they contained a shared memory. Some users run algorithms that need a true L1 cache.
  • More Shared Memory: Some users required more shared memory than the 16KB per SM.
  • Faster Context Switching: This allows for faster switches between GPU computing applications and normal graphics applications.
  • Faster Atomic Operations: These are read-modify-write operations that execute as a single indivisible step.

General Overview of the Fermi Architecture

  • 3 Billion transistors
  • 40nm TSMC
  • 384 bit memory interface (64 x 6)
  • 512 Shader Cores (CUDA cores)
  • 32 CUDA cores per shader cluster
  • 16 Shader Clusters
  • 1MB L1 Cache (64KB per shader cluster)
  • 768KB Unified L2 Cache
  • Up to 6GB GDDR5 Memory
  • IEEE 754 – 2008 Double precision standard
  • Six 64 bit memory controllers
  • ECC Support
  • 512 FMA in SP
  • 256 FMA in DP
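Several of the totals in the list above are products of the per-cluster figures, and a quick check confirms they are consistent with each other. The variable names below are made up for the example; the DP figure assumes, as the list implies, that double precision FMA throughput is half the single precision rate.

```python
SHADER_CLUSTERS = 16          # from the list
CORES_PER_CLUSTER = 32        # from the list
L1_KB_PER_CLUSTER = 64        # from the list
MEMORY_CONTROLLERS = 6        # from the list
BITS_PER_CONTROLLER = 64      # from the list

total_cores = SHADER_CLUSTERS * CORES_PER_CLUSTER
memory_interface_bits = MEMORY_CONTROLLERS * BITS_PER_CONTROLLER
total_l1_kb = SHADER_CLUSTERS * L1_KB_PER_CLUSTER
sp_fma_per_clock = total_cores            # one SP FMA per core per clock
dp_fma_per_clock = sp_fma_per_clock // 2  # assumption: DP runs at half rate

print(total_cores, memory_interface_bits, total_l1_kb)
print(sp_fma_per_clock, dp_fma_per_clock)
```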

3 billion transistors is a huge number; compared with its closest competitor, at just over 2 billion transistors, it shows how big the Nvidia Fermi will be. To fit this many transistors, Nvidia had to switch from the 55nm fabrication process to the 40nm process. This allowed Nvidia to put this huge number of transistors on a die without compromising on size and flexibility. But it also resulted in a very long delay in shipping the chip. Because the fabrication process was relatively new and each chip carried a huge number of transistors, the yield of every wafer turned out to be very small, even smaller than expected. This hindered any hopes of mass producing the chip for large scale retail.

In Fermi, Nvidia aimed for a truly scalable architecture. Nvidia grouped every 4 SMs (Streaming Multiprocessors) into something called a Graphics Processing Cluster, or GPC. Each GPC is in a sense a GPU on its own, which allows Nvidia to scale GPU cards up or down by increasing or decreasing the number of GPCs. Scalability can also be achieved by changing the number of SMs per GPC. Each GPC has its own rasterization engine, which serves the 4 SMs that the GPC contains.

The SM Architecture

Each SM contains 32 stream processors, or CUDA cores. This is 4x the number of CUDA cores per SM in the previous GT200 architecture. Each SM contains the following:

  • 32 CUDA cores
  • 4 SFU (Special Function Unit)
  • 32K (32,768) FP32 registers per SM. Up from 16K in GT200
  • 4 Texture Units
  • A PolyMorph (geometry) engine
  • 64KB configurable Shared Memory / L1 Cache
  • 2 Warp Schedulers
  • 2 Dispatch Units
  • 16 load / store units

Each of the SM’s 32 CUDA cores contains a fully pipelined ALU and FPU. A CUDA core can perform 1 integer or floating point instruction per clock per thread in SP mode. DP mode has improved hugely: DP instructions now take only 2 times as long as SP ones, compared to 8 times as long in previous architectures. Instructions can also be mixed, for example FP + INT, FP + FP, SFU + FP and more. But while DP instructions are running, nothing else can run.

Fermi also uses the new IEEE 754-2008 standard for floating point arithmetic instead of the now obsolete IEEE 754-1985 one, which Nvidia used in previous architectures. Under the old standard, Nvidia handled one frequently used sequence of operations, multiplying two numbers and adding the result to a third, with a special instruction called MAD. MAD stands for Multiply-Add, and it allowed both operations to be performed in a single clock. The MAD instruction performs the multiplication with truncation, followed by an addition with rounding to the nearest even. While this was acceptable for graphics applications, it didn’t meet the accuracy that GPU computing needs. Therefore, with the adoption of the new standard, Nvidia introduced a new instruction called FMA, or Fused Multiply-Add, which supports both 32 bit single precision and 64 bit double precision floating point numbers. FMA improves upon MAD by keeping the intermediate product at full precision, with no truncation, and rounding only once after the addition. This allows precise mathematical calculations to be run on the GPU.
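The effect of rounding once instead of twice can be seen with a small sketch. It is an emulation under stated assumptions, not GPU code: it runs in Python double precision on the CPU, the unfused version rounds the product to nearest (MAD truncated it instead), and the fused version is imitated with exact rational arithmetic followed by a single rounding. The function names are hypothetical.

```python
from fractions import Fraction


def mul_then_add(a: float, b: float, c: float) -> float:
    """Two roundings: the product is rounded first, then the sum is rounded."""
    return a * b + c


def fused_mul_add(a: float, b: float, c: float) -> float:
    """One rounding: a*b+c is computed exactly, then rounded a single time."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)


# The exact product is 1 - 2**-60, which does not fit in a double
# and is rounded to 1.0 when the multiply is a separate step.
a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
c = -1.0

print(mul_then_add(a, b, c))   # 0.0, the small term is lost
print(fused_mul_add(a, b, c))  # equals -(2.0**-60), the small term survives
```

The unfused result loses the answer completely, while the fused result keeps it, which is the kind of difference that matters for numerical GPU computing code.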

CUDA is a hardware and software blend that allows Nvidia GPUs to be programmed with a wide range of programming languages. A CUDA program calls parallel kernels. Each kernel executes in parallel across a set of parallel threads. The GPU instantiates a kernel as a grid of parallel thread blocks, where each thread within a thread block executes an instance of the kernel.

Thread blocks are groups of concurrently executing threads that can cooperate among themselves through shared memory. Each thread within a thread block has its own per-thread private local memory, while the thread block has its per-block shared memory. This per-block shared memory helps with inter-thread communication, data sharing and result sharing between the threads inside the same thread block. At the grid level, there is also a per-application-context global memory. A grid is an array of blocks that execute the same kernel.

This hierarchy allows the GPU to execute one or more kernel grids, a streaming multiprocessor (SM) to execute one or more thread blocks and CUDA cores and other execution units in the SM to execute threads. Nvidia groups 32 threads in something called a warp. Each SM has two warp schedulers and two instruction dispatch units. This allows for two warps to be issued and executed in parallel on each SM.

Execution takes place as follows. The dual warp schedulers inside each SM choose 2 warps for execution, and one instruction from each warp is issued to a group of 16 cores, 16 load / store units or 4 SFUs.
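The grouping of threads into warps can be sketched with a little arithmetic. This is only an illustration of the 32-threads-per-warp rule stated above. The helper names are hypothetical, and the assumption that consecutive thread indices fall into the same warp is mine, not something this post states.

```python
WARP_SIZE = 32  # threads per warp, as stated above


def warps_per_block(threads_per_block: int) -> int:
    """Number of warps needed to hold a thread block (ceiling division)."""
    return -(-threads_per_block // WARP_SIZE)


def warp_and_lane(thread_index: int) -> tuple[int, int]:
    """Warp number and position inside that warp for a linear thread index.

    Assumes consecutive thread indices are packed into the same warp.
    """
    return divmod(thread_index, WARP_SIZE)


print(warps_per_block(256))  # 8
print(warps_per_block(100))  # 4, the last warp is only partly filled
print(warp_and_lane(70))     # (2, 6)
```

A block whose size is not a multiple of 32 still occupies a whole number of warps, so block sizes that are multiples of 32 avoid a partly empty warp.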

The streaming multiprocessor’s special function units (SFU) are used to execute special computations such as sine, cosine, reciprocal and square root. The SFU is decoupled from the dispatch unit. This decoupling means that the dispatch unit can still dispatch instructions for other execution units while the SFU is busy.

One of the biggest selling points of the new Fermi architecture is its implementation of a true cache. As stated earlier, previous GPU architectures didn’t have a true L1 cache. Instead, these architectures had something called “Shared Memory”. This was fine for graphics needs, but since Nvidia is aiming to improve its GPU computing market share, it needed to implement a true L1 cache, which some GPU computing applications often need. Nvidia included 64KB of memory per SM that is configurable as shared memory and L1 cache. To handle both graphics and GPU computing needs, the programmer can explicitly state how much of this 64KB acts as shared memory and how much acts as L1 cache. The current options are either 16KB of L1 cache and 48KB of shared memory, or vice versa. This lets Fermi keep supporting existing applications that make use of shared memory while allowing new applications to be written to make use of the new L1 cache.

For a long time there had been a huge gap between geometry and shader performance. From the GeForce FX to the GT200, shader performance increased by a factor of 150, while geometry performance only tripled. This was a huge problem that would bottleneck the GPU’s performance. It happened because the hardware that handles a key part of the setup engine had not been parallelized. Nvidia’s solution was to introduce something called a PolyMorph (geometry) Engine. The engine handles a host of pre-rasterization stages, like vertex and hull shaders, tessellation, and domain and geometry shaders. Each SM contains its own dedicated PolyMorph Engine, which helps overcome these bottlenecks by parallelizing the different units inside the PolyMorph Engine.

The SM also contains 4 separate texturing units. These units are responsible for rotating and resizing a bitmap so it can be placed onto an arbitrary plane of a given 3D object as a texture.

Fermi Memory Hierarchy

In addition to the configurable 64KB memory contained in each SM, Fermi contains a unified L2 cache and DRAM. The 768KB unified L2 cache services all load, store and texture requests. Fermi also contains 6 memory controllers, which allow it to support up to 6GB of GDDR5 memory. There can be several memory configurations, supporting 1.5GB, 3GB or 6GB depending on the field the card will run in. It is important to mention that all types of memory, from registers to cache to DRAM, are ECC protected.
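The memory numbers fit together as simple arithmetic. The sketch below uses the six 64 bit controllers from the overview list and assumes, purely for illustration, that the total memory is spread evenly across the controllers. The helper name is hypothetical.

```python
CONTROLLERS = 6
CONTROLLER_WIDTH_BITS = 64

# Six 64 bit controllers give the 384 bit memory interface.
interface_width_bits = CONTROLLERS * CONTROLLER_WIDTH_BITS
print(interface_width_bits)  # 384


def per_controller_mb(total_gb: float) -> float:
    """Memory behind each controller, assuming an even split (an assumption)."""
    return total_gb * 1024 / CONTROLLERS


for total_gb in (1.5, 3, 6):
    print(total_gb, "GB total ->", per_controller_mb(total_gb), "MB per controller")
```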

The Unified Address Space

Fermi unified the address space across the three different memory spaces (thread private local, block shared and global). In the previous architecture, load and store operations were specific to each type of memory space. This posed a problem for GPU computing applications, as it made the programmer’s task of managing these memory spaces, each with its own type of instruction, more complex if not impossible. Fermi puts all three address ranges into a single, contiguous address space and therefore uses a unified instruction to access all of them. The unified address space uses 40 bit addressing, allowing a terabyte of memory to be addressed, with support for 64 bit addressing if needed.
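As a quick check of the terabyte figure, 40 address bits select one of 2^40 locations. The sketch assumes byte addressing, which this post does not state explicitly.

```python
ADDRESS_BITS = 40

# Assuming each address names one byte.
addressable_bytes = 2 ** ADDRESS_BITS

print(addressable_bytes)           # 1099511627776
print(addressable_bytes // 2**40)  # 1, i.e. one terabyte (2**40 bytes)
```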

The GigaThread Scheduler

The Nvidia Fermi architecture makes use of two levels of thread scheduling, each with a different scope. At the chip level there is a global scheduler which schedules thread blocks to the various SMs. This global scheduler is called the GigaThread Thread Scheduler. At a lower level, inside each SM, there are two warp schedulers which schedule the individual threads inside a warp / thread block. The GigaThread Scheduler handles a huge number of threads in real-time and also offers other improvements like faster context switching between GPU computing applications and graphics applications, concurrent kernel execution and improved thread block scheduling.


ROP stands for raster operators. The raster operator is the last step of the graphics pipeline, which writes the textured / shaded pixels to the frame buffer. ROPs handle several chores towards the end of the graphics pipeline, like anti-aliasing, Z and colour compression and of course the writing of the pixels to the output buffer. Fermi contains 48 ROPs, which are placed in a circle surrounding the L2 cache.

Nvidia Nexus

Nvidia Nexus is a development environment designed by Nvidia to facilitate programming massively parallel CUDA C, OpenCL and DirectCompute applications for the Nvidia Fermi cards. The environment is designed to be part of the Microsoft Visual Studio IDE. Nexus allows writing and debugging GPU source code in an easy way, similar to the one used to develop normal CPU applications. It also allows developing co-processing applications which make use of both the CPU and the GPU.

Presentation Slides and Report

Nvidia Fermi Presentation Slides

Nvidia Fermi Report


I was interested in the Intel Core i-series processors and their architecture, so I decided to read about them and find out what’s new. Then I decided to post about it because it may be useful for some people. I will first talk about the history of Intel’s processors to get closer to Intel’s strategy, then about the Nehalem architecture and the new features that the whole i-series depends on, and at the end about the different i-series editions and products.

Intel made its first processor in 1971. It was called the Intel 4004, a 4-bit processor with a speed of 740 kHz. In 1978, Intel introduced the 16-bit 8086 processor, which had a speed of 5 MHz. A later version of the 8086 was used by IBM to build the first personal computer. This was followed by the Intel 486, a 32-bit processor with a speed of 16 MHz. During this time, several improvements in technology were made. For instance, processors could run in both real mode and protected mode, which introduced the concept of multitasking. Power-saving features, such as the System Management Mode (SMM), meant that the computer could power down various components. Computers finally went from command-line interaction to WIMP (Window, Icon, Menu, Pointing device) interaction.

In 1993, Intel introduced the Pentium processor, which had a starting speed of 60 MHz. This was followed by the Pentium II at 233 MHz, the Pentium III at 450 MHz, and the Pentium 4 at 1.3 GHz. Intel also brought out the Celeron processor, which had a starting speed of 266 MHz. In 2003, Intel inaugurated the Pentium M processor, which ushered in a new era of mobile computing under the Centrino platform. The Pentium M was slower, at 900 MHz, so that energy consumption was reduced and the laptop battery lasted longer. In 2006, Intel introduced the Core processor, which had a starting speed of 1.6 GHz. It had more than one core, as in the case of the Core Duo.

While Intel is the leading company in the manufacturing of processors, there are other companies such as AMD that make processors too. In 1991, AMD brought out the Am386 processor, with a starting speed of 40 MHz. It was compatible with the Intel 386 processor. In 1999, AMD introduced the Athlon processor, which had a starting speed of 500 MHz. Athlon was a legitimate competitor to the Intel Pentium III because it was faster. As a matter of fact, the AMD Athlon was the first processor to reach a speed of 1 GHz. The future of the computer processor industry is promising, as processors will continue to get faster and cheaper. According to Moore’s Law as originally stated, the number of transistors on a chip doubled every year; in 1975 the rate was revised to a doubling every two years.

Between late 2008 and early 2010, Intel introduced 3 new Core processor lines: i3, i5 and i7. Here I will focus on these 3 processors.

Nehalem Architecture:

The most important new features in Nehalem Architecture are:

Intel Turbo Boost Technology:

It automatically allows active processor cores to run faster than the base operating frequency when there is available headroom within power, current, and temperature specification limits.

Intel Turbo Boost Technology is activated when the operating system requests the highest processor performance state. The maximum frequency of Intel Turbo Boost Technology is dependent on the number of active cores. The amount of time the processor spends in the Intel Turbo Boost Technology state depends on the workload and operating environment.

Any of the following can set the upper limit of Intel Turbo Boost Technology on a given workload:

• Number of active cores

• Estimated current consumption

• Estimated power consumption

• Processor temperature

The number of active cores at any given instant affects the upper limit of Intel Turbo Boost Technology. For example, a particular processor may allow up to two frequency steps (266.66 MHz) when just one core is active and one frequency step (133.33 MHz) when two or more cores are active. The upper limits are further constrained by temperature, power, and current. These constraints are managed as a simple closed-loop control system. If measured temperature, power, and current are all below factory-configured limits, and the operating system (OS) is requesting maximum processor performance, the processor automatically steps up core frequency until it reaches the upper limit dictated by the number of active cores. When temperature, power, or current exceed factory-configured limits the processor automatically steps down core frequency in order to reduce temperature, power, and current. The processor then monitors temperature, power, and current, and continuously re-evaluates.
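The closed-loop behaviour described above can be sketched as one evaluation step of a controller. This is a simplified model under stated assumptions, not Intel’s algorithm: the step policy copies the example in the text (two steps with one active core, one step otherwise), dropping back to the base frequency when the OS stops requesting maximum performance is my simplification, and the limit values and all names are hypothetical.

```python
from dataclasses import dataclass

STEP_MHZ = 133.33  # one frequency step, as quoted above


@dataclass
class Readings:
    temperature_c: float
    power_w: float
    current_a: float


def max_turbo_steps(active_cores: int) -> int:
    """Example policy from the text: 2 steps for one active core, else 1."""
    return 2 if active_cores == 1 else 1


def next_turbo_steps(steps: int, active_cores: int, os_wants_max: bool,
                     measured: Readings, limits: Readings) -> int:
    """One re-evaluation of the control loop; returns the new step count."""
    over_limit = (measured.temperature_c > limits.temperature_c
                  or measured.power_w > limits.power_w
                  or measured.current_a > limits.current_a)
    if over_limit:
        return max(steps - 1, 0)          # step down to shed heat and power
    if os_wants_max:
        return min(steps + 1, max_turbo_steps(active_cores))  # step up to the cap
    return 0                               # assumed: no turbo unless requested


limits = Readings(temperature_c=90, power_w=130, current_a=100)  # hypothetical
cool = Readings(temperature_c=60, power_w=80, current_a=50)
hot = Readings(temperature_c=95, power_w=80, current_a=50)

steps = 0
for reading in (cool, cool, cool, hot, cool):
    steps = next_turbo_steps(steps, active_cores=1, os_wants_max=True,
                             measured=reading, limits=limits)
    print(steps, "steps above base, +", round(steps * STEP_MHZ, 2), "MHz")
```

With one active core the loop climbs to two steps, backs off by one when the temperature reading exceeds its limit, and climbs again once the reading is back under it.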

Intel Hyper Threading Technology:

Most multi-core processors execute one thread per processor core. Nehalem enables simultaneous multi-threading within each processor core: up to two threads per core, or eight threads per quad-core processor, so eight software threads can be processed simultaneously.

Hyper-threading reduces computational latency, making optimal use of every clock cycle. For example, while one thread is waiting for a result or event, another thread is executing in that core to maximize the work from each clock cycle. An Intel® processor and chipset combined with an operating system and system firmware supporting Intel Hyper-Threading Technology enables:

• Running demanding applications simultaneously while maintaining system responsiveness

• Running multi-threaded applications faster to maximize productivity and performance

• Increasing the number of transactions that can be processed simultaneously

• Providing headroom for new solution capabilities and future needs

Other Key Performance Improvements:

Intel Smart Cache Enhancements: Nehalem enhances the Intel Smart Cache by adding an inclusive shared L3 cache that can be up to eight megabytes (MB) in size. In addition to being shared across all cores, the inclusive shared L3 cache can increase performance while reducing traffic to the processor cores. Some architectures use an exclusive L3 cache, which contains data not stored in other caches. Thus, if a data request misses in the L3 cache, each processor core must still be searched (or snooped) in case its individual caches contain the requested data. This can increase latency and snoop traffic between cores. With Intel micro architecture (Nehalem), a miss in the inclusive shared L3 cache guarantees that the data is outside the processor, so unnecessary core snoops are eliminated, which reduces latency and improves performance.

The three-level cache hierarchy for Intel micro architecture (Nehalem) consists of:

• Same L1 cache as Intel Core micro architecture (32 KB Instruction Cache, 32 KB Data Cache)

• New L2 cache per core for very low latency (256 KB per core for handling data and instruction)

• New fully inclusive, fully shared 8 MB L3 cache (all applications can use entire cache)

Then comes an enormous Level 3 cache memory (8 MB) for managing communications between cores. While at first glance Nehalem’s cache hierarchy reminds one of Barcelona, the operation of the Level 3 cache is very different from AMD’s—it’s inclusive of all lower levels of the cache hierarchy. That means that if a core tries to access a data item and it’s not present in the Level 3 cache, there’s no need to look in the other cores’ private caches—the data item won’t be there either. Conversely, if the data are present, four bits associated with each line of the cache memory (one bit per core) show whether or not the data are potentially present (potentially, but not with certainty) in the lower-level cache of another core, and which one.
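The snoop filtering that the inclusive L3 makes possible can be sketched as a lookup. This is a toy model, not Nehalem’s implementation: the L3 is represented as a dictionary from a cache-line address to a 4-bit mask with one bit per core, and the function name and addresses are hypothetical.

```python
NUM_CORES = 4


def cores_to_snoop(l3: dict[int, int], line_addr: int, requester: int) -> list[int]:
    """Cores whose private caches may hold the line and must be checked.

    `l3` maps a cache-line address to a bitmask with one bit per core.
    """
    if line_addr not in l3:
        # Inclusive cache: a miss in L3 means no core holds the line.
        return []
    mask = l3[line_addr]
    return [core for core in range(NUM_CORES)
            if (mask >> core) & 1 and core != requester]


# Line 0x40 may be present in the private caches of cores 0 and 2.
l3 = {0x40: 0b0101}

print(cores_to_snoop(l3, 0x80, requester=0))  # [] (L3 miss, go to memory)
print(cores_to_snoop(l3, 0x40, requester=0))  # [2]
print(cores_to_snoop(l3, 0x40, requester=1))  # [0, 2]
```

Because the bits only say a line is potentially present, a real core that is snooped may still answer that it no longer holds the line.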


This technique is effective for ensuring the coherency of the private caches because it limits the need for exchanges between cores. It has the disadvantage of wasting part of the cache memory with data that is already in other cache levels. That’s somewhat mitigated, however, by the fact that the L1 and L2 caches are relatively small compared to the L3 cache—all the data in the L1 and L2 caches takes up a maximum of 1.25 MB out of the 8 MB available. As on Barcelona, the Level 3 cache doesn’t operate at the same frequency as the rest of the chip. Consequently, latency of access to this level is variable, but it should be in the neighborhood of 40 cycles.
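The 1.25 MB figure follows directly from the cache sizes listed earlier, assuming a quad-core part:

```python
CORES = 4
L1_KB = 32 + 32   # instruction cache + data cache per core
L2_KB = 256       # per core
L3_KB = 8 * 1024  # shared

private_kb = CORES * (L1_KB + L2_KB)

print(private_kb)          # 1280
print(private_kb / 1024)   # 1.25 (MB duplicated in the L3 at most)
print(private_kb / L3_KB)  # 0.15625, about 16% of the L3
```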

Intel SSE4.2: Intel micro architecture (Nehalem) adds seven new Application Targeted Accelerators for more efficient string and text processing in applications like XML parsers. Take a line of XML code as an example. Using traditional Intel architecture instructions, you would have to examine characters one at a time to determine whether each is a name character, a white space character or metadata, a process that required 129 state transitions to complete the parsing task. By using the new equal any and equal range operations to compare 16 bytes at once, you can quickly identify continuous blocks of name characters and isolated special characters with a single instruction, cutting the state transitions required from 129 to 21.
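To show what the equal any and equal range comparisons produce, here is a software emulation over one 16 byte block. It is only a sketch: the hardware does each comparison in a single instruction, while this Python loop visits the bytes one by one, and the function names, the sample XML fragment and the choice of character classes are hypothetical.

```python
def equal_any(block: bytes, charset: bytes) -> int:
    """Bitmask with bit i set when block[i] is one of the bytes in charset."""
    mask = 0
    for i, byte in enumerate(block[:16]):
        if byte in charset:
            mask |= 1 << i
    return mask


def equal_range(block: bytes, ranges: list[tuple[int, int]]) -> int:
    """Bitmask with bit i set when block[i] falls inside any (low, high) range."""
    mask = 0
    for i, byte in enumerate(block[:16]):
        if any(low <= byte <= high for low, high in ranges):
            mask |= 1 << i
    return mask


BLOCK = b'<item id="42"/> '  # exactly 16 bytes
LETTERS = [(ord("a"), ord("z")), (ord("A"), ord("Z"))]

name_mask = equal_range(BLOCK, LETTERS)
special_mask = equal_any(BLOCK, b'<>="/')

# Bit 0 is the first byte of the block, so the printed strings read right to left.
print(f"{name_mask:016b}")     # runs of set bits mark blocks of name characters
print(f"{special_mask:016b}")  # isolated set bits mark special characters
```

A parser can then jump over a whole run of name characters at once instead of taking one state transition per character.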

Loop Stream Detector: Loops are common in every type of application. Nehalem contains a loop stream detector to optimize performance and energy efficiency. The detector first identifies a repetitive instruction sequence. Once a loop is detected, the traditional branch prediction, fetch and decode stages are bypassed and powered off while the loop executes. Nehalem’s detector identifies more loops than the one in previous processors.

Instructions per Cycle: The more instructions that can be run each clock cycle, the greater the performance.

In order to achieve this Intel introduced the following:

# Greater Parallelism: increase the amount of instructions that can be run “out of order.” To be able to identify more independent operations that can be run in parallel, Intel increased the size of the out-of-order window and scheduler, giving them a wider window from which to look for these operations. Intel also increased the size of the other buffers in the core to ensure they wouldn’t become a limiting factor.

# More Efficient Algorithms: Intel has included improved algorithms in places where previous processor generations saw lost performance due to stalls (dead cycles).

This includes:

1- Faster Synchronization Primitives: As multi-threaded software becomes more prevalent, the need to synchronize threads is also becoming more common. Intel micro architecture (Nehalem) speeds up the common synchronization primitives (such as instructions with a LOCK prefix or the XCHG instruction) so that existing threaded software will see a performance boost.

2- Faster Handling of Branch Mispredictions:

A common way to increase performance is through the prediction of branches. Intel micro architecture (Nehalem) optimizes the cases where the predictions are wrong, so that the effective penalty of branch mispredictions overall is lower than on prior processors.

3- Improved Hardware Prefetch and Better Load-Store Scheduling:

Intel micro architecture (Nehalem) continues the many advances Intel made with the Intel Core micro architecture (Penryn) family of processors in reducing memory access latencies through prefetch and load-store scheduling improvements.

Enhanced Branch Prediction: Branch prediction attempts to guess whether a conditional branch will be taken or not. Branch predictors are crucial in today’s processors for achieving high performance. They allow processors to fetch and execute instructions without waiting for a branch to be resolved. Processors also use branch target prediction to attempt to guess the target of the branch or unconditional jump before it is computed by parsing the instruction itself. In addition to greater performance, a benefit of increased branch prediction accuracy is that the processor can consume less energy by spending less time executing mispredicted branch paths. Intel micro architecture (Nehalem) uses several innovations to reduce the branch mispredictions that can hinder performance and to improve the handling of the mispredictions that remain.

• New Second-Level Branch Target Buffer (BTB): To improve branch predictions in applications that have large code footprints (e.g., database applications), Intel added a second-level branch target buffer. The second-level BTB is slower, but looks at a much larger history of branches and whether or not they were taken. The inclusion of this L2 branch predictor enables applications with very large code sizes to enjoy improved branch prediction accuracy.

• New Renamed Return Stack Buffer (RSB): The renamed return stack buffer is also a very important enhancement to Nehalem. Mispredicts in the pipeline can result in incorrect data being populated into the return stack (a data structure that keeps track of where in memory the CPU should begin executing after working on a function). A return stack with renaming support prevents corruption in the stack, so as long as the calls/returns are properly paired you’ll always get the right data out of Nehalem’s stack even in the event of a mispredict.

Intel Quick Path Technology:

This new scalable, shared memory architecture delivers memory bandwidth leadership at up to 3.5 times the bandwidth of previous-generation processors. Intel Quick Path Technology is a platform architecture that provides high-speed (up to 25.6 GB/s), point-to-point connections between processors, and between processors and the I/O hub. Each processor has its own dedicated memory that it accesses directly through an Integrated Memory Controller. In cases where a processor needs to access the dedicated memory of another processor, it can do so through a high-speed Intel Quick Path Interconnect that links all the processors. Intel micro architecture (Nehalem) complements the benefits of Intel Quick Path Interconnect by enhancing Intel Smart Cache with an inclusive shared L3 cache that boosts performance while reducing traffic to the processor cores.

Intel Quick Path Interconnect Performance:

  • Intel Quick Path Interconnect’s throughput clearly demonstrates its best-in-class interconnect performance in the server/workstation market segment.
  • Intel Quick Path Interconnect uses up to 6.4 Giga transfers / second links, delivering up to 25.6 Gigabytes/second (GB/s) of total bandwidth. That is up to 300 percent greater than any other interconnect solution used previously.
  • Intel Quick Path Interconnect’s superior architecture reduces the amount of communication required in the interface of multi-processor systems to deliver faster payloads.
  • Intel Quick Path Interconnect Implicit Cyclic Redundancy Check (CRC) with link-level retry ensures data quality and performance by providing CRC without the performance penalty of additional cycles.

Intel Intelligent Power Technology:

Intel Intelligent Power Technology is an innovation that monitors power consumption in servers to identify those that are not being fully utilized. It has two main features:

• Integrated Power Gates allow individual idling cores to be reduced to near-zero power independent of other operating cores, reducing idle power consumption to 10 watts, versus 16 or 50 watts in prior generations of Intel quad-core processors.

For example, if you are using a Core i7 with 4 cores and the game you are running uses only a single core, the other three cores will turn off, reducing the heat produced by your processor and allowing the one running core to be automatically overclocked for higher performance. This new technology may be a compelling reason for many to no longer choose a faster clocked dual core processor over a slower quad core, as the quad core can now offer equal single threaded performance at the same price.

• Automated Low-Power States automatically put processor and memory into the lowest available power states that will meet the requirement of the current workload. Because processors are enhanced with more and lower CPU power states, and the memory and I/O controllers have new power management features, the degree to which power can be minimized is now greatly enhanced.

Differences between i5 and i7:

First, there’s the link between the LGA 1156 processor and the PCH. PCH stands for Platform Controller Hub, the new name Intel gave to the chipset, and it is connected to the CPU via DMI, which is essentially how all Intel ICH south bridges had connected to the Northbridge. The DMI bus is rather narrow and apparently delivers only 2GB/s, or 1GB/s each way. Since the beginning, Intel CPUs have used an external bus called the Front Side Bus, or simply FSB, that is shared between memory and I/O requests.

The old FSB architecture is better known: the Northbridge connected to the processor through a wide enough FSB and had the memory controller attached to it. Typical FSBs for Core 2 range from 1066MT/s to 1333MT/s on a 64 bit wide bus. This translates to a one-way bandwidth of 8.5GB/s, or 10.6GB/s in the case of the 1333MT/s FSB.
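The FSB figures come from multiplying the transfer rate by the bus width. A minimal sketch, with a hypothetical helper name:

```python
def bus_bandwidth_gb_s(mega_transfers_per_s: float, width_bits: int) -> float:
    """Peak bandwidth in GB/s for a bus moving width_bits per transfer."""
    bytes_per_transfer = width_bits / 8
    return mega_transfers_per_s * 1e6 * bytes_per_transfer / 1e9


print(round(bus_bandwidth_gb_s(1066, 64), 3))  # 8.528
print(round(bus_bandwidth_gb_s(1333, 64), 3))  # 10.664
```

The second value is usually quoted truncated as 10.6GB/s.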

Nehalem-based Intel CPUs have an embedded memory controller and thus provide two external busses: a memory bus connecting the CPU to the memory and an I/O bus connecting the CPU to the external world. The I/O bus is a new bus called Quick Path Interconnect (QPI).

Each lane transfers 20 bits at a time. Of these 20 bits, 16 are used for data and the remaining 4 are used for a correction code called CRC (Cyclic Redundancy Check), which allows the receiver to check whether the received data is intact. The first version of the Quick Path Interconnect works with a clock rate of 3.2 GHz and transfers two data items per clock cycle, a technique called DDR (Double Data Rate), making the bus work as if it were using a 6.4 GHz clock rate (Intel uses the GT/s unit, which means giga transfers per second, to represent this). Since 16 bits are transmitted at a time, we have a maximum theoretical transfer rate of 12.8 GB/s on each lane (6.4 GHz x 16 bits / 8).
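The same calculation, written out step by step with the numbers from the paragraph above (the variable names are mine):

```python
CLOCK_GHZ = 3.2          # QPI clock rate
TRANSFERS_PER_CLOCK = 2  # double data rate
DATA_BITS = 16           # data bits out of the 20 sent per transfer
CRC_BITS = 4

transfer_rate_gt_s = CLOCK_GHZ * TRANSFERS_PER_CLOCK  # giga transfers per second
one_way_gb_s = transfer_rate_gt_s * DATA_BITS / 8     # bytes carry 8 bits
both_ways_gb_s = 2 * one_way_gb_s                     # one lane per direction

print(transfer_rate_gt_s)  # 6.4
print(one_way_gb_s)        # 12.8
print(both_ways_gb_s)      # 25.6
```

The 25.6 GB/s total is the figure Intel quotes for a QPI link, counting both directions.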

So compared to the front side bus, QuickPath Interconnect transmits fewer bits per clock cycle but works at a far higher clock rate. Currently the fastest front side bus available on Intel processors runs at 1,600 MHz (actually 400 MHz transferring four data items per clock cycle, so QuickPath Interconnect works with a base clock eight times higher), meaning a maximum theoretical transfer rate of 12.8 GB/s, the same as QuickPath. QPI, however, offers 12.8 GB/s in each direction, while a 1,600 MHz front side bus provides this bandwidth for read and write operations combined, and both cannot be executed at the same time on the FSB, a limitation not present on QPI. Also, since the front side bus transfers both memory and I/O requests, there is always more data being transferred on this bus compared to QPI, which carries only I/O requests. So QPI will be “less busy” and thus have more bandwidth available.

QuickPath Interconnect is also faster than HyperTransport. The maximum transfer rate of HyperTransport technology is 10.4 GB/s (which is already slower than QuickPath Interconnect), but current Phenom processors use a lower transfer rate of 7.2 GB/s. So Intel Core i7 CPU will have an external bus 78% faster than the one used on AMD Phenom processors. Other CPUs from AMD like Athlon (formerly known as Athlon 64) and Athlon X2 (formerly known as Athlon 64 X2) use an even lower transfer rate, 4 GB/s – QPI is 220% faster than that.
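The percentages can be checked from the transfer rates quoted above. The helper name is hypothetical.

```python
def percent_faster(new_gb_s: float, old_gb_s: float) -> float:
    """How much faster new is than old, as a percentage of old."""
    return (new_gb_s / old_gb_s - 1) * 100


QPI_GB_S = 12.8

print(round(percent_faster(QPI_GB_S, 7.2)))  # 78  (Phenom's HyperTransport link)
print(round(percent_faster(QPI_GB_S, 4.0)))  # 220 (Athlon and Athlon X2)
```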

There are obviously physical differences between Lynnfield and Bloomfield. Due to the new Lynnfield Core i5/i7 design changes, a new P55 mother board and LGA 1156 socket were designed to support it. In simpler terms, the Lynnfield is about a quarter inch smaller in size compared to the Bloomfield Core i7. Just make sure when you’re shopping for a Core i7 that you pay attention to the processor socket so you order the right motherboard to accompany it.


We can actually see the differences between both cores. While both architectures are quad-core, Bloomfield (above) utilizes dual QPI and an integrated triple channel memory controller, whereas Lynnfield (below) supports a dual channel memory controller.


I7-900 and i7-800 processors:

A good example of how the Nehalem micro architecture enables the scaling of energy efficiency and performance can be seen in the Intel Core i7 family. Intel launched the Core i7-900 under the code name Bloomfield in late 2008, then launched the Core i7-800 under the code name Lynnfield in 2009. Both of them are based on the Nehalem micro architecture.

The Lynnfield architecture is quite similar to Bloomfield; after all, both belong to the Nehalem family and are produced in 45 nanometer technology. In Lynnfield, the Front Side Bus known from the Core 2 is replaced by DMI. But this connection (DMI) runs slower than the QPI of the Bloomfield, and it can’t interact with other processors on the motherboard, so the Lynnfield is definitely not the right processor for multi socket systems. Furthermore, the integrated memory controller of the Lynnfield supports dual channel only. Besides that, there are no significant differences. The three cache levels are as big as those of the Bloomfield: an L1 cache of 32 KB instruction cache and 32 KB data cache, a very low latency L2 cache of 256 KB per core for handling data and instructions, and a fully inclusive, fully shared 8 MB L3 cache.

I7-900 features VS i7-800 features:

I7-900 series editions:

I7-900 Power Consumption:

Bandwidth and latency of i7:

Generally speaking, the faster the processor, the higher the system wide bandwidth and the lower the latency. As is always the case, faster is better when it comes to processors, as we’ll see below. But with Core i7, the game changes up a bit.

Integer and float point operations bandwidth:

Memory Latency:

In terms of latency, not much has changed, even with the move to an integrated memory controller.

Multi-core efficiency:

How fast can one core swap data with another? It might not seem that important, but it definitely is if you are dealing with a true multi-threaded application. The faster data can be swapped around, the faster the work is going to be finished, so inter-core speeds are important in every regard. Even without looking at the data, we know that Core i7 is going to excel here, for a few different reasons. The main one is the fact that this is Intel’s first native quad-core. Rather than have two dual-core dies placed beside each other, i7 was built to place four cores together, which in itself improves things. Past that, the ultra-fast QPI bus likely also has something to do with the speed increases.

As we expected, Core i7 can swap data between its cores much faster than previous processors, and also manages to cut down significantly on latency. This is another feature to thank HyperThreading for, because without it, believe it or not, the bandwidth and latencies are actually a bit worse, clock-for-clock, as we’ll see soon.

In conclusion,

Nehalem is about improving HPC (high-performance computing), database, and virtualization performance, and much less about gaming performance. Nehalem is only a small step forward in integer performance, and the gains from the slightly increased integer performance are mostly negated by the new cache system, because most games really like the huge L2 of the Core family. With Nehalem they get a 32 KB L1 with a 4-cycle latency, then a very small 256 KB L2 cache with a 12-cycle latency, and after that a pretty slow 40-cycle 8 MB L3. Penryn, by comparison, uses a 3-cycle L1 and a 14-cycle 6144 KB L2. The Penryn L2 is 24 times larger than Nehalem's.

The percentage of L2 cache misses for most games running on a Penryn CPU is extremely low. That is going to change. The integrated memory controller of Nehalem will help somewhat, but the fact remains that the L3 is slow and the L2 is small. However, that doesn't mean Intel made a bad choice, because Nehalem wasn't made for gaming; it was made to please the IT and HPC people.
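To make the cache argument concrete, here is a rough sketch that asks which cache level is the first one large enough to hold a given working set, using the sizes and latencies quoted above. It is a deliberately simplified model: it ignores cache sharing between cores, associativity, and prefetching. The 32 KB Penryn L1 size is an assumption, since the text gives only its latency, and the function name is made up for illustration.

```python
# Cache levels as (name, size in KB, latency in cycles), smallest first.
# Figures are the ones quoted in the text; the 32 KB Penryn L1 size is an
# assumption because the text only gives its 3-cycle latency.
NEHALEM = [("L1", 32, 4), ("L2", 256, 12), ("L3", 8 * 1024, 40)]
PENRYN = [("L1", 32, 3), ("L2", 6144, 14)]


def first_level_that_fits(hierarchy, working_set_kb):
    """Return (name, latency) of the smallest cache level that can hold the
    working set, or ("memory", None) if no cache level is large enough."""
    for name, size_kb, latency in hierarchy:
        if working_set_kb <= size_kb:
            return name, latency
    return "memory", None


if __name__ == "__main__":
    for working_set_kb in (16, 200, 1024, 4096):
        nehalem = first_level_that_fits(NEHALEM, working_set_kb)
        penryn = first_level_that_fits(PENRYN, working_set_kb)
        print(f"{working_set_kb:>5} KB  Nehalem: {nehalem}  Penryn: {penryn}")
```

Under this model, a working set of about 1 MB sits in the 14-cycle L2 on Penryn but falls through to the 40-cycle L3 on Nehalem, which is the effect described for games.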


For more information about Intel's QuickPath technology, please visit Intel's website to see the demo.

ATI Radeon

Nowadays, there are a lot of graphics cards on the market. In the past it did not matter much which card or GPU to get, because applications were not demanding. Today, however, cards are more advanced and capable of providing a better graphical experience. They have evolved from the low-color 2D graphics of the past to the highly detailed 3D graphics of the present.

ATI produces a series of cards called Radeon. This modern series followed an earlier series called Rage. The Canadian company was able to establish itself in the market by providing some of the best-known graphics cards. ATI GPUs can be found in gaming consoles such as the Wii and Xbox, and also in laptops.

Therefore, it has become necessary to know how to choose a graphics card. The factors by which a graphics card should be judged or selected are:

  • The use: simple tasks like web surfing and document writing, or more advanced ones like 3D animation and gaming.
  • The price: Does the price of the card suit the user's budget? A high price does not necessarily mean the card is good.
  • Core Clock: the speed at which the graphics processor on the card operates.
  • Stream Processors: the stream processors are responsible for rendering. The more stream processors a card has, the more work it can process in parallel. Stream processors are also called shader cores or thread processors.
  • FLOPS: the number of floating-point operations per second. There are single-precision (32-bit) operations and double-precision (64-bit) operations.
  • Memory:
  1. Memory Latency: the delay until the processor can access the memory. The problem is that in many cases the processor is faster than the memory.
  2. Bus width: the number of bits that can be transferred to or from memory at once.
  3. Memory Clock: the clock speed at which the memory operates.
  4. Type: DDR, DDR2, GDDR3, GDDR4, and GDDR5.
  5. Memory bandwidth: the amount of data the card can transfer to or from memory per second. It is the width of the memory bus multiplied by the memory clock speed and by the number of transfers per clock cycle.
  • Power Consumption.
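The memory bandwidth factor in the list above can be worked out with a few lines of code. The sketch below uses the Cypress figures quoted later in this article (256-bit bus, 2400 MHz memory clock). The function name is hypothetical, and the value of two transfers per clock cycle is an assumption, chosen because it reproduces the 153.6 GB/sec figure quoted for that card.

```python
def memory_bandwidth_gb_per_s(bus_width_bits, memory_clock_mhz, transfers_per_clock=2):
    """Estimate peak memory bandwidth in GB/s (1 GB = 10**9 bytes).

    transfers_per_clock=2 is an assumption (two transfers per clock cycle).
    """
    bytes_per_transfer = bus_width_bits / 8
    transfers_per_second = memory_clock_mhz * 1e6 * transfers_per_clock
    return bytes_per_transfer * transfers_per_second / 1e9


if __name__ == "__main__":
    # 256-bit bus and 2400 MHz memory clock, as quoted for the Cypress.
    print(memory_bandwidth_gb_per_s(256, 2400))
```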

ATI Radeon’s Evergreen series:

The Canadian company introduced this series in 2009.

Products            Code name   Examples
HD 5400             Cedar       Radeon HD 5450
HD 5500, HD 5600    Redwood     Radeon HD 5670
HD 5700             Juniper     Radeon HD 5770
HD 5800             Cypress     Radeon HD 5870
HD 5900             Hemlock     Radeon HD 5970

The architectures of the cards are related to each other and this can be viewed in the next figure.

Figure: The Evergreen series architectures

Therefore, a description of the components of one architecture largely applies to the others. The Hemlock architecture is closely related to the Cypress: a Hemlock can be described as containing two Cypress GPUs.

The architecture that will be addressed in the rest of the document is the Cypress architecture.

Figure: The architecture of the Cypress

The Cypress consists of:

  • Command processor: issues commands and passes them to the graphics engine to be translated into simpler forms.
  • Graphics engine: responsible for converting polygons and meshes into a simpler form of data, namely pixels.
  • Ultra-threaded dispatch processor: maximizes utilization and efficiency by dividing the workload among the processing engines. For example, if an engine can only handle 4×4 pixel blocks and the frame size is 16×16 pixels, the frame is divided into (16×16)/(4×4) = 16 blocks that are distributed among the engines.
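The dispatch example can be sketched in code: cut the frame into fixed-size blocks, then hand the blocks out to the engines. The round-robin policy and the function names here are assumptions made for illustration; the actual scheduling policy of the dispatch processor is not described in this article.

```python
def split_into_blocks(frame_w, frame_h, block_w, block_h):
    """Return the top-left (x, y) corner of every block covering the frame."""
    return [
        (x, y)
        for y in range(0, frame_h, block_h)
        for x in range(0, frame_w, block_w)
    ]


def assign_round_robin(blocks, num_engines):
    """Distribute blocks over engines in turn (assumed policy, for illustration)."""
    assignment = {engine: [] for engine in range(num_engines)}
    for index, block in enumerate(blocks):
        assignment[index % num_engines].append(block)
    return assignment


if __name__ == "__main__":
    blocks = split_into_blocks(16, 16, 4, 4)
    print(len(blocks))  # number of 4x4 blocks in a 16x16 frame
    for engine, work in assign_round_robin(blocks, 4).items():
        print(engine, work)
```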

SIMD engines

Figure: SIMD engines


SIMD stands for "single instruction, multiple data". It applies here because many components can process multiple data items using the same operation.

Each SIMD engine contains 16 stream units and 4 texturing units. Each stream unit consists of five 32-bit stream processors. There exist 20 SIMD engines in the Cypress.

So the total number of stream processors is 16 × 5 × 20 = 1600, and the number of texturing units is 4 × 20 = 80.

  • Caches:

Each SIMD engine has its own 8 KB L1 cache, so the total size of the L1 cache is 8 × 20 = 160 KB. Each of the 8 KB caches stores unique data for its SIMD engine.

L1 cache bandwidth is 1 TB/sec while bandwidth between L1 and L2 is 435 GB/sec.

  • Stream Core:

Each stream core has five processors in total: four of them perform single-precision computations, while the fifth is used for special functions such as sin, cos, tan, and exponential.

The stream processors are 32-bit. Together they can deliver up to 2.7 teraflops, but for 64-bit operations the throughput drops to 544 gigaflops.
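The unit counts and throughput figures above can be tied together with a short calculation. This is only a sketch: the 850 MHz core clock, the two floating-point operations per processor per clock (one multiply-add), and the one-fifth double-precision ratio are assumptions that are not stated in this article. They were picked because they reproduce the quoted 2.7 teraflops and 544 gigaflops.

```python
# Unit counts quoted in the text for the Cypress.
SIMD_ENGINES = 20
STREAM_UNITS_PER_ENGINE = 16
PROCESSORS_PER_STREAM_UNIT = 5
TEXTURE_UNITS_PER_ENGINE = 4

# Assumptions, not stated in the text:
CORE_CLOCK_HZ = 850e6              # assumed core clock
FLOPS_PER_PROCESSOR_PER_CLOCK = 2  # assumed: a multiply-add counts as 2 operations
DOUBLE_TO_SINGLE_RATIO = 1 / 5     # assumed: matches the quoted 544 gigaflops


def total_stream_processors():
    return SIMD_ENGINES * STREAM_UNITS_PER_ENGINE * PROCESSORS_PER_STREAM_UNIT


def total_texture_units():
    return SIMD_ENGINES * TEXTURE_UNITS_PER_ENGINE


def peak_single_precision_tflops():
    ops_per_second = (
        total_stream_processors() * FLOPS_PER_PROCESSOR_PER_CLOCK * CORE_CLOCK_HZ
    )
    return ops_per_second / 1e12


def peak_double_precision_tflops():
    return peak_single_precision_tflops() * DOUBLE_TO_SINGLE_RATIO


if __name__ == "__main__":
    print(total_stream_processors(), total_texture_units())
    print(peak_single_precision_tflops(), peak_double_precision_tflops())
```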

Figure: The stream core

  • Memory:
  1. 256-bit bus width.
  2. 1 GB GDDR5
  3. Memory clock speed: 2400 MHz
  4. 153.6GB/sec Bandwidth
  • Price: The price of the HD 5870 is $410.

So the cost of one teraflop for single-precision operations is 410/2.7 ≈ $151.9, and for double-precision operations it is 410/0.544 ≈ $753.7. As expected, double precision is far more expensive.

  • Power Consumption:

ATI aimed to reduce power consumption with this series. The Cypress consumes 27 Watts when idle and, in the worst case, a maximum of 188 Watts. The maximum power is reached when the user pushes the card to its highest performance by overclocking; it does not occur for users who use the card only for simple purposes.

The power consumed to provide one teraflop, in the worst case (at maximum power):

                 Single precision        Double precision
Maximum power    188/2.7 = 69.6 Watts    188/0.544 = 345.6 Watts
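The cost and power figures per teraflop are simple ratios, so they are easy to recompute or to repeat for another card. The sketch below uses only the numbers quoted above ($410, 188 Watts, 2.7 and 0.544 teraflops); the helper name is made up for illustration.

```python
PRICE_USD = 410.0
MAX_POWER_WATTS = 188.0
SINGLE_PRECISION_TFLOPS = 2.7
DOUBLE_PRECISION_TFLOPS = 0.544


def per_teraflop(quantity, tflops):
    """Divide a quantity (dollars or watts) by throughput in teraflops."""
    return quantity / tflops


if __name__ == "__main__":
    for label, tflops in (
        ("single", SINGLE_PRECISION_TFLOPS),
        ("double", DOUBLE_PRECISION_TFLOPS),
    ):
        cost = per_teraflop(PRICE_USD, tflops)
        power = per_teraflop(MAX_POWER_WATTS, tflops)
        print(f"{label} precision: {cost:.1f} $/teraflop, {power:.1f} W/teraflop")
```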

CrossFireX technology:

CrossFireX enables the user to install up to four ATI Radeon cards on the same motherboard. The Evergreen series cards support CrossFireX.



The technology must also be supported by the motherboard. The next figure illustrates the combinations of cards and the motherboards that can provide this technology.

Figure: Compatibility chart for the combination of the cards

The technology operates using one of the following modes:

  • Scissors: A single frame is split between two cards. If the cards are identical, each card processes half of the screen. If the two cards are different, the frame is divided according to the capabilities of each card: the faster card renders a larger portion than the slower card, so that both cards finish at the same time.
  • SuperTiling: The frame is divided into tiles like a chessboard. For example, with a 4×4 grid of tiles, one card renders tiles 1, 3, 5, and so on up to 15, and the other renders tiles 2, 4, 6, and so on up to 16.
  • Alternate Frame Rendering: While one card is processing the present frame, another card is processing the next (future) frame.
  • Super AA: AA stands for anti-aliasing. This mode uses the cards to apply anti-aliasing, which increases image quality.
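The first three modes above are different ways of dividing rendering work, which a small sketch can illustrate. Everything here is an assumption made for illustration: the function names are hypothetical, the Scissors split is taken to be horizontal bands sized in proportion to a relative speed score, and the SuperTiling pattern uses a row-plus-column parity rule to get a chessboard, which may number the tiles differently than the text does. It is not AMD's actual driver logic.

```python
def scissors_split(frame_height, card_speeds):
    """Split the frame's rows into one band per card, sized in proportion to
    each card's relative speed. Returns a list of (start_row, end_row) pairs."""
    total = sum(card_speeds)
    bands = []
    start = 0
    cumulative = 0
    for index, speed in enumerate(card_speeds):
        cumulative += speed
        if index == len(card_speeds) - 1:
            end = frame_height  # last band always reaches the bottom row
        else:
            end = round(frame_height * cumulative / total)
        bands.append((start, end))
        start = end
    return bands


def supertiling_owner(row, col, num_cards=2):
    """Chessboard-style tile ownership: neighbouring tiles go to different cards."""
    return (row + col) % num_cards


def alternate_frame_owner(frame_index, num_cards=2):
    """Alternate Frame Rendering: consecutive frames go to consecutive cards."""
    return frame_index % num_cards


if __name__ == "__main__":
    print(scissors_split(1080, [1, 1]))  # identical cards
    print(scissors_split(1080, [2, 1]))  # first card twice as fast
    for row in range(4):
        print([supertiling_owner(row, col) for col in range(4)])
    print([alternate_frame_owner(i) for i in range(6)])
```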

CrossFireX is very similar to NVIDIA SLI. The cards produced by both companies are highly similar in components and architecture, which makes it easy to compare them and decide which card to purchase.

The next series after Evergreen is called Northern Islands and will be available on the market at the end of 2010 or in 2011. The major difference is that Northern Islands is expected to use a 32 nm fabrication process, while Evergreen uses 40 nm.

Northern Islands will exceed Evergreen in performance, but with CrossFireX a user can combine more than one Evergreen card. This helps keep performance competitive without upgrading to a new card, though not for long.