Multi-core processors

I’ll attend AMD’s Fusion Developer Summit in Bellevue, WA, from June 13th to 16th. Stay tuned for my related blog posts and my tweets during the event.

You can follow my Twitter Account:

Also, feel free to send me your questions about it.

If you are taking computer architecture classes, studying electronics, or doing research related to microprocessors, you may have heard warnings about the end of Moore’s law: microprocessors will soon be unable to double their performance every 18 months, because of physical limits on making transistors smaller while keeping their power consumption reasonable.

Microprocessors have depended on three main factors to keep Moore’s law in effect: (1) reducing transistor size, so that more transistors fit in the same area, allowing more sophisticated execution logic and more cache; (2) increasing transistor switching frequency, to execute more instructions per second; and (3) economics of manufacturing, to keep each new generation of microprocessors affordable. Right now it is difficult to cram in more transistors due to the current limits of lithography. Also, as transistors get smaller and operate at higher frequencies, their power consumption increases at a greater rate than their performance. Finally, manufacturing cost is increasing astronomically as we move from one generation to the next.

I think Moore’s law may not survive in its current form, but the pattern may keep going through different means. Here is my stab at it:

Reconsidering execution pipelines to achieve shorter latency per instruction.

Increasing the number of cores, which increases overall throughput. This is possible by making better use of the total number of transistors that can fit on one chip, and it is workable because the pipelines can be shallower.

Homogeneity of the instruction set and heterogeneity of implementations. For example, a multi-core processor may have 32 cores with the same arithmetic and logical instructions, but only two or four of them implementing the system control instructions, such as protected-mode and interrupt-handling instructions. Applications may not need radical rewriting in this case; in fact, we could automate the process of migrating traditional multi-threaded applications to this new heterogeneous architecture.

I was interested in the Core i-series processors and their architecture, so I decided to read about them, learn what’s new, and post about it because it may be useful to some people. In this post I will first go over Intel’s processor history to get closer to Intel’s strategy, then talk about the Nehalem architecture and the new features on which the whole i-series depends, and at the end discuss the different i-series editions and products.

The first processor was made by Intel in 1971. It was called the Intel 4004; it was a 4-bit processor running at 740 kHz. In 1978, Intel introduced the 16-bit 8086 processor, which ran at 5 MHz. A later variant of the 8086 was used to build the first IBM personal computer. This was followed by the Intel 486, a 32-bit processor with a speed of 16 MHz. During this time, several improvements in technology were made. For instance, processors could run in both real mode and protected mode, which introduced the concept of multitasking. Power-saving features, such as System Management Mode (SMM), meant that the computer could power down various components. Computers finally went from command-line interaction to WIMP (Window, Icon, Menu, Pointing device) interaction.

In 1993, Intel introduced the Pentium processor with a starting speed of 60 MHz. This was followed by the Pentium II with a starting speed of 233 MHz, the Pentium III with a starting speed of 450 MHz, and the Pentium 4 with a starting speed of 1.3 GHz. Intel also brought out the Celeron processor, which had a starting speed of 266 MHz. In 2003, Intel inaugurated the Pentium M processor, which ushered in a new era of mobile computing under the Centrino platform. The Pentium M is slower, starting at 900 MHz, so that energy consumption is reduced and the laptop’s battery lasts longer. In 2006, Intel introduced the Core processor with a starting speed of 1.6 GHz. It has more than one core, as in the case of the Core Duo.

While Intel is the leading processor manufacturer, other companies such as AMD make processors too. In 1991, AMD brought out the Am386 processor, with a starting speed of 40 MHz; it was compatible with the Intel 386. In 1999, AMD introduced the Athlon processor with a starting speed of 500 MHz. The Athlon was a legitimate competitor to the Pentium III because it was faster; as a matter of fact, the AMD Athlon was the first processor to reach a speed of 1 GHz. The future of the processor industry is promising, as processors will continue to get faster and cheaper. According to Moore’s law, the number of transistors on a chip doubled every year, and from 1975 on, it has doubled roughly every two years.

Between 2009 and 2010, Intel introduced three new Core processors: the i3, i5, and i7. Here I will focus on these three processors.

Nehalem Architecture:

The most important new features in Nehalem Architecture are:

Intel Turbo Boost Technology:

It automatically allows active processor cores to run faster than the base operating frequency when there is available headroom within power, current, and temperature specification limits.

Intel Turbo Boost Technology is activated when the operating system requests the highest processor performance state. The maximum frequency of Intel Turbo Boost Technology is dependent on the number of active cores. The amount of time the processor spends in the Intel Turbo Boost Technology state depends on the workload and operating environment.

Any of the following can set the upper limit of Intel Turbo Boost Technology on a given workload:

• Number of active cores

• Estimated current consumption

• Estimated power consumption

• Processor temperature

The number of active cores at any given instant affects the upper limit of Intel Turbo Boost Technology. For example, a particular processor may allow up to two frequency steps (266.66 MHz) when just one core is active and one frequency step (133.33 MHz) when two or more cores are active. The upper limits are further constrained by temperature, power, and current. These constraints are managed as a simple closed-loop control system. If measured temperature, power, and current are all below factory-configured limits, and the operating system (OS) is requesting maximum processor performance, the processor automatically steps up core frequency until it reaches the upper limit dictated by the number of active cores. When temperature, power, or current exceed factory-configured limits the processor automatically steps down core frequency in order to reduce temperature, power, and current. The processor then monitors temperature, power, and current, and continuously re-evaluates.
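To make the frequency-step rule concrete, here is a tiny Python sketch of the core-count limit described above. The base frequency and step size are assumed values for illustration; real parts fuse these limits at the factory and further cap them by power, current, and temperature.

```python
# Sketch of the turbo upper-limit rule described above (illustrative only;
# real limits are factory-configured and also gated by power, current,
# and temperature).
BASE_MHZ = 2666.0   # assumed base frequency for this example
BCLK_MHZ = 133.33   # one frequency step = one bus-clock bin

def turbo_limit_mhz(active_cores: int) -> float:
    """Upper turbo frequency: two bins with one active core, one bin otherwise."""
    steps = 2 if active_cores == 1 else 1
    return BASE_MHZ + steps * BCLK_MHZ

print(turbo_limit_mhz(1))  # about 2932.66 MHz (two steps)
print(turbo_limit_mhz(4))  # about 2799.33 MHz (one step)
```

The closed-loop behavior described above would then repeatedly compare measured temperature, power, and current against limits and step the frequency up or down toward this cap.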

Intel Hyper Threading Technology:

Most multi-core processors execute one thread per processor core. Nehalem enables simultaneous multi-threading within each processor core: up to two threads per core, or eight threads per quad-core processor, so eight software threads can be processed simultaneously.

Hyper-threading reduces computational latency, making optimal use of every clock cycle. For example, while one thread is waiting for a result or event, another thread is executing in that core to maximize the work from each clock cycle. An Intel® processor and chipset combined with an operating system and system firmware supporting Intel Hyper-Threading Technology enables:

• Running demanding applications simultaneously while maintaining system responsiveness

• Running multi-threaded applications faster to maximize productivity and performance

• Increasing the number of transactions that can be processed simultaneously

• Providing headroom for new solution capabilities and future needs
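Hyper-threading pays off because a stalled thread would otherwise leave the core idle. The following Python sketch imitates that latency hiding with two OS threads, where a `sleep` stands in for a long stall; it is only an analogy for SMT, since real hardware interleaves the two threads inside one core.

```python
import threading
import time

# Illustration of latency hiding: while one thread waits (sleep stands in
# for a memory/I/O stall), another thread makes progress in the meantime.
results = []

def waiter():
    time.sleep(0.2)              # the "stalled" thread
    results.append("waiter")

def worker():
    total = sum(range(100_000))  # useful work overlapping the stall
    results.append(total)

t0 = time.perf_counter()
threads = [threading.Thread(target=waiter), threading.Thread(target=worker)]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.perf_counter() - t0
print(f"both finished in {elapsed:.2f}s")  # roughly the cost of one wait
```

The same principle applies per clock cycle in hardware: the second thread fills execution slots the first thread cannot use.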

Other Key Performance Improvements:

Intel Smart Cache Enhancements:

Nehalem enhances the Intel Smart Cache by adding an inclusive shared L3 cache that can be up to eight megabytes (MB) in size. In addition to being shared across all cores, the inclusive shared L3 cache can increase performance while reducing traffic to the processor cores. Some architectures use an exclusive L3 cache, which contains only data not stored in the other caches; thus, if a data request misses in the L3 cache, each processor core must still be searched (snooped) in case its individual caches contain the requested data. This can increase latency and snoop traffic between cores. With Intel micro architecture (Nehalem), a miss in the inclusive shared L3 cache guarantees the data is outside the processor, which is designed to eliminate unnecessary core snoops, reducing latency and improving performance.

The three-level cache hierarchy for Intel micro architecture (Nehalem) consists of:

• Same L1 cache as Intel Core micro architecture (32 KB Instruction Cache, 32 KB Data Cache)

• New L2 cache per core for very low latency (256 KB per core for handling data and instruction)

• New fully inclusive, fully shared 8 MB L3 cache (all applications can use entire cache)

Then comes an enormous Level 3 cache memory (8 MB) for managing communications between cores. While at first glance Nehalem’s cache hierarchy reminds one of Barcelona, the operation of the Level 3 cache is very different from AMD’s—it’s inclusive of all lower levels of the cache hierarchy. That means that if a core tries to access a data item and it’s not present in the Level 3 cache, there’s no need to look in the other cores’ private caches—the data item won’t be there either. Conversely, if the data are present, four bits associated with each line of the cache memory (one bit per core) show whether or not the data are potentially present (potentially, but not with certainty) in the lower-level cache of another core, and which one.


This technique is effective for ensuring the coherency of the private caches because it limits the need for exchanges between cores. It has the disadvantage of wasting part of the cache memory with data that is already in other cache levels. That’s somewhat mitigated, however, by the fact that the L1 and L2 caches are relatively small compared to the L3 cache—all the data in the L1 and L2 caches takes up a maximum of 1.25 MB out of the 8 MB available. As on Barcelona, the Level 3 cache doesn’t operate at the same frequency as the rest of the chip. Consequently, latency of access to this level is variable, but it should be in the neighborhood of 40 cycles.
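The 1.25 MB duplication figure quoted above is easy to verify. A quick back-of-the-envelope check in Python, assuming the quad-core configuration from the bullet list:

```python
# Checking the duplication figure quoted above: on a quad-core Nehalem the
# inclusive L3 re-holds, at most, the contents of every core's L1 and L2.
CORES = 4
L1_KB = 32 + 32      # instruction + data cache per core
L2_KB = 256          # unified L2 per core
L3_KB = 8 * 1024     # shared inclusive L3

duplicated_kb = CORES * (L1_KB + L2_KB)
print(duplicated_kb / 1024, "MB duplicated of", L3_KB / 1024, "MB")
# 1.25 MB of 8.0 MB, i.e. under 16% of the L3 is "wasted" on duplication
```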

Intel SSE4.2:

Intel micro architecture (Nehalem) adds seven new Application Targeted Accelerators for more efficient, accelerated string and text processing in applications like XML parsing. Take a line of XML code as an example: using traditional Intel architecture instructions, you would have to identify characters one at a time to determine whether each is a name character, a white-space character, or metadata, a process that required 129 state transitions to complete the parsing task. By using the new "equal any" and "equal range" operations to compare 16 bytes at once, you can quickly identify continuous blocks of name characters and isolated special characters with a single instruction, cutting the required state transitions from 129 to 21.
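The "equal any" idea can be sketched in software. The Python function below is a stand-in for the SSE4.2 behavior: it classifies a 16-byte chunk against a character set in one call and returns a bit mask. The name-character set and the sample input are illustrative assumptions, not the real XML grammar.

```python
# Pure-Python stand-in for an SSE4.2 "equal any" compare: classify a 16-byte
# chunk against a character set in one operation, instead of one character
# at a time. (Illustrative only; the real instruction does this in hardware.)
NAME_CHARS = frozenset(b"abcdefghijklmnopqrstuvwxyz"
                       b"ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_-.")

def equal_any(chunk: bytes) -> int:
    """Return a 16-bit mask: bit i is set if chunk[i] is a name character."""
    mask = 0
    for i, byte in enumerate(chunk):
        if byte in NAME_CHARS:
            mask |= 1 << i
    return mask

xml = b"<server timeout"          # assumed sample input
mask = equal_any(xml[:16])
print(f"{mask:016b}")             # 0111111101111110: runs of name characters
```

A parser can then find whole runs of name characters from the mask at once, which is where the reduction in state transitions comes from.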

Loop Stream Detector:

Loops appear in every type of application. Nehalem contains a Loop Stream Detector to optimize performance and energy efficiency. The detector first identifies a repetitive instruction sequence; once one is detected, the traditional branch prediction, fetch, and decode stages are bypassed and powered off while the loop executes. Nehalem's detector also identifies more loops than its predecessors did.
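A toy version of the idea can be sketched in Python: scan a trace of decoded instructions for a short sequence that repeats back-to-back, which is the pattern a loop stream detector would lock onto. This is purely illustrative; the real detector works on fetched micro-ops with strict size limits.

```python
# Toy loop detection: find the first short block that immediately repeats
# in an instruction trace (the signature of a loop body).
def find_repeating_block(trace, max_len=8):
    for length in range(1, max_len + 1):
        for start in range(len(trace) - 2 * length + 1):
            if trace[start:start + length] == trace[start + length:start + 2 * length]:
                return trace[start:start + length]
    return None

# Assumed sample trace: a loop body "add, cmp, jne" executed three times.
trace = ["mov", "add", "cmp", "jne", "add", "cmp", "jne",
         "add", "cmp", "jne", "ret"]
print(find_repeating_block(trace))  # ['add', 'cmp', 'jne']
```

Once such a block is recognized, the hardware can replay it directly instead of re-fetching and re-decoding it each iteration.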

Instructions per Cycle:

The more instructions that can be run each clock cycle, the greater the performance.

In order to achieve this Intel introduced the following:

# Greater Parallelism: increase the number of instructions that can be run “out of order.” To identify more independent operations that can run in parallel, Intel increased the size of the out-of-order window and scheduler, giving them a wider window in which to look for these operations. Intel also increased the sizes of the other buffers in the core to ensure they wouldn’t become a limiting factor.

# More Efficient Algorithms: Intel has included improved algorithms in places where previous processor generations saw lost performance due to stalls (dead cycles).

This includes:

1- Faster Synchronization Primitives: As multi-threaded software becomes more prevalent, the need to synchronize threads is also becoming more common. Intel micro architecture (Nehalem) speeds up the common synchronization primitives (such as instructions with a LOCK prefix or the XCHG instruction) so that existing threaded software will see a performance boost.

2- Faster Handling of Branch Mispredictions:

A common way to increase performance is through branch prediction. Intel micro architecture (Nehalem) optimizes the cases where predictions are wrong, so that the effective penalty of branch mispredictions is lower overall than on prior processors.

3- Improved Hardware Prefetch and Better Load-Store Scheduling:

Intel micro architecture (Nehalem) continues the many advances Intel made with the Intel Core micro architecture (Penryn) family of processors in reducing memory access latencies through prefetch and load-store scheduling improvements.
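The synchronization primitives mentioned in point 1 can be illustrated at the software level. The Python sketch below guards a shared counter with a lock, the high-level analogue of a LOCK-prefixed increment: four threads each add 10,000 and the total comes out exact because the read-modify-write is serialized.

```python
import threading

# A shared counter guarded by a lock: the software-level analogue of the
# atomic (LOCK-prefixed) read-modify-write that Nehalem speeds up.
counter = 0
lock = threading.Lock()

def increment(n: int) -> None:
    global counter
    for _ in range(n):
        with lock:          # serializes the read-modify-write
            counter += 1

threads = [threading.Thread(target=increment, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 40000: no increments are lost
```

Every `with lock` here is the kind of hot, contended operation whose hardware cost Nehalem reduces.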

Enhanced Branch Prediction:

Branch prediction attempts to guess whether a conditional branch will be taken or not. Branch predictors are crucial in today’s processors for achieving high performance: they allow processors to fetch and execute instructions without waiting for a branch to be resolved. Processors also use branch target prediction to guess the target of a branch or unconditional jump before it is computed, by parsing the instruction itself. In addition to greater performance, a benefit of increased branch prediction accuracy is that the processor can consume less energy by spending less time executing mispredicted branch paths. Intel micro architecture (Nehalem) uses several innovations to reduce the branch mispredictions that can hinder performance and to improve how mispredictions are handled.

• New Second-Level Branch Target Buffer (BTB): To improve branch prediction in applications that have large code footprints (e.g., database applications), Intel added a second-level branch target buffer. The L2 BTB is slower, but it looks at a much larger history of branches and whether or not they were taken, so applications with very large code sizes enjoy improved branch prediction accuracy.

• New Renamed Return Stack Buffer (RSB): The renamed return stack buffer is also a very important enhancement to Nehalem. Mispredicts in the pipeline can result in incorrect data being populated into the return stack (a data structure that keeps track of where in memory the CPU should begin executing after working on a function). A return stack with renaming support prevents corruption in the stack, so as long as the calls/returns are properly paired you’ll always get the right data out of Nehalem’s stack even in the event of a mispredict.
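As an illustration of how a predictor learns from branch history, here is the classic two-bit saturating-counter (bimodal) scheme in Python. This is a textbook mechanism, not Nehalem's actual predictor, which is far more sophisticated.

```python
# Classic two-bit saturating-counter branch predictor (illustrative only):
# states 0-1 predict not-taken, states 2-3 predict taken. One misprediction
# does not flip a strongly-held prediction.
class TwoBitPredictor:
    def __init__(self) -> None:
        self.state = 2  # start "weakly taken"

    def predict(self) -> bool:
        return self.state >= 2

    def update(self, taken: bool) -> None:
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

# A typical loop branch: taken nine times, then not-taken once at loop exit.
p = TwoBitPredictor()
outcomes = [True] * 9 + [False]
hits = 0
for taken in outcomes:
    hits += p.predict() == taken
    p.update(taken)
print(f"{hits}/{len(outcomes)} correct")  # 9/10: only the loop exit misses
```

The single miss at loop exit is exactly the penalty case that Nehalem's faster misprediction handling is designed to cheapen.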

Intel Quick Path Technology:

This new scalable, shared memory architecture delivers memory bandwidth leadership at up to 3.5 times the bandwidth of previous-generation processors. Intel Quick Path Technology is a platform architecture that provides high-speed (up to 25.6 GB/s), point-to-point connections between processors, and between processors and the I/O hub. Each processor has its own dedicated memory that it accesses directly through an Integrated Memory Controller. In cases where a processor needs to access the dedicated memory of another processor, it can do so through a high-speed Intel Quick Path Interconnect that links all the processors. Intel micro architecture (Nehalem) complements the benefits of Intel Quick Path Interconnect by enhancing Intel Smart Cache with an inclusive shared L3 cache that boosts performance while reducing traffic to the processor cores.

Intel Quick Path Interconnect Performance:

  • Intel Quick Path Interconnect’s throughput clearly demonstrates its best-in-class interconnect performance in the server/workstation market segment.
  • Intel Quick Path Interconnect uses links running at up to 6.4 gigatransfers per second (GT/s), delivering up to 25.6 gigabytes per second (GB/s) of total bandwidth. That is up to 300 percent greater than any interconnect solution used previously.
  • Intel Quick Path Interconnect’s superior architecture reduces the amount of communication required in the interface of multi-processor systems to deliver faster payloads.
  • Intel Quick Path Interconnect’s implicit Cyclic Redundancy Check (CRC) with link-level retry ensures data quality without the performance penalty of additional cycles.

Intel Intelligent Power Technology:

Intel Intelligent Power Technology is an innovation that monitors power consumption in servers to identify those that are not being fully utilized. It has two main features:

• Integrated Power Gates allow individual idling cores to be reduced to near-zero power independently of other operating cores, reducing idle power consumption to 10 watts, versus 16 or 50 watts in prior generations of Intel quad-core processors.

In the following scenario, for example, if you are using a Core i7 with four cores and the game you are running uses only a single core, the other three cores will turn off, reducing the heat produced by your processor and allowing the one running core to be automatically overclocked for higher performance. This new technology may be a compelling reason for many to no longer choose a faster-clocked dual-core processor over a slower quad-core, as the quad-core can now offer equal single-threaded performance at the same price.

• Automated Low-Power States automatically put the processor and memory into the lowest available power states that meet the requirements of the current workload. Because the processors have more, and lower, CPU power states, and the memory and I/O controllers have new power management features, the degree to which power can be minimized is now greatly enhanced.

Differences between i5 and i7:

First, there’s the LGA 1156 package and its connection to the PCH, the new name Intel gave to the chipset. PCH stands for Platform Controller Hub, and it is connected to the CPU via DMI, essentially the same link Intel’s ICH south bridges had always used to connect to the north bridge. The DMI bus is rather narrow, delivering only 2 GB/s, or 1 GB/s in each direction. Since the beginning, Intel CPUs had used an external bus called the Front Side Bus (FSB), shared between memory and I/O requests.

The old FSB architecture is well known: the north bridge connected to the processor through a wide FSB and had the memory controller attached to it. Typical FSBs for Core 2 ranged from 1066 MT/s to 1333 MT/s on a 64-bit-wide bus. This translates to a one-way bandwidth of 8.5 GB/s, or 10.6 GB/s in the case of a 1333 MT/s (often written 1333 MHz) FSB.

Nehalem-based Intel CPUs have an integrated memory controller and thus provide two external buses: a memory bus connecting the CPU to memory, and an I/O bus connecting the CPU to the outside world. The latter is a new bus called the QuickPath Interconnect (QPI).

Each link transfers 20 bits at a time. Of these 20 bits, 16 carry data and the remaining 4 carry a correction code called CRC (Cyclic Redundancy Check), which allows the receiver to check that the received data is intact. The first version of the QuickPath Interconnect works at a clock rate of 3.2 GHz, transferring two data items per clock cycle, a technique called Double Data Rate (DDR), making the bus work as if it used a 6.4 GHz clock (Intel uses the unit GT/s, gigatransfers per second, to represent this). Since 16 bits are transmitted at a time, we have a maximum theoretical transfer rate of 12.8 GB/s in each direction (6.4 GT/s × 16 bits / 8).

So, compared to the front side bus, the QuickPath Interconnect transmits fewer bits per clock cycle but works at a far higher clock rate. Currently the fastest front side bus available on Intel processors is 1,600 MHz (actually 400 MHz transferring four data items per clock cycle, so the QuickPath Interconnect works with a base clock eight times higher), meaning a maximum theoretical transfer rate of 12.8 GB/s, the same as QuickPath. QPI, however, offers 12.8 GB/s in each direction, while a 1,600 MHz front side bus shares that bandwidth between read and write operations, which cannot be executed at the same time on the FSB, a limitation not present on QPI. Also, since the front side bus carries both memory and I/O requests, there is always more data moving on it than on QPI, which carries only I/O requests. So QPI runs “less busy” and has more bandwidth available.
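The bandwidth arithmetic above is easy to reproduce. The short Python snippet below recomputes both figures: QPI's per-direction rate from its transfer rate and data width, and the quad-pumped FSB's shared rate.

```python
# Reproducing the bandwidth arithmetic from the text.
def qpi_gbps(gt_per_s: float = 6.4, data_bits: int = 16) -> float:
    """QPI: 6.4 GT/s x 16 data bits / 8 bits-per-byte, per direction."""
    return gt_per_s * data_bits / 8

def fsb_gbps(mhz: int = 400, transfers_per_clock: int = 4, bus_bytes: int = 8) -> float:
    """Quad-pumped 1,600 MT/s FSB: 400 MHz x 4 transfers x 8 bytes, shared."""
    return mhz * transfers_per_clock * bus_bytes / 1000

print(qpi_gbps())  # 12.8 GB/s in each direction
print(fsb_gbps())  # 12.8 GB/s shared between reads and writes
```

The headline numbers match, which is exactly why the comparison hinges on duplex operation and traffic mix rather than raw peak rate.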

The QuickPath Interconnect is also faster than HyperTransport. The maximum transfer rate of HyperTransport technology is 10.4 GB/s (already slower than QuickPath), but current Phenom processors use a lower transfer rate of 7.2 GB/s, so the Intel Core i7 has an external bus 78% faster than the one used on AMD Phenom processors. Other AMD CPUs, like the Athlon (formerly known as Athlon 64) and Athlon X2 (formerly known as Athlon 64 X2), use an even lower transfer rate of 4 GB/s; QPI is 220% faster than that.

There are obviously physical differences between Lynnfield and Bloomfield. Due to the Lynnfield Core i5/i7 design changes, a new P55 motherboard chipset and LGA 1156 socket were designed to support it. In simpler terms, the Lynnfield package is about a quarter inch smaller than the Bloomfield Core i7’s. Just make sure when you’re shopping for a Core i7 that you pay attention to the processor socket so you order the right motherboard to accompany it.


We can actually see the differences between the two dies. While both are quad-core architectures, Bloomfield (above) utilizes dual QPI and an integrated triple-channel memory controller, whereas Lynnfield (below) supports a dual-channel memory controller.


i7-900 and i7-800 processors:

A good example of how the Nehalem microarchitecture enables the scaling of energy efficiency and performance can be seen in the Intel Core i7 family. In 2008, Intel launched the Core i7-900 processors under the code name Bloomfield, and in 2009 it launched the Core i7-800 processors under the code name Lynnfield. Both are based on the Nehalem microarchitecture.

The Lynnfield architecture is quite similar to Bloomfield; after all, both belong to the Nehalem family and are produced with 45-nanometer technology. In Lynnfield, the Front Side Bus known from the Core 2 is replaced by DMI. This connection runs slower than Bloomfield’s QPI and cannot link to other processors on the motherboard, so Lynnfield is definitely not the right processor for multi-socket systems. Furthermore, Lynnfield’s integrated memory controller supports dual channel only. Besides that, there are no significant differences. The three cache levels are as big as Bloomfield’s: an L1 cache of 32 KB instructions plus 32 KB data per core, a very-low-latency 256 KB L2 cache per core for data and instructions, and the new fully inclusive, fully shared 8 MB L3 cache.

i7-900 features vs. i7-800 features:

i7-900 series editions:

i7-900 Power Consumption:

Bandwidth and latency of i7:

Generally speaking, the faster the processor, the higher the system wide bandwidth and the lower the latency. As is always the case, faster is better when it comes to processors, as we’ll see below. But with Core i7, the game changes up a bit.

Integer and floating-point operation bandwidth:

Memory Latency:

In terms of latency, not much has changed, even with the move to an integrated memory controller.

Multi-core efficiency:

How fast can one core swap data with another? It might not seem that important, but it definitely is if you are dealing with a truly multi-threaded application. The faster data can be swapped around, the faster the work gets finished, so inter-core speeds matter in every regard. Even without looking at the data, we know that Core i7 is going to excel here, for a few different reasons. The main one is the fact that this is Intel’s first native quad-core: rather than having two dual-core dies placed beside each other, the i7 was built with four cores on one die, and that in itself improves things. Past that, the ultra-fast QPI bus likely also has something to do with the speed increases.

As we expected, Core i7 can swap data between its cores much faster than previous processors, and also manages to cut down significantly on latency. This is another feature to thank HyperThreading for, because without it, believe it or not, the bandwidth and latencies are actually a bit worse, clock-for-clock, as we’ll see soon.

In conclusion,

Nehalem is about improving HPC (high-performance computing), database, and virtualization performance, and much less about gaming performance. Nehalem is only a small step forward in integer performance, and the gains from slightly higher integer performance are mostly negated by the new cache system, because most games really like the huge L2 of the Core family. With Nehalem they get a 32 KB L1 with a 4-cycle latency, next a very small 256 KB L2 cache with a 12-cycle latency, and after that a pretty slow 40-cycle 8 MB L3, compared to Penryn, which uses a 3-cycle L1 and a 14-cycle 6144 KB L2. The Penryn L2 is 24 times larger than Nehalem’s.

The percentage of L2 cache misses for most games running on a Penryn CPU is extremely low. That is going to change. The integrated memory controller of Nehalem will help some, but the fact remains that the L3 is slow and the L2 is small. However, that doesn’t mean Intel made a bad choice, because Nehalem wasn’t made for gaming; it was made to please the IT and HPC people.


For more information about Intel’s QuickPath Technology, please visit Intel’s website.

Multi- and many-core processors are here to stay for a really long time. They are microprocessor manufacturers’ response to the uni-core scalability walls. Although software communities explored parallel programming models heavily in the ’80s and ’90s, those efforts were directed at coarser-grained systems, mainly clusters, parallel machines, and Symmetric Multi-Processors (SMPs). Multi- and many-core architectures brought some classical problems back to the surface, such as memory latency, data synchronization, and thread management. They also introduced new problems of massively parallel systems with very fine-grained threading models, such as managing thousands of concurrent threads and handling inter-thread communication and data sharing. In this post I would like to pinpoint some of these challenges, drawing on my research and programming experience with multi-core architectures.

Maintaining the current growth rate of processing power requires microprocessor designers to introduce more processing cores per microprocessor. However, two important aspects must be considered as more cores are introduced. First, the number of cores will increase quickly enough to keep doubling speed every 18 months and keep Moore’s law in effect; hence, within five years we can expect many-core processors with tens or even hundreds of processing cores on the same chip. Second, as the number of cores increases, the cores will have simpler designs, achieve simpler tasks, and each will be faster than current single-core processors. Power and heat management will impose such design constraints on microprocessor manufacturers. These design choices will increase overall processor speed while maintaining reasonable power consumption and heat dissipation.

As a result, parallelism will become finer grained. Developers will parallelize their applications at a finer level to take full advantage of multi- and many-core advancements. This granularity will increase contention among threads for shared resources, whether a memory location (data) or an I/O device, and interdependencies among threads will increase. In addition, as cores get simpler and faster, more data will move back and forth between the processor and the system’s main memory, while the relative memory latency keeps increasing. Hardware-based techniques for hiding this latency, such as branch prediction and embedded cache-replacement algorithms, may not be efficient enough for parallel applications. Software-based cache management and execution scheduling are now vital to fully utilize multi-core processors. Finally, the programming complexity of multi-core processors and the inherent complexity of parallel applications require tools that reduce some of these complexities.

Memory Latency Wall

As processors and programs become more parallel, they will be more data hungry. On the other hand, the number of processor cycles needed to access the system’s main memory grew from a few cycles in 1980 to almost a thousand cycles today. Moreover, the cache-per-core ratio will continue to go down, which will make the memory latency problem worse if caches are not managed properly. Although there is great potential in DRAM-based memory to increase performance, the aggregate cycle growth rate of processors will continue to be faster; the processor-to-memory performance gap is expected to grow by 50% per year, according to some estimates. The good news is that the memory latency problem can be mitigated with efficient software-based scheduling of memory accesses. Multi-core processors are now returning some control back to software developers to manage each core’s cache. Such explicit cache management capabilities give programmers more room to maneuver around the processor-to-memory performance gap.
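Software-managed locality can be as simple as changing the traversal order. The Python sketch below shows loop tiling: the matrix is visited in small blocks so each block can stay cache-resident, while the result is identical to the straightforward traversal. The matrix and tile sizes here are arbitrary illustrative values.

```python
# Loop tiling (cache blocking): visit the matrix in TILE x TILE blocks so
# each block can stay resident in cache, instead of striding across whole
# rows per pass. The answer is identical; only the access order changes.
N, TILE = 8, 4
matrix = [[i * N + j for j in range(N)] for i in range(N)]

def sum_naive(m):
    return sum(m[i][j] for i in range(N) for j in range(N))

def sum_tiled(m):
    total = 0
    for ii in range(0, N, TILE):
        for jj in range(0, N, TILE):
            for i in range(ii, min(ii + TILE, N)):
                for j in range(jj, min(jj + TILE, N)):
                    total += m[i][j]
    return total

assert sum_naive(matrix) == sum_tiled(matrix)  # same answer, friendlier order
print(sum_tiled(matrix))  # 2016
```

In an interpreted language the benefit is not visible, but on real hardware with large matrices this reordering is one of the standard software answers to the latency wall.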

Data Synchronization

Using current synchronization mechanisms to coordinate the access of tens or hundreds of threads to one resource may put the application’s performance on the line. The whole system or application may suffer deadlock or starvation due to weak synchronization mechanisms. In the worst cases, current synchronization mechanisms serialize the application in areas that need access to shared resources. As the number of hardware threads in multi-core processors grows and parallelism in applications increases, the resulting performance loss grows as well. For example, implementing a parallel shared-counting algorithm requires each of the participating threads to lock the counter before incrementing it; in the worst case, each thread has to wait for n-1 threads before it can update the counter, where n is the number of threads. If the communicating cores are on the same die, the efficiency of data synchronization can be greatly enhanced by doing data communication through the available on-chip facilities, such as the core interconnect and shared cache. Current synchronization techniques instead use the system’s main memory to write and read shared data, which makes things even worse: memory latency is added on top of the delay of the synchronization algorithm itself.
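The contention in the shared-counter example above can be reduced by restructuring the communication: instead of n threads locking one counter on every increment, each thread counts privately and publishes its partial sum once. A minimal Python sketch of that idea:

```python
import threading

# Sharded counting: no shared state is touched in the hot loop, so threads
# never wait on each other; each thread publishes one partial result at the
# end, avoiding the per-increment lock of the naive shared counter.
N_THREADS, N_INCR = 4, 50_000

def count_sharded() -> int:
    partials = [0] * N_THREADS      # one slot per thread, no lock needed

    def work(tid: int) -> None:
        local = 0
        for _ in range(N_INCR):
            local += 1              # private, contention-free
        partials[tid] = local       # single publish per thread

    threads = [threading.Thread(target=work, args=(t,)) for t in range(N_THREADS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(partials)

print(count_sharded())  # 200000
```

On real hardware the same restructuring keeps the hot data in each core's private cache, which is exactly the on-chip communication the text argues for.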

Programming Complexity

Parallel computing is inherently complex, mainly due to the difficulty of design and the intricacy of resource sharing and synchronization. The presence of multi-core processors at every scale, from embedded systems to supercomputers, has made adapting applications to these new hardware platforms a critical issue. As multi-core processors gain more cores and parallelism becomes more fine-grained, complexity will increase as well. Instead of designing a parallel application with 10 or 20 concurrent threads, an application may be executing hundreds or thousands of threads on the same machine. A solution is required to help reduce the programming complexity while scaling well with the number of working threads.
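One practical way to keep thousands of logical tasks from becoming thousands of unmanaged threads is a fixed-size pool. A minimal Python sketch using the standard library's `ThreadPoolExecutor`; the task function and sizes are illustrative assumptions:

```python
from concurrent.futures import ThreadPoolExecutor

# Submit many small tasks to a fixed-size pool: the program expresses 1000
# units of work, but only 8 worker threads ever exist, decoupling logical
# parallelism from the thread count the machine must manage.
def task(i: int) -> int:
    return i * i  # stand-in for a unit of real work

with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(task, range(1000)))

print(len(results), results[-1])  # 1000 998001
```

Abstractions like this (and the frameworks mentioned below) are the kind of tooling that hides thread management while still scaling with the hardware.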

Actually, these challenges are the main inspirational pillars for most multi-core researchers, architects, and developers. All microprocessor manufacturers are after faster processors that do not increase programming complexity or cost developers the ability to make the best use of the new architectures. That’s why microprocessor manufacturers are now aggressively involved in programming models. Intel, for example, created Intel Parallel Studio (with the Ct framework) for its general-purpose multi-core microprocessors and for specialized ones as well, such as the Larrabee GPGPU. Likewise, ATI built the ATI Stream framework and NVIDIA built the CUDA framework to help developers make the best of these new microprocessors without getting into the nitty-gritty architectural details of these advanced GPGPUs.