The emergence of multi-core processors is changing the face of computing forever. Concurrency and parallelism in program execution will be the dominant speedup factor in both microprocessor design and application programming. Multi-core processors are now in all kinds of devices with compute capabilities, from embedded systems such as cell phones and TVs powered by Cell and ARM processors, to laptops and desktops powered by Intel and AMD multi-core processors, to supercomputers such as the Roadrunner, which runs thousands of processors, each with 4 to 9 cores working on the same chip. In addition, multi-core processors will not stay as we experience them nowadays. They will have tens, hundreds or even thousands of cores working concurrently on the same chip. Microprocessor designers have already hit the physical walls of speeding up single-core processors through higher frequencies, deeper pipelines, or more cache. To keep up with Moore’s law, adding more cores is now the key to doubling a processor’s computing speed every 18 months. So, expect to have more than 10 cores per processor two years from now.
Over the coming few weeks I would like to take you through a series of posts discussing the general aspects of multi-core and many-core processors. I will start by characterizing multi-core processors and going through the general architectural features that they have in common. This might help in understanding where multi-core processors may go. I will also quickly discuss three critical problems that many researchers in computer science and engineering are actively working on, along with possible solutions, so that you can direct your attention to the useful trends in such an exciting technology.

Please bear in mind that what I write in my blog is the result of my research and of my tracking of different technologies in the market and academia right now. These blog posts may become obsolete quickly, depending on how fast the technology changes. I will have done a good job if my expectations still hold true three or five years from now!

This week, I’ll discuss the general aspects of multi-core processors. I’ll try to explain the anatomy of multi-core processors based on their short history and current developments.

So, let’s get started ………
Characterizing Multi-Core Processors
Although all designers and microprocessor manufacturers realize that multi-core architecture is currently the best way to keep improving microprocessor performance, they do not all agree on how multi-core processors should be designed and utilized. The differences come from where these processors will be used and from the many possible ways to build such an architecture. Important architectural decisions, such as instruction sets, core interconnections, and cache sharing and size, are now more complex. This complexity gives more possibilities and more tradeoffs in microprocessor designs. This week I would like to pinpoint the major architectural decisions and their possible effects on the processor’s overall performance and quality.
I think the design decisions facing multi-core designers fall in these areas: homogeneity of the processor’s cores, cache sharing, cache size and hierarchy, cache management and control, embedded threading, and core interconnection.
Homogeneity and Complexity of Cores
The first intuition about multi-core processors is to have homogeneous cores inside the same processor. However, this is not always the best design. It depends on the domain that the new processor is targeting. If it is serving users with desktop-sized machines, the main requirement would be a faster processor able to serve multiple applications at the same time. In such a case you may need homogeneous cores, each serving an application independently. This is what you can typically find in the current multi-core processors from Intel, AMD, and Sun Microsystems. They would like to keep backward compatibility and achieve a good overall machine speedup. Although they scale better than single-core processors, they are more like SMP on a chip. The cores work independently most of the time, and the synchronization mechanisms among them are still limited. For example, the cores in Intel’s dual-core and quad-core processors are connected using a simple multiplexing mechanism, a crossbar interconnect, which assumes that these cores are not likely to communicate and exchange data most of the time. Also note that these cores inherit the complexity and deep pipelines of the single-core architectures. I think these companies wanted to get their multi-core processors to market quickly by packing the old designs into a single chip using the new, smaller manufacturing technology. When you take a closer look at the performance of each application, you will not notice any significant improvement. So with this type of homogeneous design built from older cores, you gain performance by being able to run more applications concurrently, but each application runs at almost the same speed. The advantage of this design decision is that all applications designed for single-core processors run on these multi-core processors with no change.
It is also easier for operating system designers to adapt to these changes, since they can treat these multi-core processors as SMP (Symmetric Multiprocessor) systems. Of course, they will improve process scheduling and inter-process communication, but the overall mechanism is the same.
The other alternative is a homogeneous multi-core processor with simpler core designs and possibly a different instruction set. For example, most GPUs, such as NVIDIA’s GeForce GTX 295, have many cores that are all identical and simpler in design. Intel is also launching the Larrabee microprocessor, which uses some of the older x86 instruction set with a simplified design, such as pipelines with fewer stages. These processors deliver a great performance improvement. Simplicity makes each core execute code faster. It also leaves space for more cores on the die, which boosts the overall processor’s performance. However, these microprocessors target performance improvement for applications designed to run on parallel architectures or with a multi-threaded software design. This basic requirement makes programming these processors more complex, since the programmer must understand the architecture and the best way to utilize the huge processing power available on the chip. So, these processors are designed to improve the performance of a single parallel application optimized to run more, but smaller, threads. I’ll discuss the programming complexity later on. In most cases operating systems take only coarse control over such processors: they manage them at the macro level, i.e. creating or deallocating tasks. Operating systems do not, so far, handle thread or resource scheduling on these processors; that is done by run-time libraries. These processors will give you a tremendous performance improvement. For example, the observed performance of NVIDIA’s GeForce GTX 295 is around 385 GFLOPS. To give a notion of how fast that is compared to traditional processors, a Xeon processor can give you up to 9 GFLOPS. They are now about the same price!
Heterogeneous multi-core processors provide a different way to design a high-performance multi-core architecture. Existing embedded multiprocessors, such as the Intel IXP network processing family, keep at least one general-purpose processor on the die to support various housekeeping functions, provide the hardware base for more general operating system support, and execute the serial part of parallel applications. Similarly, IBM’s Cell BE processor has one general-purpose core (the PPE) and eight tailored processing elements (the SPEs). Keeping a larger processor on chip may help accelerate “inherently sequential” code segments or workloads with fewer threads.

There are many points of comparison worth discussing between homogeneous and heterogeneous multi-core processors. Whether we use Gustafson’s scaled speedup law or Amdahl’s law, the starting point is that in most parallel problems the serial part stays constant even as the problem size increases. A proper speedup is reached as the ratio of the serial part to the parallel part of the implemented algorithm decreases. This perspective allows multi-core processors to reach a nearly linear speedup as the number of cores used in the parallel portions of the problem increases. Heterogeneity in multi-core processors may reach a better speedup than homogeneous multi-core processors for two main reasons. First, having one faster, more powerful core perform the serial portion makes it easy to hold down or shorten the serial part’s execution time. Second, the other cores can be simpler and faster, performing the parallel portion in a shorter time. Increasing the number of simple cores diminishes the effect of the serial part. Hence, according to Gustafson’s scaled speedup law, if a single powerful core that is four times faster than the other cores handles the serial portion of the program, we may get a speedup equal to:

S(P) = P - α((P - 1)/4)

where α is the serial fraction of the algorithm and P is the number of cores used in the parallel part. Hence, as P increases and as the relative power of the core performing the serial part increases, the effect of the serial part diminishes.
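
To get a feel for this formula, here is a minimal Python sketch that simply evaluates it. The function name `scaled_speedup` and the parameter `serial_boost` (how many times faster the serial core is, 4 in the text) are my own labels for illustration. With `serial_boost=1` the expression reduces to the standard Gustafson form P - α(P - 1).

```python
def scaled_speedup(p, alpha, serial_boost=4.0):
    """Evaluate S(P) = P - alpha * (P - 1) / serial_boost.

    p            -- number of cores used in the parallel part
    alpha        -- serial fraction of the algorithm (0..1)
    serial_boost -- assumed speed ratio of the serial core to a simple core
    """
    if p < 1 or not 0.0 <= alpha <= 1.0 or serial_boost <= 0:
        raise ValueError("need p >= 1, 0 <= alpha <= 1, serial_boost > 0")
    return p - alpha * (p - 1) / serial_boost


if __name__ == "__main__":
    # Compare identical cores (boost 1) with one serial core that is 4x faster.
    for cores in (4, 16, 64):
        plain = scaled_speedup(cores, 0.1, serial_boost=1.0)
        boosted = scaled_speedup(cores, 0.1, serial_boost=4.0)
        print(f"P={cores:3d}  identical cores={plain:6.2f}  fast serial core={boosted:6.2f}")
```

The gap between the two columns grows with P, which is the point made above: the faster the serial core, the less the serial part holds back the scaled speedup.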

Using Amdahl’s law, the speedup using heterogeneous multi-core processors is even better. The comparative speedups of a homogeneous design with 100 simple processors and a heterogeneous design with 91 processors, relative to a single simple processor, are:

Speedup (homogeneous) = 1/(0.1 + 0.9/100) = 9.2 times faster

Speedup (heterogeneous) = 1/(0.1/2 + 0.9/90) = 16.7 times faster

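These figures assume that 10% of the program’s run time gets no speedup on a 100-processor computer. We also assume that, to run the sequential code twice as fast, a single processor needs 10 times as many resources as a simple core, due to a bigger power budget, larger caches, a bigger multiplier, and so on. That is why the heterogeneous design has 91 processors: one large core occupying the resources of 10 simple cores, plus the 90 simple cores that remain.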
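
The two figures can be reproduced with Amdahl’s law in its usual form, where the serial and parallel time fractions add up in the denominator. The sketch below is only a calculator for that formula; the function name `amdahl_speedup` and the `serial_boost` parameter (how much faster the big core runs the serial code) are names I made up for illustration.

```python
def amdahl_speedup(serial_fraction, parallel_cores, serial_boost=1.0):
    """Speedup relative to one simple core under Amdahl's law.

    serial_fraction -- share of run time that cannot be parallelized
    parallel_cores  -- simple cores sharing the parallel part
    serial_boost    -- assumed speed ratio of the core running the serial part
    """
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction / serial_boost
                  + parallel_fraction / parallel_cores)


if __name__ == "__main__":
    # 100 identical simple cores, 10% serial code.
    homogeneous = amdahl_speedup(0.1, 100)
    # One big core (2x faster on serial code) plus 90 simple cores.
    heterogeneous = amdahl_speedup(0.1, 90, serial_boost=2.0)
    print(f"homogeneous:   {homogeneous:.1f}x")
    print(f"heterogeneous: {heterogeneous:.1f}x")
```

Note that the heterogeneous design wins even though it has ten fewer simple cores, because halving the serial time matters more than the small loss in parallel capacity.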

In addition, heterogeneous processor solutions can show significant advantages in power, delay and area. Processor instruction-set configurability is one approach to realizing the benefits of processor heterogeneity while minimizing the cost of software development and silicon implementation. However, it requires custom fabrication of each new design to realize the performance benefit, which is only economically justifiable for large markets, as with the Cell BE architecture that powers Sony’s new-generation PlayStation 3.

On the other hand, a single replicated processing element has many advantages; in particular, it offers ease of silicon implementation and a regular software environment. Managing heterogeneity in an environment with thousands of threads may make a difficult problem impossible.

Manufacturers of homogeneous multi-core processors are building on their serial processor architectures. They put two or more processors with the same instruction set on a single chip. These manufacturers are releasing different versions of these chips, which vary in cache sharing, cache levels, embedded threading, and core interconnection.

However, as the number of cores grows, the physical limitations and the maximum number of transistors that can be used on a single chip mean that processor cores will also differ from the ones we encounter right now. As parallelism increases and the processing load is distributed across more cores, those cores need to be simpler and faster, with fewer transistors allocated to each. So in order to get a proper increase, designers will have to simplify these cores. Of course, in this case not all cores will be identical; heterogeneous cores will be another necessity of the speedup requirements. Actually, some existing designs already support this model. The Cell Broadband Engine, for example, is one of the leading heterogeneous multi-core processors following this path. Also, if we consider GPUs as part of the host machine, they are another form of heterogeneous multi-core computing.

Cache Sharing Models
Cache sharing among processor cores largely drives their performance and communication patterns. Of course, cache sharing or separation affects the number of transistors and the overall power consumption, but here we will mainly discuss performance and communication. As shown in the figure below, there are two possible designs for cache allocation. In drawing (a) both cores use the same shared second-level cache (L2). The cache can be divided between cores either voluntarily or through a split hardcoded in each processor’s instruction set. In the first case the address space of the whole cache is accessible to both cores. In a processor with two cores, one core can voluntarily release some of its cache space to the other if the second core is running a memory-intensive task. A shared cache also has the advantage that cores can share and communicate data through it. However, in the same example, the two cores can compete for the available cache resources. This results in a lot of cache misses for both cores, each of which now costs a core several hundred execution cycles to retrieve data from the system’s memory. The second possible design is a separate cache per core, as shown in drawing (b). Each core has its own second-level cache, totally unaffected by the other core’s caching policies. Although each core’s cache miss rate is then independent, the cores lose the shared cache as a means of data communication. Creating an independent cache for each core also puts more circuits on the chip and imposes more power and heat overhead on the system.

The trend is not yet clear. Different processors use different models of cache sharing. For example, GPUs have a relatively deep cache hierarchy, around 3 levels of caching on the same die. In other processors, such as the Cell Broadband Engine, there is only one level of cache; each core has its own cache, managed independently. The disadvantage of this model is the limited cache that can be allocated to each core. The Cell processor, for example, has only 256 KB of cache per core for both data and instructions. In shared cache architectures, however, more cache can be placed on the chip, and one core can possibly get more of it if other cores yield their share. This depends on the cache allocation policy.

I think the payoff of cache sharing or separation depends on the processor’s usage pattern. If running threads do a lot of sharing and communication, a shared cache model would be the best. However, a shared cache, or shared memory model, has diminishing returns as the number of threads increases, and with it the contention on the cache. You should be careful about this. GPUs ease this problem by having more hierarchy levels, which allow different levels of data cache sharing. This architectural decision should decrease contention. On the other hand, separate caches should allow each core to work independently and manage its cache according to its current workload. However, this makes sharing data more difficult, since it must be explicit. It also adds more load on other shared resources, such as the core interconnect, to move data among these separate cache areas.

Cache Management and Control
Cache management and control is one of the architectural aspects that single-core processors abstracted away. Each generation of single-core processors added more cache and improved branch prediction. Explicit cache management and control was an optional feature, and many programming models and tools overlooked cache management techniques to keep the programming model simple and easy for programmers. However, multi-core processors are becoming more pervasive. They are now manufactured for both general and specialized computing models. Moreover, parallelizing applications on multi-core architectures is still in its early stages. Neither processor designers nor designers of massively parallel applications know for sure which cache management techniques are best. Therefore, processor manufacturers are giving more cache management and control capabilities back to programmers. No solid design patterns have been proven, and no frameworks have been implemented and tested, that reveal the best general or automated cache management techniques. It is still possible to use hardware-managed cache in processors with few cores, i.e. two or four, as long as they run general computing applications independently of each other. They can then operate much like Symmetric Multi-Processor (SMP) machines. However, where there is application-specific interdependency between the processor’s cores, it is better to provide explicit cache management mechanisms. Cell Broadband Engine (CBE) processors are the first generation of multi-core processors that give full per-core cache management and control to application developers. Although this makes the programming model harder than the conventional one, which is backed by automated cache management mechanisms, it gives application programmers a powerful tool to hide memory latency by queuing memory load and store requests and using double or quad buffering techniques.
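
The double buffering idea can be sketched in a few lines. This is not Cell SDK code: it is a plain Python stand-in in which a single background thread plays the role of the DMA engine, and `fetch_chunk` and `double_buffered_sum` are hypothetical names I chose for illustration. The pattern is what matters: the load of chunk i+1 is queued before the work on chunk i starts, so the transfer and the computation overlap.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_chunk(data, start, size):
    """Stand-in for an asynchronous load of one chunk into a local buffer."""
    return data[start:start + size]


def double_buffered_sum(data, chunk_size):
    """Sum `data` chunk by chunk, loading the next chunk while the
    current one is being processed."""
    starts = list(range(0, len(data), chunk_size))
    if not starts:
        return 0
    total = 0
    # One worker thread acts as the (assumed) transfer engine.
    with ThreadPoolExecutor(max_workers=1) as loader:
        pending = loader.submit(fetch_chunk, data, starts[0], chunk_size)
        for i in range(len(starts)):
            current = pending.result()      # wait for the buffer in flight
            if i + 1 < len(starts):
                # Queue the next load before computing on the current buffer.
                pending = loader.submit(fetch_chunk, data,
                                        starts[i + 1], chunk_size)
            total += sum(current)           # compute overlaps the next load
    return total


if __name__ == "__main__":
    print(double_buffered_sum(list(range(1000)), 64))
```

With two buffers the core only stalls when a load takes longer than the computation on the previous chunk; quad buffering extends the same idea by keeping more loads in flight.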

Cache Size and Hierarchy
First-level (L1) and second-level (L2) caches have always been used in single-core processors. These cache levels were mandatory to hide memory access latency. Many multi-core manufacturers still use the same hierarchical caching scheme to hide memory latency. However, multi-core processors now generate more contention on the system’s main memory. Their cores have different access patterns to different memory locations. Although multiple levels of cache would enhance data sharing among cores, they may also dramatically increase cache misses. On the other hand, if each core has its own independent cache hierarchy, this would significantly increase the number of transistors. It would also add to the memory management complexity. Most multi-core processors with 4 or more cores use small caches and a shallow hierarchy. A reduced cache size frees up space on the chip for more cores. More cores mean more overlapped memory accesses, which naturally hides memory latency. I think a shallow hierarchy reduces the cost of cache misses and simplifies cache management mechanisms. For example, the Cell BE processor has eight specialized small cores (SPEs); each has only a 256 KB L1 cache (called the Local Store or LS), managed explicitly by the programmer using Direct Memory Access (DMA) commands. This architecture enables the Cell BE to reach a memory-processor bandwidth of 25.5 GB/s. I’m still investigating the effect, in GPGPUs, of having multiple hierarchies for different groups of cores inside the microprocessor.

Cores Interconnections
At the level of the physical hardware interconnect, multi-cores initially employed buses or crossbar switches between the cores and cache banks, as shown in the figure below. However, such solutions do not scale to thousands of cores. We need on-chip topologies that scale close to linearly with system size to prevent the complexity of the interconnect from dominating the cost of many-core systems. Scalable on-chip communication networks will borrow ideas from larger-scale packet-switched networks. Chip implementations such as the IBM Cell already employ multiple ring networks to interconnect the nine processors on the chip, and they use software-managed memory to communicate between the cores rather than conventional cache-coherency protocols.
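
To make the scaling argument a bit more concrete, here is a rough counting model in Python. It is a back-of-the-envelope sketch, not a description of any vendor's design, and it rests on two assumptions of mine: a full crossbar needs one switch point per core-bank pair, and a ring is bidirectional so a message takes the shorter way around. The function names are hypothetical.

```python
def crossbar_crosspoints(cores, banks):
    """Switch points in a full crossbar: one per (core, cache bank) pair."""
    return cores * banks


def ring_average_hops(nodes):
    """Average shortest-path hop count between two distinct nodes
    on a bidirectional ring."""
    if nodes < 2:
        raise ValueError("a ring needs at least two nodes")
    distances = [min(d, nodes - d) for d in range(1, nodes)]
    return sum(distances) / len(distances)


if __name__ == "__main__":
    for n in (4, 9, 64, 1024):
        print(f"{n:5d} nodes: crossbar switch points = "
              f"{crossbar_crosspoints(n, n):8d}, "
              f"ring average hops = {ring_average_hops(n):7.2f}")
```

In this model the crossbar's hardware cost grows quadratically with the number of cores, while a single ring stays cheap but its latency grows linearly. That trade-off is one way to see why designs move toward several rings or segmented on-chip networks as the core count rises.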

Again, it is a little bit confusing, because both models, on-chip network topologies and implicit interconnection through shared data, are still developing for multi-core processors, and each shows promising performance. In GPGPUs, for example, shared cache is the only way to link cores together. In other processors, however, such as IBM’s Cell and Intel’s Larrabee, an on-chip ring topology links the cores together. I see both architectures as scalable, by introducing multiple communication hierarchies in the shared cache model and by having multiple on-chip network segments to reduce contention. It is quite difficult to say which one will dominate, because right now we are in the middle of a battle of core interconnection standards and technologies. IBM is now investigating optical interconnection between processors to reduce power consumption and increase performance. NVIDIA, on the other hand, is investigating different types of cache and different topologies of cache hierarchy.
I think that’s enough for this week 🙂
Next week I will quickly discuss some of my findings while working with the Cell Broadband Engine and GPGPUs.