Please keep in mind that what I write in this blog is the result of my own research and of tracking different technologies in the market and in academia right now. These posts may become obsolete quickly, depending on how fast the technology changes. I will have done a good job if my expectations still hold three to five years from now!
This week, I’ll discuss the general aspects of multi-core processors. I’ll try to explain the anatomy of multi-core processors based on their short history and current developments.
There are many points worth comparing between homogeneous and heterogeneous multi-core processors. Whether we use Gustafson’s scaled speedup law or Amdahl’s law, in most parallel problems the serial part stays constant even as the problem size increases. A good speedup is therefore reached as the ratio of the serial part to the parallel part of the implemented algorithm decreases. From this perspective, multi-core processors can reach a near-linear speedup as the number of cores used in the parallel portions of the problem increases. Heterogeneous multi-core processors may reach a better speedup than homogeneous ones for two main reasons. First, having one faster, more powerful core perform the serial portion makes it easier to hold constant or shorten the serial execution time. Second, the other cores can be simpler and faster, so they complete the parallel portion in less time. Increasing the number of simple cores diminishes the effect of the serial part. Hence, according to Gustafson’s scaled speedup law, if a single powerful core that is four times faster than the other cores handles the serial portion of the program, we may reach a speedup equal to:
where α is the serial fraction of the algorithm and P is the number of cores used in the parallel part. As P increases, and as the relative power of the core performing the serial part increases, the effect of the serial part diminishes.
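To make the scaled-speedup argument concrete, here is a minimal Python sketch. It assumes the textbook form of Gustafson's law, speedup = α + P(1 − α), with α measured as the serial fraction of the run time on the parallel machine. The `serial_core_speed` factor, which models a serial core that is k times faster than the simple cores, is an illustrative extension of my own, and the function name is hypothetical.

```python
def scaled_speedup(serial_fraction, cores, serial_core_speed=1.0):
    """Gustafson-style scaled speedup relative to one simple core.

    serial_fraction   -- alpha, share of parallel-machine run time spent in serial code
    cores             -- P, number of simple cores running the parallel part
    serial_core_speed -- k, assumed speed of the serial core relative to a simple core
    """
    parallel_fraction = 1.0 - serial_fraction
    # On one simple core the serial work would take k * alpha and the
    # parallel work P * (1 - alpha), while the parallel machine takes 1.
    return serial_core_speed * serial_fraction + cores * parallel_fraction


for cores in (10, 100, 1000):
    plain = scaled_speedup(0.1, cores)
    boosted = scaled_speedup(0.1, cores, serial_core_speed=4.0)
    print(f"P={cores:5d}  same cores: {plain:7.1f}  4x serial core: {boosted:7.1f}")
```

Under these assumptions the speedup grows almost linearly with the number of cores, which is the behavior the scaled speedup law predicts.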
Using Amdahl’s law, the speedup of heterogeneous multi-core processors looks even better. The comparative speedups of a homogeneous design with 100 simple processors and a heterogeneous design with 91 processors, relative to a single simple processor, are:
Speedup (homogeneous) = 1 / (0.1 + 0.9/100) ≈ 9.2 times faster
Speedup (heterogeneous) = 1 / (0.1/2 + 0.9/90) ≈ 16.7 times faster
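These figures assume that 10% of the program’s execution time gets no speedup on a 100-processor computer. They also assume that, to run the sequential code twice as fast, a single processor needs 10 times as many resources as a simple core, due to a bigger power budget, larger caches, a bigger multiplier, and so on. This is why the heterogeneous design has 91 processors: one large core that takes the resources of 10 simple cores, plus 90 simple cores.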
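The arithmetic above can be checked with a few lines of Python. This is a minimal sketch of Amdahl's law under the same assumptions (10% serial time, a serial core twice as fast in the heterogeneous case); the function and parameter names are my own.

```python
def amdahl_speedup(serial_fraction, parallel_cores, serial_core_speed=1.0):
    """Speedup over one simple core according to Amdahl's law.

    serial_fraction   -- share of the run time that cannot be parallelized
    parallel_cores    -- number of simple cores that share the parallel part
    serial_core_speed -- speed of the core running the serial part,
                         relative to a simple core
    """
    parallel_fraction = 1.0 - serial_fraction
    serial_time = serial_fraction / serial_core_speed
    parallel_time = parallel_fraction / parallel_cores
    return 1.0 / (serial_time + parallel_time)


# 100 identical simple cores.
homogeneous = amdahl_speedup(0.1, 100)
# One big core (twice as fast on serial code) plus 90 simple cores.
heterogeneous = amdahl_speedup(0.1, 90, serial_core_speed=2.0)

print(f"homogeneous:   {homogeneous:.1f}x")    # 9.2x
print(f"heterogeneous: {heterogeneous:.1f}x")  # 16.7x
```

Note that both denominators add the serial and parallel times; the speedup is limited by their sum.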
In addition, heterogeneous processor solutions can show significant advantages in power, delay, and area. Processor instruction-set configurability is one approach to realizing the benefits of processor heterogeneity while minimizing the cost of software development and silicon implementation. However, it requires custom fabrication of each new design to realize the performance benefit, which is economically justifiable only for large markets, as with the Cell BE architecture that powers Sony’s PlayStation 3.
On the other hand, a single replicated processing element has many advantages; in particular, it offers ease of silicon implementation and a regular software environment. Managing heterogeneity in an environment with thousands of threads may make a difficult problem impossible.
Manufacturers of homogeneous multi-core processors are building on their serial processor architectures. They place two or more processors with the same instruction set on a single chip, and they are releasing different versions of these chips. The variations are in cache sharing, cache levels, embedded threading, and core interconnection.
However, as the number of cores grows, the physical limitations and the maximum number of transistors that can be used on a single chip mean that processor cores will also differ from the ones we see right now. As long as parallelism needs to increase and the processing load needs to be distributed across more cores, those cores have to be simpler and faster, with fewer transistors allocated to each. So, to get a proper increase in core count, designers will have to simplify the cores. Of course, in this case not all cores will be identical; heterogeneous cores will be another necessity imposed by the speedup requirements. In fact, some existing designs already support this model. The Cell Broadband Engine, for example, is one of the leading heterogeneous multi-core processors following this path. Also, if we consider GPUs part of the host machine, they are another form of heterogeneous multi-core computing.
Cache Sharing Models
Cache sharing among processor cores mainly drives their performance and communication patterns. Of course, sharing or separating caches also affects the transistor count and overall power consumption, but here we will mainly discuss performance and communication. As shown in the figure below, there are two possible designs for cache allocation. In drawing (a), both processors use the same shared second-level (L2) cache. The cache can be distributed between the cores either voluntarily or hardcoded in each processor’s instruction set. In the first case, the address space of the whole cache is accessible to both processors. In a processor with two cores, one core can voluntarily release some of its cache space to the other if that second core is running a memory-intensive task. A shared cache also lets the cores share and communicate data through the shared cache space. However, in the same example, the two cores can compete for the available cache resources. This results in many cache misses for both cores, each of which costs several hundred execution cycles to retrieve data from system memory. The second possible design is a separate cache per core, as shown in drawing (b). Each core has its own second-level cache, totally unaffected by the other core’s caching policies. Although each core is then independent in its cache miss rate, the cores lose the shared cache as a means of data communication. Creating an independent cache for each core also introduces more circuitry on the chip and imposes more power and heat overhead on the system.
I think the payback of cache sharing or separation depends on the processor’s usage pattern. If the running threads do a lot of sharing and communication, a shared cache model is the best fit. However, a shared cache, like any shared memory model, has diminishing returns as the number of threads, and consequently the contention on the cache, increases. You should be careful about this. GPUs ease this problem by adding more levels of hierarchy, which allow different levels of data cache sharing. Such an architectural decision should decrease contention. On the other hand, separate caches allow each core to work independently and manage its cache according to its current workload. However, they make sharing data more difficult, since sharing must then be explicit. Such an architectural decision also adds more load on other shared resources, such as the core interconnect, to move data among the separate caches.
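The trade-off between shared and separate L2 caches can be illustrated with a toy simulation. The sketch below is only a model under strong assumptions of my own: a fully associative cache with LRU replacement, 8 cache lines in total, two cores that never touch the same data, and two made-up access patterns. It does not describe any real processor.

```python
from collections import OrderedDict


class LRUCache:
    """Toy fully associative cache with least-recently-used replacement."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()

    def access(self, address):
        """Touch one cache line; return True on a hit, False on a miss."""
        if address in self.lines:
            self.lines.move_to_end(address)  # mark as most recently used
            return True
        if len(self.lines) >= self.capacity:
            self.lines.popitem(last=False)   # evict the least recently used line
        self.lines[address] = None
        return False


def count_misses(trace, cache_of_core):
    """Replay (core, address) accesses in order and count misses per core."""
    misses = {core: 0 for core in cache_of_core}
    for core, address in trace:
        # Tag the address with the core id so the cores never share data.
        if not cache_of_core[core].access((core, address)):
            misses[core] += 1
    return misses


def shared_l2(total_lines):
    """Design (a): both cores use one cache with all the lines."""
    cache = LRUCache(total_lines)
    return {0: cache, 1: cache}


def private_l2(total_lines):
    """Design (b): each core owns half of the lines."""
    return {0: LRUCache(total_lines // 2), 1: LRUCache(total_lines // 2)}


STEPS = 600

# Scenario 1: core 0 loops over 6 lines, core 1 over only 2 lines.
uneven = []
for step in range(STEPS):
    uneven.append((0, step % 6))
    uneven.append((1, step % 2))

# Scenario 2: core 0 loops over 4 lines while core 1 streams through
# new lines at twice the rate and never reuses them.
streaming = []
for step in range(STEPS):
    streaming.append((0, step % 4))
    streaming.append((1, 2 * step))
    streaming.append((1, 2 * step + 1))

uneven_shared = count_misses(uneven, shared_l2(8))
uneven_private = count_misses(uneven, private_l2(8))
streaming_shared = count_misses(streaming, shared_l2(8))
streaming_private = count_misses(streaming, private_l2(8))

print("uneven load,    shared :", uneven_shared)
print("uneven load,    private:", uneven_private)
print("streaming load, shared :", streaming_shared)
print("streaming load, private:", streaming_private)
```

In the first scenario the shared cache wins, because the lightly loaded core effectively lends its unused lines to the memory-intensive one. In the second scenario the separate caches win, because the streaming core can no longer evict its neighbor's working set.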
Cache Management and Control
Cache management and control is one of the architectural aspects that single-core processors abstract away. Each generation of single-core processors added more cache and improved branch prediction. Explicit cache management and control was an optional feature, and many programming models and tools overlooked cache management techniques to keep the programming model simple and easy for programmers. However, multi-core processors are becoming more pervasive. They are now manufactured for both general and specialized computing models. Moreover, parallelizing applications on multi-core architectures is still in its early stages. Neither processor designers nor designers of massively parallel applications know for sure which cache management techniques are best. Therefore, processor manufacturers are giving more cache management and control capabilities back to programmers. No solid design patterns have been proven, and no frameworks have been implemented and tested, that reveal the best general or automated cache management techniques. It is still possible to use a hardware-managed cache in processors with few cores, i.e. two or four, as long as they run general computing applications independently of each other. They can then operate much like Symmetric Multi-Processor (SMP) machines. However, where there is application-specific interdependency between a processor’s cores, it is better to provide explicit cache management mechanisms. Cell Broadband Engine (CBE) processors are the first generation of multi-core processors that give full per-core cache management and control to application developers. Although this makes the programming model harder than the conventional one, which is backed by automated cache management mechanisms, it gives application programmers a powerful tool to hide memory latency by queuing memory load and store requests and by using double or quad buffering techniques.
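The double buffering idea mentioned above can be sketched in a few lines. This is not Cell SDK code: it is a minimal Python illustration in which a single background thread stands in for the DMA engine, `fetch_block` is a hypothetical stand-in for an asynchronous transfer into the local store, and summing the data stands in for the real computation.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_block(memory, start, size):
    """Stand-in for an asynchronous DMA transfer from main memory."""
    return memory[start:start + size]


def sum_with_double_buffering(memory, block_size):
    """Process memory block by block while the next block is being fetched."""
    total = 0
    with ThreadPoolExecutor(max_workers=1) as dma:
        # Queue the first transfer before any computation starts.
        pending = dma.submit(fetch_block, memory, 0, block_size)
        for start in range(0, len(memory), block_size):
            current = pending.result()  # wait for the buffer being filled
            next_start = start + block_size
            if next_start < len(memory):
                # Start filling the second buffer before computing on the first.
                pending = dma.submit(fetch_block, memory, next_start, block_size)
            total += sum(current)       # compute overlaps with the next fetch
    return total


data = list(range(1000))
print(sum_with_double_buffering(data, block_size=64))  # 499500
```

The point of the pattern is the ordering inside the loop: the request for the next block is queued before the current block is processed, so the transfer time is hidden behind the computation.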
Cache Size and Hierarchy
First-level (L1) and second-level (L2) caches have always been used in single-core processors. These cache levels were mandatory to hide memory access latency. Many multi-core manufacturers still use the same hierarchical caching scheme to hide memory latency. However, multi-core processors now generate more contention on the system’s main memory, and their cores have different access patterns to different memory locations. Although multiple levels of cache would enhance data sharing among processors, they may also dramatically increase cache misses. On the other hand, giving each core its own independent cache hierarchy would significantly increase the number of transistors and add to the complexity of memory management. Most multi-core processors with four or more cores use small caches and a shallow hierarchy. A reduced cache size frees more space on the chip for more cores. More cores mean more overlapped memory accesses, which naturally hides memory latency. I think a shallow hierarchy reduces the cost of cache misses and simplifies cache management mechanisms. For example, the Cell BE processor has eight small specialized cores (SPEs); each has only a 256 KB L1 cache (called the Local Store, or LS) managed explicitly by the programmer using Direct Memory Access (DMA) commands. This architecture enabled the Cell BE to reach 25.6 GB/s of memory-processor bandwidth. I’m still investigating the effect, in GPGPUs, of having multiple hierarchies for different groups of cores inside the microprocessor.
At the level of the physical hardware interconnect, multi-core processors initially employed buses or crossbar switches between the cores and cache banks, as shown in the figure below. However, such solutions do not scale to thousands of cores. We need on-chip topologies that scale close to linearly with system size, to prevent the complexity of the interconnect from dominating the cost of many-core systems. Scalable on-chip communication networks will borrow ideas from larger-scale packet-switched networks. Chip implementations such as the IBM Cell already employ multiple ring networks to interconnect the nine processors on the chip, and they use software-managed memory to communicate between the cores rather than conventional cache-coherency protocols.
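A rough way to see why crossbars do not scale is to count switching elements and links. The Python sketch below uses simple textbook counts and assumes, for illustration only, one cache bank per core and a square 2D mesh as an example of a packet-switched topology. Real interconnect cost also depends on wire length, link width, and router complexity, which this ignores.

```python
def crossbar_crosspoints(cores, cache_banks):
    """A full crossbar needs one crosspoint per (core, bank) pair."""
    return cores * cache_banks


def ring_links(nodes):
    """A single ring (three or more nodes) has one link per node."""
    return nodes


def mesh_links(rows, cols):
    """A 2D mesh links each node to its horizontal and vertical neighbors."""
    return rows * (cols - 1) + cols * (rows - 1)


for side in (4, 16, 32):
    cores = side * side
    print(
        f"{cores:5d} cores: "
        f"crossbar {crossbar_crosspoints(cores, cores):8d} crosspoints, "
        f"ring {ring_links(cores):5d} links, "
        f"mesh {mesh_links(side, side):5d} links"
    )
```

The crossbar count grows with the square of the number of cores, while the ring and mesh counts grow roughly linearly, which is the scaling behavior the paragraph above asks for.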