Future of Multi-Core processors

If you are taking computer architecture classes, studying electronics, or doing research related to microprocessors, you may have heard of warnings about the end of Moore’s law: microprocessors will soon not be able to double their performance every 18 months for some physical limits related to making transistors smaller and keeping them fairly efficient in power consumption.

Microprocessors depended mainly on three main factors to keep Moore’s law in effect: (1) Reducing transistor size, so that we can have more in the same area with more sophisticated execution logic and be able to fit in more cache, (2) Increasing Transistor frequency, to execute more instructions, and (3) Economics of manufacturing, to keep the next generation of microprocessors affordable to everyone. Right now it is difficult to cram more transistors due to current limits on lithography. Also, as transistors get smaller and operate at higher frequencies, their power consumption is increasing at greater rates than the increase in performance. Finally, the manufacturing cost is increasing astronomically as we move from one generation to another.

I think Moore’s law may not live as it stands right now. The pattern may keep going by through different means. Here is my stab on it:

Reconsidering the execution pipelines to have shorter latency time per instruction.

Increasing the number of cores, which increase the overall throughput. This is possible through making a better use of the total number of transistors that can fit in one chip. It is possible to work since pipelines should be of less depth.

Homogeneity of instructions set and heterogeneity in implementations. For example, a multi-core processor may have 32 cores with same arithmetic and logical operations instructions, but only two or four of them implement other system control instructions, such as protected mode and interrupt handling instructions. Applications may not need radical rewriting in this case. Actually we can automate the process of migrating these traditional multi-threaded applications to this new heterogeneous architecture.


Welcome to the sixth part summarizing the roadmap document of the international exascale software project (IESP). This part is moving higher in the software stack and discussed the challenges, technology drivers and suggested research agenda for the exascale applications. It focuses on algorithms, data analysis and visualization, and scientific data management.


Original authors of this subsection are: Bill Gropp (UIUC), Fred Streitz (LLNL), Mike Heroux (SNL), Anne Trefethen (Oxford U., UK)

Authors see the following important drivers for the exascale systems algorithms:

  • Scalability to millions or billions of threads working collaboratively or independently.
  • Fault tolerance and fault resilience. Algorithms should be able to detect faults inside systems and recover from them with least possible resources away from the current traditional check pointing techniques.
  • Match algorithm’s compute demands with available capabilities. Authors are mainly focusing on power aware algorithms. This is attainable by considering accurate performance modeling, utilizing heterogeneous architectures, awareness of memory hierarchy and performance constraints of each component inside the exascale systems.

Authors are then discussing the critical challenges. However, they stress the importance of developing performance models against which algorithm developments can be evaluated.

I could not find in this section clear research agenda. However, authors are suggesting the following critical challenges for applications in the era of exascale computing:

  • Gap analysis – need to perform a detailed analysis of the applications, particularly with respect to quantitative models of performance and scalability.
  • Scalability, particularly relaxing synchronization constraints
  • Fault tolerance and resilience, including fault detection and recovery
  • Heterogeneous systems – algorithms that are suitable for systems made of functional units with very different abilities

Data Analysis and Visualization

Original contributors of this subsection are: Michael E. Papka (ANL), Pete Beckman (ANL), Mark Hereld (ANL), Rick Stevens (ANL), John Taylor(CSIRO, Australia)

According to Wikipedia Visualization is the transformation, selection or representation of data from simulations or experiments, with an implicit or explicit geometric structure, to allow the exploration, analysis and understanding of the data. As simulations become more complex generating more data, visualization is gaining more importance in scientific and high performance computing. The authors accordingly believe that the following are the possible alternative R&D options:

  • Develop new analysis and visualization algorithms to fit well within the new large and complex architectures of exascale computing.
  • Identify new mathematical and statistical approaches for data analysis
  • Develop integrated adaptive techniques to enable on the fly and learned pattern performance optimization from fine to coarse grain.
  • Expand the role of supporting visualization environments to include more pro-active software: model and goal aware agents, estimated and fuzzy results, and advanced feature identification.
  • would be advantageous to invest in methods and tools for
  • Keep track of the processes and products of exploration and discovery. These will include aids to process navigation, hypothesis tracking, workflows, provenance tracking, and advanced collaboration and sharing tools.
  • Plan deployment of global system of large scale high resolution (100 Mpixel) visualization and data analysis systems based on open source architecture to link universities and research laboratories

Hence, authors are shooting for the following R&D strategy to make data analysis and visualization develop at the right pace of developing the new exascale systems:

  • Identification of features of interest in exabytes of data
  • Visualization of streams of exabytes of data from scientific instruments
  • Integrating simulation, analysis and visualization at the exascale

Scientific Data Management

Original contributor of this subsection is: Alok Choudhary (Northwestern U.)

Managing scientific data has been identified by the scientific community as one of the most important emerging needs because of the sheer volume and increasing complexity of data. Effectively generating, managing, and analyzing this information requires a comprehensive, end-to-end approach to data management that encompasses all of the stages from the initial data acquisition to the final analysis of the data. Although file systems are scalable now to store huge number of files and scientific data, this is considered not enough for the scientific computing community. Another layer should be built on top of these file systems to provide easy tools to store, retrieve and analyze huge data sets generated by the scientific analysis algorithms.

Accordingly, the author is suggesting considering these alternatives and R&D strategies to build effective scientific data management tools and frameworks:

  • Building new data analysis and mining tools for knowledge discovery from massive datasets produced and/or collected.
  • Designing scalable workflow tools with easy-to-use interfaces would be very important for exascale systems both for performance and productivity of scientists as well as effective use of these systems.
  • Investigate new approaches to build database systems for scientific computing that scale in performance, usability, query, data modeling and an ability to incorporate complex data types in scientific applications; and that eliminate the over-constraining usage models which are impediments to scalability in traditional databases.
  • Design new Scalable Data Format and High-level Libraries without binding these efforts with the backward compatibility. Exascale systems will use different I/O architectures and different processing paradigm. Spending significant effort on backward compatibility would put these systems’ viability on the edge.
  • New technology is required for efficient and scalable searching and filtering of large-scale scientific multivariate datasets with hundreds of searchable attributes to deliver the most relevant data and results would be important.
  • Wide-area data access is becoming an increasingly important part of many scientific workflows. In order to most seamlessly interact with wide-area storage systems, tools must be developed that can span various data management techniques across wide area integrated with scalable I/O, workflow tools, query and search techniques.

This posting is part of a series summarizing the roadmap document of the Exascale Software Project:

This is the final post summarizing the fourth section of the IESP Roadmap document. It is discussing some important crosscutting dimensions of the upcoming exascale systems. It discusses areas of a concern of all users and engineers of exascale system. This section focuses on: (1) Resilience, (2) Power Management, (3) Performance Optimization, and (4) Programmability. Although they are critical it is very difficult to study them independently from all other components of the exascale systems. I think they should be integral parts of each software layer in next generation of HPC systems. They are already thought of in current and past HPC systems, but they are done at very limited scale. For example, power management is considered only at the OS layer and performance optimization at the application level.


Original contributors of this subsection are: Franck Cappello (INRIA, FR), Al Geist (ORNL) Sudip Dosanjh (SNL), Marc Snir (UIUC), Bill Gropp (UIUC), Sanjay Kale (UIUC), Bill Kramer (NCSA), Satoshi Matsuoka (TITECH), David Skinner (NERSC)

Before summarizing this section I followed some of the authors and found out an interesting white paper by same authors, except Satoshi Matsuoka, discussing software resilience in more depth. I highly recommend you to read it. Its title is: Toward Exascale Resilience, and you can find it here.

The main upcoming challenge in building resilient systems for the era of exascale computing is the inapplicability of traditional checkpoint/restart techniques. Having millions of threads would consume considerable time, space, and emerging to checkpoint their states. New resilience techniques are required to minimize overheads of resilience. Given this general picture, authors believe the following would be the main drivers of R&D in resilient exascale computing:

  • The increased number of components (hardware and software components) will increase the likelihood of having failures even in short execution tasks.
  • Silent soft errors will become significant and raise the issues of result and end-to-end data correctness.
  • New storage and memory technologies, such as the SSD and phase change memory, will bring with them great opportunities for faster and more efficient state management and check-pointing.

I recommend you to read the authors’ original white paper to read more about these challenges.

Authors also did a quick gap analysis to quickly pinpoint in more details areas of fault-tolerance that need rethinking. Among these points:

  • The most common programming model, MPI, does not offer a paradigm for resilient programming.
  • Most of the present applications and system software are not fault tolerant nor fault aware and are not designed to confine errors/faults.
  • Software layers are lacking communication and coordination to handle faults at different levels inside the system.
  • Deeper analysis of the root causes of different faults is mandatory to find efficient solutions.
  • Efficient verifications of global results from long executions are missing as well.
  • Standardized matrices to measure and compare resilience of different applications against are missing so far.

Authors see many possibilities and a lot of complexities to reach resilient exascale systems. However, they conclude of focusing research in two main threads:

  • Extend the applicability of rollback toward more local recovery.
  • Fault avoidance and fault oblivious software to limit the recovery from rollback.

Power Management

Original contributors of this subsection are: John Shalf (LBNL), Satoshi Matsuoka (TITECH, JP)

Power management for exascale systems is to keep best attainable performance with minimum power consumption. This comprises allocating power to system components actively involved in application or algorithm execution. According to the authors, existing power management infrastructure has been derived from consumer electronic devices, and fundamentally never had large-scale systems in mind. Existence of cross-cutting power management infrastructure is mandatory. Absence of such infrastructure will force the reduction of exascale systems scale and feasibility. For large HPC systems power is part of the total-cost-of-ownership. It will be a critical part of exascale systems management. Accordingly, authors are proposing two alternative R&D strategies:

  • Power down components when they are underutilized. For example, the OS can reduce the frequency and operating voltage of a hardware component when it is not used for relatively long time.
  • Explicitly manage data movement, which is simply avoid unnecessary data movement. This should reduce power consumption in networks, hard-disks, memory, etc.

Authors suggest five main research areas for effective power management inside exascale systems:

  • OS based power management. Authors believe that two changes should be considered: (1) Fair shared resources management among hundreds or thousands of processors on the same machine, (2) Ability to manage power levels for heterogeneous architectures inside the same machine, such as GPGPUs
  • System-Scale Resource Management. Standard interfaces need to be developed allowing millions of cores work in complete synchrony to implement effective power management policies.
  • Algorithms. Power aware algorithms are simply those algorithms that would reduce communication overhead for each FLOP. Libraries should be considered to articulate the tradeoffs between communication, power, and FLOPs.
  • Libraries. According to the authors, library designers need to use their domain-specific knowledge of the algorithm to provide power management and policy hints to the power management infrastructure.
  • Compilers. Compilers should make it easier to program for power management by automatically instrument code for power management.
  • Applications. Applications should provide power aware systems and libraries hints about their power related policies for best power optimization.

Given these possible research areas going across the whole software stack, authors believe that the following should be the key metrics to get effectively manage power consumption of exascale systems:

  • Performance. Ability to predict execution pattern inside applications would help in reducing power consumption while attaining the best possible performance.
  • Programmability. Applications developers are not expected to do power management explicitly inside their applications. Coordination between all layers of the software stack should be possible for power management.
  • Composability. Power management components built by different teams should be able to work in harmony when it comes to power management.
  • Scalability, which requires integration of power management information for system wide power management policies.

Performance Optimization

Original contributors of this subsection are: Brend Mohr (Juelich, DE), Adolfy Hoisie (LANL), Matthias Mueller (TU Dresden, DE), Wolfgang Nagel (Dresden, DE), David Skinner (LBL) Jeffrey Vetter (ORNL)

That’s one of my favorite subsections. Expected increase in hardware and software stack complexity makes performance optimization a very complex task. Having millions or billions of threads working on the same problem requires different ways to measure and optimize performance. Authors believe that these areas are important in performance optimization for exascale systems: statistical profiling, techniques like automatic or automated analysis, advanced filtering techniques, on-line monitoring, clustering and analysis as well as data mining. Also authors believe that self-monitoring, self-tuning frameworks, middle ware, and runtime schedulers, especially at node levels, are necessary. Capturing system’s performance under constraints of power and reliability need to be radically changed. Significant overhead may take place to aggregate performance measurements and analyze them while system is running if not properly designed with the new tools. Authors believe that the complexity of exascale systems makes performance optimization in many configurations beyond humans’ manual abilities to monitor and optimize performance. They see that auto-tuning will be an important technique for performance optimization. Hence, authors believe that research in performance optimization should be directed to these areas:

  • Support for modeling, measurement, and analysis of heterogeneous hardware systems.
  • Support for modeling, measurement and analysis of hybrid programming models (mixing MPI, PGAS, OpenMP and other threading models, accelerator interfaces).
  • Automated / automatic diagnosis / autotuning.
  • Reliable and accurate performance analysis in presence of noise, system adaptation, and faults requires inclusion of appropriate statistical descriptions.
  • Performance optimization for other metrics than time (e.g. power).
  • Programming models should be designed with performance analysis in mind. Software and runtime systems must expose their model of execution and adaptation, and its corresponding performance through a (standardized) control mechanism in the runtime system.


Original contributors of this subsection are: Thomas Sterling (LSU), Hiroshi Nakashima (Kyoto U., JP)

Programmability of exascale systems is another critical factor for their success. It is quite difficult to benchmark it and find a baseline to set and measure our objectives in this area. However, authors identified the following basic challenges of systems’ programmability:

  • Massive parallelism though millions or billions of concurrent collaborating threads.
  • Huge number of distributed resources and difficulty of allocation and locality management.
  • Latency hiding by overlapping computations with communications.
  • Hardware Idiosyncrasies. Different models will emerge with significant differences in ISA, memory hierarchy, etc.
  • Portability. Application programs must be portable across machine types, machine scales, and machine generations.
  • Synchronization Bottlenecks of millions of threads trying to synchronize control or data access.
  • Data structures representation and distribution.

If you have read the other postings summarizing rest of these document, you will realize how complicated programmability is. It is cross cutting all the layers of the software stack, starting from ISA & operating systems and ending with applications. Going through author’s suggested research agenda, I found out that they are recommending all R&D directions proposed by the rest of the authors in their corresponding stack layer/component. I would recommend you to read the other related posting to realize challenges waiting for researchers to make exascale systems easier to program and utilize.

This posting is part of a series summarizing the roadmap document of the Exascale Software Project:

In my last posting, I summarized three areas of the Systems Software chapter from the International Exascale Software Project (IESP). I continue in this posting and summarize for you the remaining two areas: System Management, and External Environment.

System Management

Original contributors of this section are: Robert Wisniewski (IBM) and Bill Kramer (NCSA)

The authors are dividing the systems management into five subareas that need reconsideration for the exascale computing:

  1. Resource control and scheduling, which includes configuring, start-up and reconfiguring the machine, defining limits for resource capacity and quality, provisioning the resources, and workflow management.
  2. Security, which includes authentication and authorization, integrity of the system, data integrity, and detecting anomalous behavior and inappropriate use.
  3. Integration and test, which involves managing and maintaining the health of the system and performing continuous diagnostics.
  4. Logging, reporting, and analyzing information.
  5. External coordination of resources, which is how the machine coordinates with external components.

Considering these areas and their implications on system management, the following will be the main technology drivers for the systems management:

  • All system management tasks, such as integrating new nodes, moving right data to the right place, and responding to security comprises, must be responsive. In other words, these tasks should be autonomous and proactive to reach proper response requirements.
  • Data movement will be constrained rather than processing time at the exascale computing. Hence, resource control and management – and the utilization logs for resources – has to change focus to communications and data movement.
  • Security management will be more complex. Variability of the system will impose building more security components and integrate them in many subsystems.
  • The effect of security policies on performance will be more significant due to expected exascale system’s complexity. Security tools should be redesigned with performance perspective in mind.

Authors are offering two R&D alternatives for the systems management. First one is to use evolutionary method and extend the terascale and petascale tools. This will result, according to the authors, a lot of inefficiencies in the exascale systems. The second alternative involves borrowing some techniques and policies from telecommunication and real-time systems, such as statistical learning.

The authors are then recommending the research agenda till year 2020. They wrote them in bullet format, which is nice to only list below:

Category 1) “Resource control and scheduling” and “External coordination of resources”

  • Need to better characterize and manage non-traditional resources such as power and I/O bandwidth
  • Determine how to manage and control communication resources – provision and control, different for HPC than WAN routing
  • Determine and model real-time aspects of Exascale system management and feedback for resource control
  • Develop techniques for dynamic provision under constant failure of components
  • Coordinated resource discovery and scheduling aligned

Category 2) “Security”

  • Fine grained authentication and authorization by function/resources
  • Security Verification for SW built from diverse components
  • Provide appropriate “Defense in depth” within systems without performance or scalability impact.
  • Develop security focused OS components in X-stack.
  • Assess and improve end-to-end data integrity.
  • Determine guidelines and tradeoffs of security and openness (e.g. grids).

Category 3) “Integration and test” and “Logging, reporting, and analyzing information”

  • Determine key elements for Exascale monitoring
  • Continue mining current and future Petascale failure data to detect patterns and improvements
  • Determine methods for continuous monitoring and testing without affecting system behavior
  • Investigate improves information filters; provide stateful filters for predicting potential incorrect behavior
  • Determine statistical and data models that accurately capture system behavior
  • Determine proactive diagnostic and testing tools

External Environment

This section is to be filled in the document. But the document is setting the scope of the external environment to refer to the essential interfaces to remote computational resources (e.g. data repositories, real time data streams, high performance networks, and computing clouds).

I will keep an eye on the newer versions of this document and update this section if contributors are chosen for this task.

Next Time

We are done finally with the systems software layer. It is complicated layer but very critical for the success of the exascale software project. This layer is tying all system components together and makes them easier to use by the next layer, the programming and execution models.

My next posting will be summarizing the Development Environment section. It will discuss the technology drivers, upcoming challenges for the exascale systems, and recommended research agenda for the components of the development environment.

This posting is part of a series summarizing the roadmap document of the Exascale Software Project:

I’m only summarizing here some sections from the Roadmap document of the International Extascale Software Project. This project has on board well-known scientists and professionals shaping the next software stack for machines with millions of cores with exascale computing power. This document is aggregating thoughts and ideas about how HPC and multi-core community should remodel the software stack. Machines with millions of cores will be a reality in few years if Moore’s law continues to work (e.g. doubling the number of cores inside microprocessors every 18 months and selling them 50% cheaper). HPC community should make serious efforts to be ready for such massively parallel machines. Authors believe that this is best achieved if we focus our efforts on understanding expected architectural developments and possible applications that will run on these platforms. The roadmap document is defining four main paths to the exascale software stack:

  1. Systems software, which includes: operating systems, I/O systems, systems management, and external environments.
  2. Development environments, which includes: programming models, frameworks, compilers, numerical libraries, and debugging tools.
  3. Applications, which includes: algorithms, data analysis and visualization, and scientific data management
  4. Cross Cutting Dimensions, which include important components, such as power management, performance optimization, and programmability.

Suggestions of changes in these paths depend on some important technology trends in HPC platforms:

  1. Technology Trends:
    1. Concurrency: processors’ concurrency will be very large and finer-grained
    2. Reliability: increased number of transistors and functional units will impose a reliability challenge
    3. Power consumption: it will increase from 10 Mega Watts these days to be close to 100 Mega Watts around 2020.
  2. Science Trends. Many research and governmental agencies specified upcoming scientific challenges that HPC can help as we cross into the exascale era, major research areas: climate, high-energy physics, nuclear physics, fusion energy sciences, nuclear energy, biology, materials science and chemistry, and national nuclear security
  3. Politico-economic trends: HPC is growing 11% every year. Main demands come from governments and large research institutions. It will continue this way since a lot of commercial entities depend on capacity rather than computing power to expand their infrastructures.


I’m interested in quickly sharing with you some key points from each of these paths. I think this document is more visionary compared to Berkeley’s former document: The Landscape of Parallel Computing Research. My next posting will be summarizing the first path of systems software.

This posting is part of a series summarizing the roadmap document of the Exascale Software Project:

I’m talking here again about the multi-core processors for massively parallel systems working on complex scientific applications. However, I’m tackling this area from a different perspective. I would like to think with you of how multi-core processors will look like in five years from now and think of these questions: What are the serious problems that these processors will suffer from (from system’s perspective)? Which of the current solutions or anticipated frameworks may help us solving these problems? I’ll be discussing only one of them here. It is very difficult to predict accuratly what technology advancements will take place in the comming five years. However, there are general trends that we can track and resonably predict their future effects.

Anyways, I mentioned before that multi-core processors will be of thousands of cores maybe even tens of thousands of cores (check the latest AMD GPGPU Radeon HD 5970). These cores will be very simple and solving the same problem but on different data chunks. This of course mandates the existance of shared resources and shared areas contain input data and be able to store results. The path from one core to the data storage and other shared resources will get more complex and involving more shared resources, such as more hirarchies of on-chip and off-chip caches, cores interconnections, I/O buses, etc. The anticipated path and hierarchies through which data will be traveling to reach sysem’s main memory or to rech core’s registers will have a very important effect on data movement latency. It is not about only having higer latency which can be hidden by many veified techniques. It is about the variance of this latency from one core to another and from one request to anothr inside the same core. Current software solutions, such as prefetching in multiple buffers depend on the fact that latency to move data from memory to all processor’s cores will be the same at run-time. However, this is not true even in current multi-core processors. For example, inside the Cell Broadband Engine, the DMA latency (or memory latency) differs from one core to another depending on its physical location inside the chip and how far it is from memory controller. This variance will be even bigger as these processors grow in number of cores and as contension increases on shared resources inside them. Such variance requires solutions that would hide memory latency dynamically at run-time inside each core based on each specific core’s data latency. Current software solutions, such as prefetching and multi-buffering are depending on constant memory latency across all cores.
Some hardware-based solutions tried to solve this problem through hyper- or multi-threading. Inside multi-core processors with multi-threaded feature, once a thread is blocked for an I/O or data movement, another thread gets active and resumes execution. Sun Microsystems through its latest UltraSparc T2 & T2 Plus, added up to four threads per core, which gives at the end large number of virtually concurrent threads on the same chip. However, there are two important drawbacks. First, if memory latency is pretty low for any reason these threads will be spending most of their time switching, which would give at the end a semi serial performance because four threads are sharing the same ALU and FP units. On the other hand, if memory latency is really high for all of the working threads inside the same core, we may end up with idle time because all of them will be waiting for data to come from system’s memory or an I/O device. Second, it this solution adds complexity to the hardware and consume space that could be used for bigger cache or even more single threaded cores.

Ok, what would be the solution then? If we could dynamically create a threading framework that can create and manage threads pretty much to hyper- or multi-threaded architectures, we may be able to solve data latency problem smartly and for massively parallel multi-core processors. As long as each core will have its own data latency, why don’t we create small software threads that would switch their context to core’s local cache instead of switching it to the system’s main memory or second level cache. The context in this case will be the core’s registers (pretty much similar to current hardware based multi-threaded architectures) and few control registers affecting the execution of the thread, such as the program counter. So whenever one of these small threads, let’s call them micro-threads, stalls for a data chunk to be copied from system’s main memory, it will go to sleep mode and another micro-thread is switched to running mode and resume execution. A very small and very fast kernel, we may call it a nano-kernel, should actively run inside these cores to schedule micro-threads and make sure that data movement latency is hidden almost completely inside each core. This idea of having micro-threads has two advantages. First, the number of micro-threads is dynamic, which means number of micro-threads depends on data movement latency. For example, in large data movement latency we may add more micro-threads per core to work longer during other micro-threads wait time for data to be ready in core’s cache. Second, context switching inside each core’s cache makes it very cheap and very fast process, i.e. few of nano seconds. Of course, context saving will consume from each core’s cache but this is already consumed by several magnitudes to implement the hardware based multi-threaded architectures. Also, this will require specific faciltities provided by the ISA. For example, manual cache management and internal interrupting facility inside each core are mandatory for this idea to work.
So, if we create nano-kernels doing optimization inside each core, we would reach new performance ceilings. It is scalable since each core has its own nano-kernel working independently and scheduling micro-threads based on resources given to the core. So, with thens of throusand of threads this solution would still work and get the most out of the expected massively parallel multi-core processors.

In my previous writing I tried to characterize multi-core processors and quickly pinpoint the major distinction features in multi-core processors. In this blog post I would like to briefly share with you some of the untapped aspects inside the Cell Broadband Engine and The GeForce GTX 295.

The Cell Broadband Engine (CBE)
The Cell microprocessor is an innovative heterogeneous multi-core processor built by IBM, Sony, and Toshiba. 400 designers worked closely together for 5 years to invent the new heart for the PlayStation3 (PS3). The PS3 was only a starting point for the CBE. IBM made the first stab and re-introduced the CBE as a highly capable microprocessor for compute intensive jobs.
I won’t repeat the explanation of the CBE architecture. You can find it at Wikipedia, IBM’s website, and many other places if you Google it. I implemented several discrete algorithms, such as graphs traversals and integers sorting, and scientific algorithms as the FFT. I also was able to get into its architectural details and build a proprietary threading model, called micro-threading. This framework hides memory latency in most workloads without engaging the developer into the lowest architectural details of the Cell processor. I also made a series of experiments to characterize the effect of it architectural properties on memory latency pattern in different workloads.
The beauty of the CBE, from my points of view, is in the great extent of control it gives you to reach the best execution time. All the other architectural properties regarding its compute capabilities, multiple level of parallelism, and its on-chip network discussed in many publications and I experienced all these great features during my PhD journey. However, I would like to note the following from some of my experiences.
Although the Element Interconnect Bus (EIB) is of high performance and very low latency, the effect of this topology over cores performance is overlooked by most research efforts. For example, from my memory latency measurement experiments I found out that memory latency differs from one core to the other depending on how far this core is from the memory controller. This does not reflect a flaw in the design, but this is a property that would change the measure of memory latency per each core.  As this topology is used more and have more cores connected to the same ring, the physical location effect will be of more importance. The question is: how would this affect my performance as long as I can use techniques such as multi-buffering and data prefetching? The answer is very simple; as long as the memory latency differs from one core to the other, you need to hide it according to each core’s latency measures. For example, in cores with relatively very high memory latency you can use more buffers to prefetch your data compare to other cores with lower memory latency. I already discussed this in my optimum micro-threading scheduling paper, please have a quick look at it to understand this issue more.
Also, in the new implementation of the CBE  can work now with larger RAM. However, this RAM is based on the DDR3 technology. The initial implementation using the XDR RAM had the limitation of a maximum memory of 1 GB. Although the new CBE implementation has a faster floating point unit and two memory controllers to keep high processor-memory bandwidth, but the memory latency is getting higher due to the high latency of the DDR3 RAM compared to the XDR implementation.

I worked also on NVIDIA GeForce GTX 295 to implement some algorithms in information retrieval. It is still a work in progress and to be published. However, from my insights of the Cell Broadband Engine, I  figured out some architectural properties that worth also sharing with you.
NVIDIA GPUs programming framework is abstracting to a great extent the architectural details of the microprocessor. From productivity point of view, this is a great feature. However, it is leaving few options for researchers and experienced developers to explore different options inside it. For example, I couldn’t find a clear way that would help me scheduling threads to processor’s cores. It is done, not sure yet, by the processor’s scheduler. It is following the same policy used by Intel’s hyper-threading and Sun’s multi-threading architecture. I’m now measuring if memory latency differs from one core to another.  My initial measures show that memory latency is almost the same across all cores. However, I’m a little bit concerned about the relatively many hierarchies built into the processor. I realize that the shared memory model mandates such hierarchy to have reasonable synchronization overhead. However, adding hierarchies to run away from this problem is not the best solution. NVIDIA is still investing in this hierarchical through their new GPU processor, Fermi.

Although the multi-core processors are providing a smart escape from physical limitations of the uni-core processors, but they need thorough architectural analysis to best utilize their resources. I think this is attainable by monitoring different execution patterns and pinpointing bottlenecks. This should make it easier build efficient programming models, run-time systems, and algorithms for multi- and many-core microprocessors. For example, inside the Cell Broadband Engine microprocessor manual cache management and the Element Interconnect Bus (EIB) formed the opportunity to build run-time systems to get best  performance while simplifying the programming model, such as micro-threading, MPI microtask, data prefetching. I think multi- and many-cores processors will evolve through a closed feedback loop, see below.

Whenever a new architecture is being introduced, developers and researchers start implementing different algorithms and applications to get the best out of it. However, performance bottlenecks pops to the surface very soon. Several efforts try to solve these bottlenecks through either tweaking implemented algorithms or building general frameworks that would identify at run-time performance degradation parameters and change them. On the other hand the loop is properly closed if microprocessor designers listen to developers notes and try to hide these bottlenecks through architectural enhancements. This I think was properly handled in the new Fermi architecture of NVIDIA’s GPGPUs. For example, Fermi has now multiple memory controllers each is handling requests of different groups. This should reduce effect of memory requests serialization, which is a serious performance bottleneck.

I believe research teams are moving now from the naive ways of speeding up multi- and many-core processors by doing the old tricks of algorithmic enhancements to digging more into the processor’s architectural features and proposing better programming models backup by run-time libraries and frameworks. This trend is blending compilers, operating systems, parallel programming, and microprocessors architecture together.