December 2009

Here I summarize some sections of the roadmap document of the International Exascale Software Project. The project brings together well-known scientists and professionals who are shaping the next software stack for machines with millions of cores and exascale computing power. The document aggregates thoughts and ideas about how the HPC and multi-core communities should remodel the software stack. Machines with millions of cores will be a reality in a few years if Moore’s law continues to hold (i.e., the number of cores inside microprocessors doubles every 18 months while they sell 50% cheaper). The HPC community should make serious efforts to be ready for such massively parallel machines. The authors believe this is best achieved by focusing on understanding the expected architectural developments and the applications likely to run on these platforms. The roadmap document defines four main paths to the exascale software stack:

  1. Systems software, which includes: operating systems, I/O systems, systems management, and external environments.
  2. Development environments, which includes: programming models, frameworks, compilers, numerical libraries, and debugging tools.
  3. Applications, which includes: algorithms, data analysis and visualization, and scientific data management.
  4. Cross-cutting dimensions, which include important components such as power management, performance optimization, and programmability.

The changes suggested along these paths depend on some important trends in HPC platforms:

  1. Technology trends:
    1. Concurrency: processor concurrency will be very large and finer-grained.
    2. Reliability: the increased number of transistors and functional units will pose a reliability challenge.
    3. Power consumption: it will increase from about 10 megawatts today to close to 100 megawatts around 2020.
  2. Science trends: many research and governmental agencies have specified upcoming scientific challenges that HPC can help address as we cross into the exascale era. The major research areas are climate, high-energy physics, nuclear physics, fusion energy sciences, nuclear energy, biology, materials science and chemistry, and national nuclear security.
  3. Politico-economic trends: HPC is growing 11% every year. The main demand comes from governments and large research institutions. It will continue this way, since many commercial entities depend on capacity rather than computing power to expand their infrastructures.


I’m interested in quickly sharing with you some key points from each of these paths. I think this document is more visionary than Berkeley’s earlier document, The Landscape of Parallel Computing Research. My next posting will summarize the first path, systems software.

This posting is part of a series summarizing the roadmap document of the Exascale Software Project.


Most probably, your knowledge of the Cell processor and your curiosity led you to this blog post.

Well, I had the same curiosity after working with the Cell processor for a while. I asked myself this question: is the DMA latency the same for all SPEs? In other words, if one SPE makes a DMA request for a data chunk of exactly the same size, will the chunk be delivered to its local store in the same time as for the other SPEs?

The short and proven answer is: NO

Each SPE has a different DMA latency because of its physical location, that is, its distance from the memory controller. There is only one memory controller in the initial Cell implementation. The physical distance from the memory controller makes a considerable difference in memory latency from one SPE to another. This latency difference gets even bigger as the DMA chunk gets bigger. For example, the nearest SPE to the memory controller retrieves 4 KB from main memory in around 2000 nanoseconds, while the farthest SPE receives the same chunk in around 420 nanoseconds.

So, what does this mean? Or should I care about this?

Well, this simply means that double or multi-buffering does not hide memory latency with the same efficiency on all SPEs. SPEs located physically near the memory controller can have almost all of the memory latency hidden, but far ones may still suffer from some latency. You can download the code from here and test it yourself if you have a Cell machine. It will not work on the simulator, since the simulator does not model DMA latency, even in cycle mode.

If you are using double buffering, you are still getting better performance than with a single buffer. However, you are not getting the best possible performance. There is still room for improvement.
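To make the effect concrete, here is a rough timing model of single versus double buffering. It is only a sketch under stated assumptions: the function names, the 300 ns compute time, and the 200 ns and 400 ns DMA latencies are hypothetical values chosen for illustration, not measurements from a Cell machine.

```python
def single_buffer_time(chunks, dma_ns, compute_ns):
    """Total time when each chunk is fetched and then processed, with no overlap."""
    return chunks * (dma_ns + compute_ns)


def double_buffer_time(chunks, dma_ns, compute_ns):
    """Total time when the fetch of chunk i+1 overlaps the processing of chunk i."""
    if chunks == 0:
        return 0
    # The first fetch has nothing to overlap with, and the last compute step
    # has no fetch left to hide; every step in between costs the slower of the two.
    return dma_ns + (chunks - 1) * max(dma_ns, compute_ns) + compute_ns


# Hypothetical figures: two SPEs doing identical work with different DMA latency.
CHUNKS = 1000
COMPUTE_NS = 300
for name, dma_ns in [("near SPE", 200), ("far SPE", 400)]:
    single = single_buffer_time(CHUNKS, dma_ns, COMPUTE_NS)
    double = double_buffer_time(CHUNKS, dma_ns, COMPUTE_NS)
    exposed = double - CHUNKS * COMPUTE_NS  # time not covered by computation
    print(f"{name}: single={single} ns, double={double} ns, exposed={exposed} ns")
```

In this model, double buffering helps both SPEs, but once the DMA latency exceeds the compute time per chunk, the difference can no longer be hidden and shows up on every iteration.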

If you have the new PowerXCell 8i, the picture might be different, since there are two memory controllers on that Cell implementation. Please share your numbers with us if you have one. You can find my measurements here.

Here is my attempt to gather all the upcoming events related to High Performance Computing (HPC) and multi-core technologies. Please feel free to tell me about more related events so I can share them with all readers. I will be updating this page as I find more upcoming events. Each event is linked to its announcement article or company post.



| Event | CFP Date | Notification Date | Event Date | Location |
|---|---|---|---|---|
| HiPC 2010 | | | December 2010 | |
| SC 2010 | | | November 13-19, 2010 | New Orleans, LA |
| IPDPS 2010 | September 21, 2009 | December 7, 2009 | April 19-23, 2010 | Atlanta, GA |
| PARA 2010 | September 1, 2009 | March 1, 2010 | June 6-9, 2010 | Reykjavík, Iceland |
| HPCS 2010 | January 15, 2010 | March 15, 2010 | June 28 – July 2, 2010 | Caen, France |
| ICCS 2010 | January 3, 2010 | | May 31 – June 2, 2010 | University of Amsterdam, The Netherlands |
| HotPar ’10 | January 24, 2010 | Early March 2010 | June 14–15, 2010 | Berkeley, CA |
| HPCC 2010 | March 31, 2010 | May 15, 2010 | September 1-3, 2010 | Melbourne, Australia |
| | January 22, 2010 | March 30, 2010 | June 21-25, 2010 | Chicago, Illinois |
| ICPP 2010 | February 24, 2010 | May 25, 2010 | September 13-16, 2010 | San Diego, CA |
| ISCCA 2010 | October 31, 2009 | December 7, 2009 | March 23-26, 2010 | Fukuoka, Japan |
| ISCA 2010 | November 9, 2009 | | June 19-23, 2010 | Saint-Malo, France |
| | September 18, 2009 | December 4, 2009 | April 25-27, 2010 | White Plains, NY |
| SIAM Conference on Parallel Processing for Scientific Computing | September 22, 2009 | | February 24-26, 2010 | Seattle, Washington |






| Workshop | Date | Notes |
|---|---|---|
| First Yearly Workshop of the Hybrid Multicore Consortium (HMC) | January 19-22 | The goal of HMC is to address the migration of existing applications to accelerator-based systems and thereby maximize the investments in these systems. The main focus is on identifying obstacles to making emerging, large-scale systems based on accelerator technologies production-ready for high-end scientific calculations. |
| Summer School 2010 Workshops – Workshop I: Scaling to Petascale | July 6–9, 2010 | Provided by The Virtual School of Computational Science and Engineering |
| Summer School 2010 Workshops – Workshop II: Many-Core Processors | August 2–6, 2010 | |
| Summer School 2010 Workshops – Workshop III: Big Data for Science | July 26–30, 2010 | |



HPC Related Products Releases




| Product | Expected Release | Notes |
|---|---|---|
| NVIDIA New Fermi Processor | Around March | This is an estimate based on some online news. |
| ATI Radeon HD 5970 | January 2010 | Although this product has been manufactured and tested, it will not be available to the public until the beginning of the year. |
| IBM Power 7 | Q3 2010 | |
| AMD Magny-Cours with 16 Cores | Q3 2010 | This date is based on rumors from the Internet. No official date yet, to the best of my knowledge. |
| Intel Core i9 (Gulftown) with 6 cores | Q1 2010 | |
| Intel Nehalem-EX with 8 Cores | Q3 2010 | |
| Sun Rainbow Falls | Q3 2010 | |



Supercomputers to Be in Full or Partial Service

| System | Expected Service Date | Notes |
|---|---|---|
| | Q3 of 2010 | Hopper, named in honor of American computer scientist Grace Hopper, is arriving in two phases. The Phase I system is a Cray XT5 with 664 compute nodes, each containing two 2.4 GHz AMD Opteron quad-core processors (5312 total cores). The second phase, arriving in 2010, will be combined with an upgraded Phase I to create NERSC’s first petaflop system, with over 150,000 compute cores. |
| The Blue Waters project | Q4 2010 or Q1 2011 | A 10 petaFLOPS supercomputer based on the IBM Power7 architecture. |
| | Q1 2011 | The system will be based on Fujitsu’s upcoming Sparc64 VIIIfx processor, which has eight processor cores and will be an update to the four-core Sparc64 VII chip that Fujitsu released two years ago. It will initially offer 10 petaFLOPS of computing power! |

Multi-core processors are supposed to give us better performance; that is the sole reason they exist. But the question is: which one is best for getting the job done?

Well, the definite answer to this question is: it depends!

In my experience with different architectures, the best payback comes from matching the processor to what you really want from it. I’m not talking here about the end-user experience. I’m approaching this question from the developer’s perspective, mainly for compute-intensive applications such as scientific applications, data-intensive applications, and discrete algorithms with considerably complex computations.

So, if the answer is "it depends", I think it depends on:

  • Which problems will your machine be working on?
  • How fast do you want the code to be written?
  • How much would you like to pay for the hardware and software?

Of course, life is too complex for what I write here to be the only guideline for making up your mind. I’m only pinpointing the important factors for your decision. Please feel free to ask me questions at the end of this post if you feel that I’m missing other important factors. I will also write more related details in my upcoming posts. I’m assuming that you are new to multi-core programming and would like to get started and tap into new domains of programming.

Which problems will your machine be working on?
If you intend to solve problems that process huge data sets in parallel, I would recommend simple multi-core processors such as GPGPUs or the Cell Broadband Engine. For example, one core inside the Cell Broadband Engine, the Synergistic Processing Element (SPE), can give you around 25.6 GFLOPS, whereas one Intel Xeon core can give you around 9.6 GFLOPS. Inside the Cell processor you have 8 such cores, totaling around 205 GFLOPS. If you consider GPGPUs, you can get up to 1700 GFLOPS from a single GPU card, compared to a total of 96 GFLOPS for the latest Core 2 Quad general-purpose processor from Intel. This category of applications and algorithms includes, but is not limited to: string matching and searching, sorting large data sets, FFT computations, data visualization, network traffic scanning, and artificial intelligence.
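Peak figures like these come from simple arithmetic: clock rate times SIMD lanes times floating-point operations per lane per cycle times the number of cores. The sketch below assumes a 3.2 GHz SPE clock, 4 single-precision SIMD lanes, and a fused multiply-add counted as 2 operations; the function name is my own, and the result is a theoretical peak, not a measured number.

```python
def peak_gflops(clock_ghz, simd_lanes, flops_per_lane_per_cycle, cores=1):
    """Theoretical single-precision peak in GFLOPS for a set of identical cores."""
    return clock_ghz * simd_lanes * flops_per_lane_per_cycle * cores


# Assumed SPE parameters: 3.2 GHz, 4-wide SIMD, multiply-add = 2 flops per lane.
one_spe = peak_gflops(3.2, 4, 2)
eight_spes = peak_gflops(3.2, 4, 2, cores=8)
print(f"one SPE: {one_spe:.1f} GFLOPS, eight SPEs: {eight_spes:.1f} GFLOPS")
```

Under these assumptions the result lands close to the per-SPE and eight-SPE figures quoted above. Real applications rarely reach the theoretical peak, so treat it as an upper bound when comparing processors.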

However, if you are developing applications whose parts can work independently and perform a lot of I/O, you should select general-purpose multi-core processors such as the Power7, Intel’s Quad-Core, AMD Phenom I & II, or Sun’s SPARC T1 & T2 processors. In some cases these processors provide up to 32 concurrently running threads, combining multi-core and multi-threaded architectures. For example, a web application that distributes its workload over three tiers (the web server, the business logic, and the database layer) would perform better on these architectures. Each of these layers is I/O-oriented and works as an independent process. In that case each core can handle a layer efficiently without affecting the execution of the other cores. In addition, synchronization among them is handled through the inter-process communication APIs provided by the operating system.

How fast do you want to start coding?
When it comes to the ease or difficulty of multi-core programming, there are three levels of difficulty. Each level trades off difficulty against performance. I’ll start with the easiest and most common one.
Conventional general-purpose multi-core processors are the easiest to program and to get parallel applications running on. You can still use the old programming models, PThreads or OpenMP, to write multi-threaded applications. You don’t have to study the underlying hardware; you only have to refresh your old parallel programming knowledge. In addition, if you are developing coarse-grained parallel applications, you can run them as independent processes and use the old synchronization APIs provided by the operating system. What this programming model trades off is: (1) limited scalability: you cannot easily create and manage a lot of threads (I’m talking about hundreds of threads) because of hardware and programming model limitations; (2) little room to maneuver and do architecture-specific optimization: cache management and thread scheduling are done on your behalf, which may not give the best performance for your application. So you can spend a few days reviewing your old knowledge and you will be ready to produce parallel applications with these models.

The next level is the new programming models recently built on top of GPGPUs, such as CUDA and OpenCL. These models combine the conventional serial programming model with the new kernel-based parallel programming paradigm. They allow you to write the program and the input initialization code serially, as before, and then offload the compute-intensive part to the GPU card. The kernel function gets executed by many threads, each running in its own context. You should use the synchronization primitives provided by the framework for these offloaded threads. Of course, these platforms are meant for applications with fine-grained parallelism. The main advantages of this model are: (1) automatic management of many threads, hundreds of them; the framework creates, manages, and destroys the thread contexts transparently; (2) hiding of some of the architectural complexity; for example, you don’t have to manage cache or pipeline optimizations. The tradeoffs of these models are: (1) ease of programming: it is more difficult to program with these models, since your application or algorithm must be aware of the architectural heterogeneity. You will also have to understand some of the architectural aspects of the GPU card, such as the memory hierarchy; however, these aspects are not as complex as in the third model; (2) some performance is lost to the programming model abstractions. For example, you can create more threads than there are available cores, which may delay the overall execution if these threads synchronize through one or a few shared variables. I recommend this model if you have highly parallel problems and don’t want to focus on most of the architectural aspects. You may need a week or two to understand the model and the architecture before you start programming.
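The shape of the kernel-based model can be imitated on an ordinary CPU. The sketch below is not CUDA or OpenCL: it only emulates a one-dimensional kernel launch with a Python thread pool, and the names `saxpy_kernel` and `launch` are my own, hypothetical ones. It shows serial host code that initializes the input and then hands the per-element work to many logical threads.

```python
from concurrent.futures import ThreadPoolExecutor


def saxpy_kernel(thread_id, a, x, y, out):
    """Kernel body: each logical thread handles exactly one element."""
    out[thread_id] = a * x[thread_id] + y[thread_id]


def launch(kernel, n_threads, *args, workers=4):
    """Run `kernel` once per logical thread id, similar to a 1-D kernel launch."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() waits for completion and re-raises any exception from a thread.
        list(pool.map(lambda tid: kernel(tid, *args), range(n_threads)))


# Serial host code: initialize the input, then offload the parallel part.
n = 8
x = [float(i) for i in range(n)]
y = [1.0] * n
out = [0.0] * n
launch(saxpy_kernel, n, 2.0, x, y, out)
print(out)
```

Note that there are more logical threads (8) than workers (4). The runtime maps one onto the other for you, which is the convenience, and part of the abstraction cost, described above.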

The third level is challenging, but it gives you as a developer or researcher many points of distinction. I see the Cell Broadband Engine as the only processor in this category. You can still program your parallel application using PThreads or OpenMP. However, this is only to create and kill the threads. To synchronize and properly implement your algorithm, you will have to understand all the architectural aspects of the microprocessor, such as the DMA transfers used to manage data in each SPE’s local store. The main advantage of this model is great flexibility in utilizing the available resources to get the best performance. It is worth mentioning here that one Cell processor with only 9 cores provides around 50% of the peak performance of the NVIDIA GTS 285 equipped with 120 cores. The major tradeoff in this model is the difficulty of programming. You may need a few weeks to study the architecture and the programming model before you can start writing your code. You will also have to spend even more time optimizing your application to get the best possible performance.


How much would you like to pay for the hardware and software?
You cannot isolate the price from the performance you can get out of your processor. We can use a simple ratio: dollars per GFLOP. The table below tells you more about this.



| Processor | Theoretical Peak GFLOPs | Cost of 1 GFLOP |
|---|---|---|
| Intel Quad Core Q9650 | | |
| AMD Phenom II X4 965 | | |
| | Not Yet Released | |
| GeForce GTX 295 | | |
| ATI Radeon HD 5870 | | |
| Processor (Inside PS3) | | |
| Processor (One Blade) | | |

Don’t forget that a GPGPU must be hosted by a full machine, which adds to the total cost per GFLOP.
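The dollars-per-GFLOP ratio, including the host machine, can be sketched as follows. All prices here are hypothetical placeholders, not real quotes; only the peak figures (96 and 1700 GFLOPS) are taken from earlier in this post, and the function name is my own.

```python
def cost_per_gflop(device_price, peak_gflops, host_price=0.0):
    """Dollars per theoretical peak GFLOP, optionally including the host machine."""
    if peak_gflops <= 0:
        raise ValueError("peak_gflops must be positive")
    return (device_price + host_price) / peak_gflops


# Hypothetical prices in dollars, for illustration only.
cpu = cost_per_gflop(device_price=300.0, peak_gflops=96.0)
gpu_alone = cost_per_gflop(device_price=400.0, peak_gflops=1700.0)
gpu_hosted = cost_per_gflop(device_price=400.0, peak_gflops=1700.0, host_price=600.0)

print(f"CPU: ${cpu:.2f} per GFLOP")
print(f"GPU card alone: ${gpu_alone:.2f} per GFLOP")
print(f"GPU card plus host: ${gpu_hosted:.2f} per GFLOP")
```

With these made-up prices the host machine more than doubles the GPU's cost per GFLOP, so it is worth including when you compare against a processor that needs no separate host.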

Can I use combinations of different architectures?
Yes, and you should be creative about it. For example, you can build your web application on top of a general-purpose multi-core processor and use a specialized simple multi-core architecture for parts of the workload. You can attach a GPU card to your machine and use it to search large datasets, or to sort and filter large search results. In this case you are combining heterogeneous architectures to get the best out of each. You can also do this at the network level by building a hybrid cluster in which each node handles a different part of the workload. For example, nodes that do I/O-intensive work should have powerful general-purpose multi-core processors with multi-threaded architectures. These processors are very good at scheduling their threads and processes for I/O-intensive applications. And you can allocate nodes with excellent processing power to data filtering, sorting, or any other compute-intensive task in your application.
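As a rough illustration of the hybrid cluster idea, the sketch below routes each task to the node pool that suits its workload. The task names, the "io" and "compute" labels, and the pool names are all hypothetical; a real scheduler would also consider load, data locality, and transfer costs.

```python
# Hypothetical mapping from workload kind to node pool in a hybrid cluster.
POOL_FOR_KIND = {
    "io": "general-purpose nodes",
    "compute": "accelerator nodes",
}


def assign(tasks):
    """Group task names by the node pool suited to their workload kind."""
    plan = {pool: [] for pool in POOL_FOR_KIND.values()}
    for name, kind in tasks:
        plan[POOL_FOR_KIND[kind]].append(name)
    return plan


plan = assign([
    ("serve web requests", "io"),
    ("sort search results", "compute"),
    ("query database", "io"),
    ("filter large dataset", "compute"),
])
for pool, names in plan.items():
    print(f"{pool}: {names}")
```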