January 2010



This is the final post summarizing the fourth section of the IESP Roadmap document. This section discusses important crosscutting dimensions of the upcoming exascale systems, areas that concern all users and engineers of exascale systems. It focuses on: (1) Resilience, (2) Power Management, (3) Performance Optimization, and (4) Programmability. Although these concerns are critical, it is very difficult to study them independently of all other components of the exascale systems. I think they should be integral parts of each software layer in the next generation of HPC systems. They are already considered in current and past HPC systems, but only at a very limited scale. For example, power management is considered only at the OS layer and performance optimization only at the application level.

Resilience

Original contributors of this subsection are: Franck Cappello (INRIA, FR), Al Geist (ORNL), Sudip Dosanjh (SNL), Marc Snir (UIUC), Bill Gropp (UIUC), Sanjay Kale (UIUC), Bill Kramer (NCSA), Satoshi Matsuoka (TITECH), David Skinner (NERSC)

Before summarizing this section, I looked up some of the authors and found an interesting white paper by the same authors (except Satoshi Matsuoka) discussing software resilience in more depth. I highly recommend reading it. Its title is Toward Exascale Resilience, and you can find it here.

The main upcoming challenge in building resilient systems for the era of exascale computing is the inapplicability of traditional checkpoint/restart techniques. Checkpointing the states of millions of threads would consume considerable time, space, and energy. New resilience techniques are required to minimize the overheads of resilience. Given this general picture, the authors believe the following will be the main drivers of R&D in resilient exascale computing:

  • The increased number of components (hardware and software) will increase the likelihood of failures, even in short executions.
  • Silent soft errors will become significant and raise the issues of result and end-to-end data correctness.
  • New storage and memory technologies, such as SSDs and phase-change memory, will bring great opportunities for faster and more efficient state management and checkpointing.

I recommend reading the authors’ original white paper to learn more about these challenges.
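To make the checkpoint/restart discussion concrete, here is a minimal sketch of the traditional application-level approach that the roadmap says will not scale: periodically dump the whole state to stable storage and resume from the last dump after a failure. The solver, the state array, and the file name are all hypothetical; the point is only the pattern and its obvious I/O cost at millions of threads.

```c
/* Minimal sketch of traditional application-level checkpoint/restart,
 * assuming a hypothetical solver whose whole state fits in one array.
 * Real HPC checkpointing is far more involved; this only illustrates
 * the basic pattern the roadmap says will not scale to exascale. */
#include <stdio.h>
#include <stdlib.h>

#define N 1000000
#define CKPT_FILE "state.ckpt"

/* Write the iteration counter and state array to disk. */
static void checkpoint(int iter, const double *state) {
    FILE *f = fopen(CKPT_FILE, "wb");
    if (!f) { perror("checkpoint"); return; }
    fwrite(&iter, sizeof iter, 1, f);
    fwrite(state, sizeof *state, N, f);
    fclose(f);
}

/* Try to restore a previous checkpoint; return the iteration to resume
 * from, or 0 if no usable checkpoint exists. */
static int restart(double *state) {
    FILE *f = fopen(CKPT_FILE, "rb");
    int iter = 0;
    if (!f) return 0;
    if (fread(&iter, sizeof iter, 1, f) != 1) iter = 0;
    else if (fread(state, sizeof *state, N, f) != (size_t)N) iter = 0;
    fclose(f);
    return iter;
}

int main(void) {
    double *state = calloc(N, sizeof *state);
    int start = restart(state);          /* resume after a failure */
    for (int iter = start; iter < 100; iter++) {
        for (int i = 0; i < N; i++)      /* stand-in for real work */
            state[i] += 1.0;
        if (iter % 10 == 0)
            checkpoint(iter, state);     /* periodic global checkpoint */
    }
    free(state);
    return 0;
}
```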

The authors also did a quick gap analysis to pinpoint, in more detail, the areas of fault tolerance that need rethinking. Among these points:

  • The most common programming model, MPI, does not offer a paradigm for resilient programming (see the sketch after this list).
  • Most present applications and system software are neither fault tolerant nor fault aware, and are not designed to confine errors/faults.
  • Software layers are lacking communication and coordination to handle faults at different levels inside the system.
  • Deeper analysis of the root causes of different faults is mandatory to find efficient solutions.
  • Efficient verifications of global results from long executions are missing as well.
  • Standardized metrics for measuring and comparing the resilience of different applications are still missing.
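As a small illustration of the MPI gap mentioned above: about all a standard MPI-2 application can do is switch from the default abort-on-error behavior to returned error codes, as in the sketch below. There is still no portable way to repair the communicator and continue after a process failure, which is exactly the authors’ point. This is a generic example, not code from the roadmap.

```c
/* The application can detect that something went wrong, but MPI-2
 * gives it no standard way to recover. Assumes any MPI-2 library. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    /* Ask MPI to return errors instead of aborting the whole job. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    int rank, err;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local = rank, sum = 0.0;
    err = MPI_Allreduce(&local, &sum, 1, MPI_DOUBLE, MPI_SUM,
                        MPI_COMM_WORLD);
    if (err != MPI_SUCCESS) {
        /* We can see the failure, but there is no portable way to
         * exclude the dead rank and carry on with the survivors. */
        fprintf(stderr, "rank %d: collective failed (error %d)\n",
                rank, err);
        MPI_Abort(MPI_COMM_WORLD, err);
    }

    MPI_Finalize();
    return 0;
}
```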

The authors see many possibilities and a lot of complexity on the way to resilient exascale systems. However, they conclude that research should focus on two main threads:

  • Extend the applicability of rollback toward more local recovery.
  • Fault avoidance and fault oblivious software to limit the recovery from rollback.

Power Management

Original contributors of this subsection are: John Shalf (LBNL), Satoshi Matsuoka (TITECH, JP)

Power management for exascale systems means keeping the best attainable performance with minimum power consumption. This includes allocating power to the system components actively involved in application or algorithm execution. According to the authors, existing power management infrastructure has been derived from consumer electronic devices and fundamentally never had large-scale systems in mind. A cross-cutting power management infrastructure is mandatory; its absence will force a reduction in the scale and feasibility of exascale systems. For large HPC systems, power is part of the total cost of ownership, and it will be a critical part of exascale systems management. Accordingly, the authors propose two alternative R&D strategies:

  • Power down components when they are underutilized. For example, the OS can reduce the frequency and operating voltage of a hardware component when it has not been used for a relatively long time (a rough sketch of this idea follows this list).
  • Explicitly manage data movement, which simply means avoiding unnecessary data movement. This should reduce power consumption in networks, hard disks, memory, etc.
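As a rough illustration of the first strategy, the sketch below flips an (assumed idle) core’s cpufreq governor through the standard Linux sysfs interface. It is a single-node, per-core knob that requires write permission on sysfs; exascale power management would need much richer, cross-layer interfaces than this.

```c
/* Drop an idle core to its "powersave" cpufreq governor, assuming the
 * standard Linux sysfs interface is present and writable. Which core
 * is idle (core 3 here) is an assumption made for the example. */
#include <stdio.h>

static int set_governor(int cpu, const char *governor) {
    char path[128];
    snprintf(path, sizeof path,
             "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_governor",
             cpu);
    FILE *f = fopen(path, "w");
    if (!f) { perror(path); return -1; }
    fprintf(f, "%s\n", governor);
    fclose(f);
    return 0;
}

int main(void) {
    /* Pretend the runtime decided core 3 will be idle for a while. */
    if (set_governor(3, "powersave") == 0)
        printf("core 3 switched to powersave\n");
    /* ... later, before the core is needed again ... */
    set_governor(3, "performance");
    return 0;
}
```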

The authors suggest the following main research areas for effective power management inside exascale systems:

  • OS-based power management. The authors believe two changes should be considered: (1) fair management of shared resources among hundreds or thousands of processors on the same machine, and (2) the ability to manage power levels for heterogeneous architectures inside the same machine, such as GPGPUs.
  • System-Scale Resource Management. Standard interfaces need to be developed to allow millions of cores to work in complete synchrony to implement effective power management policies.
  • Algorithms. Power-aware algorithms are simply those that reduce the communication overhead per FLOP. Libraries should be developed to articulate the tradeoffs between communication, power, and FLOPs.
  • Libraries. According to the authors, library designers need to use their domain-specific knowledge of the algorithm to provide power management and policy hints to the power management infrastructure.
  • Compilers. Compilers should make it easier to program for power management by automatically instrumenting code for it.
  • Applications. Applications should give power-aware systems and libraries hints about their power-related policies for the best power optimization (a hypothetical sketch of such a hint interface follows this list).
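The sketch below is purely hypothetical and only illustrates the kind of library/application hint interface the authors call for: the library announces which phase it is entering so a power manager could gate clocks or throttle links. The pm_hint_* functions do not exist in any real runtime; they are stubbed out here just so the example runs.

```c
/* Hypothetical power-hint interface; the stubs only log the hints. */
#include <stdio.h>
#include <stddef.h>

typedef enum {
    PM_PHASE_COMPUTE_BOUND,   /* keep cores at full frequency       */
    PM_PHASE_MEMORY_BOUND,    /* cores can be slowed, memory cannot */
    PM_PHASE_COMM_BOUND,      /* cores mostly idle, network active  */
    PM_PHASE_IO_BOUND
} pm_phase_t;

/* Stubs standing in for an imaginary power-management runtime. */
static void pm_hint_phase(pm_phase_t phase) {
    printf("power hint: entering phase %d\n", (int)phase);
}
static void pm_hint_duration_us(size_t expected_us) {
    printf("power hint: phase expected to last %zu us\n", expected_us);
}

/* How a numerical library might wrap a communication-heavy step. */
static void library_halo_exchange(void) {
    pm_hint_phase(PM_PHASE_COMM_BOUND);    /* cores mostly idle  */
    pm_hint_duration_us(500);
    /* ... exchange boundary data with neighbours ... */
    pm_hint_phase(PM_PHASE_COMPUTE_BOUND); /* back to full speed */
}

int main(void) {
    library_halo_exchange();
    return 0;
}
```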

Given these possible research areas spanning the whole software stack, the authors believe the following are the key metrics for effectively managing the power consumption of exascale systems:

  • Performance. The ability to predict execution patterns inside applications would help reduce power consumption while attaining the best possible performance.
  • Programmability. Application developers are not expected to do power management explicitly inside their applications. Coordination among all layers of the software stack should be possible for power management.
  • Composability. Power management components built by different teams should be able to work in harmony when it comes to power management.
  • Scalability, which requires integration of power management information for system wide power management policies.

Performance Optimization

Original contributors of this subsection are: Bernd Mohr (Juelich, DE), Adolfy Hoisie (LANL), Matthias Mueller (TU Dresden, DE), Wolfgang Nagel (Dresden, DE), David Skinner (LBL), Jeffrey Vetter (ORNL)

This is one of my favorite subsections. The expected increase in hardware and software stack complexity makes performance optimization a very complex task. Having millions or billions of threads working on the same problem requires different ways to measure and optimize performance. The authors believe the following areas are important for performance optimization in exascale systems: statistical profiling, automatic or automated analysis, advanced filtering techniques, online monitoring, clustering, and data mining. They also believe that self-monitoring and self-tuning frameworks, middleware, and runtime schedulers, especially at the node level, are necessary. How we capture a system’s performance under constraints of power and reliability needs to change radically. Aggregating and analyzing performance measurements while the system is running may introduce significant overhead if the new tools are not properly designed. The authors believe the complexity of exascale systems puts performance optimization, in many configurations, beyond humans’ manual ability to monitor and optimize, and they see auto-tuning as an important technique for performance optimization. Hence, the authors believe research in performance optimization should be directed to these areas:

  • Support for modeling, measurement, and analysis of heterogeneous hardware systems.
  • Support for modeling, measurement and analysis of hybrid programming models (mixing MPI, PGAS, OpenMP and other threading models, accelerator interfaces).
  • Automated / automatic diagnosis / autotuning (a toy auto-tuning sketch follows this list).
  • Reliable and accurate performance analysis in the presence of noise, system adaptation, and faults, which requires the inclusion of appropriate statistical descriptions.
  • Performance optimization for metrics other than time (e.g., power).
  • Programming models should be designed with performance analysis in mind. Software and runtime systems must expose their model of execution and adaptation, and its corresponding performance through a (standardized) control mechanism in the runtime system.
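Here is the toy auto-tuning sketch referenced above: time a blocked matrix multiply for a handful of block sizes and keep the fastest. Real auto-tuners such as ATLAS or FFTW search far larger parameter spaces, but the principle is the same: let the machine, not the human, pick the parameters. The matrix size and candidate block sizes below are arbitrary.

```c
/* Toy auto-tuner: pick the fastest block size for a blocked matmul. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define N 256

static void matmul_blocked(const double *a, const double *b, double *c,
                           int n, int bs) {
    memset(c, 0, (size_t)n * n * sizeof *c);
    for (int ii = 0; ii < n; ii += bs)
        for (int kk = 0; kk < n; kk += bs)
            for (int jj = 0; jj < n; jj += bs)
                for (int i = ii; i < ii + bs && i < n; i++)
                    for (int k = kk; k < kk + bs && k < n; k++)
                        for (int j = jj; j < jj + bs && j < n; j++)
                            c[i * n + j] += a[i * n + k] * b[k * n + j];
}

int main(void) {
    double *a = malloc((size_t)N * N * sizeof *a);
    double *b = malloc((size_t)N * N * sizeof *b);
    double *c = malloc((size_t)N * N * sizeof *c);
    for (int i = 0; i < N * N; i++) { a[i] = 1.0; b[i] = 2.0; }

    int candidates[] = { 8, 16, 32, 64, 128 };
    int best_bs = candidates[0];
    double best_t = 1e30;

    /* Try each candidate, time it, remember the fastest. */
    for (size_t i = 0; i < sizeof candidates / sizeof *candidates; i++) {
        clock_t t0 = clock();
        matmul_blocked(a, b, c, N, candidates[i]);
        double t = (double)(clock() - t0) / CLOCKS_PER_SEC;
        printf("block size %3d: %.4f s\n", candidates[i], t);
        if (t < best_t) { best_t = t; best_bs = candidates[i]; }
    }
    printf("selected block size: %d\n", best_bs);

    free(a); free(b); free(c);
    return 0;
}
```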

Programmability

Original contributors of this subsection are: Thomas Sterling (LSU), Hiroshi Nakashima (Kyoto U., JP)

Programmability of exascale systems is another critical factor for their success. It is quite difficult to benchmark it and to find a baseline against which to set and measure our objectives in this area. However, the authors identified the following basic challenges to systems’ programmability:

  • Massive parallelism through millions or billions of concurrent collaborating threads.
  • Huge number of distributed resources and difficulty of allocation and locality management.
  • Latency hiding by overlapping computation with communication (a small MPI sketch follows this list).
  • Hardware Idiosyncrasies. Different models will emerge with significant differences in ISA, memory hierarchy, etc.
  • Portability. Application programs must be portable across machine types, machine scales, and machine generations.
  • Synchronization Bottlenecks of millions of threads trying to synchronize control or data access.
  • Data structures representation and distribution.

If you have read the other postings summarizing the rest of this document, you will realize how complicated programmability is. It cuts across all layers of the software stack, starting from the ISA and operating systems and ending with applications. Going through the authors’ suggested research agenda, I found that they recommend all the R&D directions proposed by the rest of the authors for their corresponding stack layers/components. I recommend reading the other related postings to appreciate the challenges waiting for researchers to make exascale systems easier to program and utilize.

This posting is part of a series summarizing the roadmap document of the Exascale Software Project.


I was teaching a microprocessor design internals course last fall semester. At the beginning I planned to give my students the opportunity to design a toy microprocessor and optimize important performance factors, such as pipelining, branch prediction, instruction issue, etc. However, I decided to link them to industry and give them a project to implement a simple parallel algorithm on the Cell Broadband Engine and monitor critical performance factors. In the end, my objective was to teach them, through a real processor, the possible design tradeoffs and their effect on the performance and general effectiveness of the microprocessor. So I had 21 teams of 4 or 5 students working on different discrete algorithms, such as sorting, prime checking, matrix multiplication, and Fibonacci calculations. I asked them to submit a report and include at the end possible architectural improvements to boost the Cell processor’s performance, based on their project experience. I found some interesting conclusions that are worth sharing here. I reworded some of these suggestions and added some details since they are extracted from a different context.

  • The instruction fetch unit inside the SPEs may suffer from starvation when there are many DMA requests to be served, because of the high priority assigned to DMA requests inside the SPE. IBM suggests balancing your source code and including the IFETCH instruction to give the instruction fetch unit time to fetch more instructions. Some students suggested a separate instruction cache; this would make instruction fetching independent of DMA requests and register load/store instructions, which should solve this problem and avoid some of the coding complexity of programming the Cell. Also, for most applications written on the Cell the text size is relatively small, so a 64KB instruction cache built into the next generation of the Cell processor may boost performance and guarantee smooth instruction execution most of the time.
  • Many of the vector intrinsics were for vectors of floats, while many of the operations the students needed were on vectors of integers. They had to typecast to floats before using many of the vector operations, which of course may give inaccurate answers and consumes more time.
  • Of course, the most commonly requested improvement was increasing the LS (local store) size inside each SPE. The main reason, for some students, was to fit more buffers and make better use of the multi-buffering technique for better performance in the end (a rough double-buffering sketch follows this list).
  • Other students went wild and suggested changing the priorities of the DMA, IFETCH, and MEM operations within the SPEs. Instead of DMA > MEM > IFETCH, they suggested inverting the order to avoid starving the instruction fetch unit.
  • Another suggestion worth mentioning is to create a memory allocation function that would guarantee allocation of large data structures across different memory banks, which would reduce DMA latency. For example, if we need a large array and each range will be accessed by a different SPE, we can spread this array over different memory banks to avoid hotspots in memory while the SPEs are executing. This is already done in IBM’s FFT implementation on the Cell processor.

Of course, I filtered out some suggestions that are common sense to any programmer, such as avoiding the 16-byte memory alignment requirement. I was impressed by the students’ ability to understand and pinpoint some serious problems inside the CBE in less than six weeks.

CSEN 702 Class: Thanks!
