January 2010

This is the final post summarizing the fourth section of the IESP Roadmap document. It is discussing some important crosscutting dimensions of the upcoming exascale systems. It discusses areas of a concern of all users and engineers of exascale system. This section focuses on: (1) Resilience, (2) Power Management, (3) Performance Optimization, and (4) Programmability. Although they are critical it is very difficult to study them independently from all other components of the exascale systems. I think they should be integral parts of each software layer in next generation of HPC systems. They are already thought of in current and past HPC systems, but they are done at very limited scale. For example, power management is considered only at the OS layer and performance optimization at the application level.


Original contributors of this subsection are: Franck Cappello (INRIA, FR), Al Geist (ORNL) Sudip Dosanjh (SNL), Marc Snir (UIUC), Bill Gropp (UIUC), Sanjay Kale (UIUC), Bill Kramer (NCSA), Satoshi Matsuoka (TITECH), David Skinner (NERSC)

Before summarizing this section I followed some of the authors and found out an interesting white paper by same authors, except Satoshi Matsuoka, discussing software resilience in more depth. I highly recommend you to read it. Its title is: Toward Exascale Resilience, and you can find it here.

The main upcoming challenge in building resilient systems for the era of exascale computing is the inapplicability of traditional checkpoint/restart techniques. Having millions of threads would consume considerable time, space, and emerging to checkpoint their states. New resilience techniques are required to minimize overheads of resilience. Given this general picture, authors believe the following would be the main drivers of R&D in resilient exascale computing:

  • The increased number of components (hardware and software components) will increase the likelihood of having failures even in short execution tasks.
  • Silent soft errors will become significant and raise the issues of result and end-to-end data correctness.
  • New storage and memory technologies, such as the SSD and phase change memory, will bring with them great opportunities for faster and more efficient state management and check-pointing.

I recommend you to read the authors’ original white paper to read more about these challenges.

Authors also did a quick gap analysis to quickly pinpoint in more details areas of fault-tolerance that need rethinking. Among these points:

  • The most common programming model, MPI, does not offer a paradigm for resilient programming.
  • Most of the present applications and system software are not fault tolerant nor fault aware and are not designed to confine errors/faults.
  • Software layers are lacking communication and coordination to handle faults at different levels inside the system.
  • Deeper analysis of the root causes of different faults is mandatory to find efficient solutions.
  • Efficient verifications of global results from long executions are missing as well.
  • Standardized matrices to measure and compare resilience of different applications against are missing so far.

Authors see many possibilities and a lot of complexities to reach resilient exascale systems. However, they conclude of focusing research in two main threads:

  • Extend the applicability of rollback toward more local recovery.
  • Fault avoidance and fault oblivious software to limit the recovery from rollback.

Power Management

Original contributors of this subsection are: John Shalf (LBNL), Satoshi Matsuoka (TITECH, JP)

Power management for exascale systems is to keep best attainable performance with minimum power consumption. This comprises allocating power to system components actively involved in application or algorithm execution. According to the authors, existing power management infrastructure has been derived from consumer electronic devices, and fundamentally never had large-scale systems in mind. Existence of cross-cutting power management infrastructure is mandatory. Absence of such infrastructure will force the reduction of exascale systems scale and feasibility. For large HPC systems power is part of the total-cost-of-ownership. It will be a critical part of exascale systems management. Accordingly, authors are proposing two alternative R&D strategies:

  • Power down components when they are underutilized. For example, the OS can reduce the frequency and operating voltage of a hardware component when it is not used for relatively long time.
  • Explicitly manage data movement, which is simply avoid unnecessary data movement. This should reduce power consumption in networks, hard-disks, memory, etc.

Authors suggest five main research areas for effective power management inside exascale systems:

  • OS based power management. Authors believe that two changes should be considered: (1) Fair shared resources management among hundreds or thousands of processors on the same machine, (2) Ability to manage power levels for heterogeneous architectures inside the same machine, such as GPGPUs
  • System-Scale Resource Management. Standard interfaces need to be developed allowing millions of cores work in complete synchrony to implement effective power management policies.
  • Algorithms. Power aware algorithms are simply those algorithms that would reduce communication overhead for each FLOP. Libraries should be considered to articulate the tradeoffs between communication, power, and FLOPs.
  • Libraries. According to the authors, library designers need to use their domain-specific knowledge of the algorithm to provide power management and policy hints to the power management infrastructure.
  • Compilers. Compilers should make it easier to program for power management by automatically instrument code for power management.
  • Applications. Applications should provide power aware systems and libraries hints about their power related policies for best power optimization.

Given these possible research areas going across the whole software stack, authors believe that the following should be the key metrics to get effectively manage power consumption of exascale systems:

  • Performance. Ability to predict execution pattern inside applications would help in reducing power consumption while attaining the best possible performance.
  • Programmability. Applications developers are not expected to do power management explicitly inside their applications. Coordination between all layers of the software stack should be possible for power management.
  • Composability. Power management components built by different teams should be able to work in harmony when it comes to power management.
  • Scalability, which requires integration of power management information for system wide power management policies.

Performance Optimization

Original contributors of this subsection are: Brend Mohr (Juelich, DE), Adolfy Hoisie (LANL), Matthias Mueller (TU Dresden, DE), Wolfgang Nagel (Dresden, DE), David Skinner (LBL) Jeffrey Vetter (ORNL)

That’s one of my favorite subsections. Expected increase in hardware and software stack complexity makes performance optimization a very complex task. Having millions or billions of threads working on the same problem requires different ways to measure and optimize performance. Authors believe that these areas are important in performance optimization for exascale systems: statistical profiling, techniques like automatic or automated analysis, advanced filtering techniques, on-line monitoring, clustering and analysis as well as data mining. Also authors believe that self-monitoring, self-tuning frameworks, middle ware, and runtime schedulers, especially at node levels, are necessary. Capturing system’s performance under constraints of power and reliability need to be radically changed. Significant overhead may take place to aggregate performance measurements and analyze them while system is running if not properly designed with the new tools. Authors believe that the complexity of exascale systems makes performance optimization in many configurations beyond humans’ manual abilities to monitor and optimize performance. They see that auto-tuning will be an important technique for performance optimization. Hence, authors believe that research in performance optimization should be directed to these areas:

  • Support for modeling, measurement, and analysis of heterogeneous hardware systems.
  • Support for modeling, measurement and analysis of hybrid programming models (mixing MPI, PGAS, OpenMP and other threading models, accelerator interfaces).
  • Automated / automatic diagnosis / autotuning.
  • Reliable and accurate performance analysis in presence of noise, system adaptation, and faults requires inclusion of appropriate statistical descriptions.
  • Performance optimization for other metrics than time (e.g. power).
  • Programming models should be designed with performance analysis in mind. Software and runtime systems must expose their model of execution and adaptation, and its corresponding performance through a (standardized) control mechanism in the runtime system.


Original contributors of this subsection are: Thomas Sterling (LSU), Hiroshi Nakashima (Kyoto U., JP)

Programmability of exascale systems is another critical factor for their success. It is quite difficult to benchmark it and find a baseline to set and measure our objectives in this area. However, authors identified the following basic challenges of systems’ programmability:

  • Massive parallelism though millions or billions of concurrent collaborating threads.
  • Huge number of distributed resources and difficulty of allocation and locality management.
  • Latency hiding by overlapping computations with communications.
  • Hardware Idiosyncrasies. Different models will emerge with significant differences in ISA, memory hierarchy, etc.
  • Portability. Application programs must be portable across machine types, machine scales, and machine generations.
  • Synchronization Bottlenecks of millions of threads trying to synchronize control or data access.
  • Data structures representation and distribution.

If you have read the other postings summarizing rest of these document, you will realize how complicated programmability is. It is cross cutting all the layers of the software stack, starting from ISA & operating systems and ending with applications. Going through author’s suggested research agenda, I found out that they are recommending all R&D directions proposed by the rest of the authors in their corresponding stack layer/component. I would recommend you to read the other related posting to realize challenges waiting for researchers to make exascale systems easier to program and utilize.

This posting is part of a series summarizing the roadmap document of the Exascale Software Project:

I was teaching a microprocessors’ design internals course last fall semester. I planned at the beginning to give my students the opportunity to design a toy microprocessor and optimize important performance factors, such as the pipelining, branch prediction, instructions issuance, etc. However, I decided to link them to the industry and give them a project to implement a simple parallel algorithm on the Cell Broadband Engine and monitor critical performance factors. At the end my objective was to teach them through a real processor possible design tradeoffs and their effect on performance and general effectiveness of the microprocessor. So I had 21 teams of 4 or 5 students working on different discrete algorithms, such as sorting, prime checking, matrix multiplication, and Fibonacci calculations. I asked them to submit a report and write at end of it possible architectural improvements to boost the Cell processor’s performance from their project’s experiences. I found out some interesting conclusions that worth sharing here with you. I reworded some of these suggestions and added some details since they are extracted from a different context.

  • Instructions fetching unit inside the SPEs may suffer from starvation if we have a lot of DMA requests that must be served. This can take place because the high priority assigned to the DMA requests inside the SPE. IBM suggests balancing your source code and including IFETCH instruction to give instructions fetching unit time to fetch more instructions from the cache. Some students suggested including a separate instructions cache; this would make instructions fetching independent from the DMA requests or registers load/store instructions. This should solve this problem and avoid some of the coding complexity while programming the Cell. Also given most of the applications written on the Cell, the text size is relatively very small. So if 64KB cache for code is built inside the next generation of the Cell processor may boost performance and guarantee most of the time smooth instructions execution.
  • A lot of the vector intrinsics were for vectors of floats. Many operations were required for vectors of integers. Students had to type cast to floats before using many of the vector operations, which of course may provide inaccurate answers and consumes more time.
  • Of course the most commonly asked improvement is increasing the LS size inside each SPE. The main reason for some students is to include more buffers and utilize more the multi-buffering technique and better performance at the end.
  • Other students went wild and suggested to change the priorities of the DMA, IFETCH, and MEM operations within the SPEs. Instead of having them DMA>MEM>IFETCH, they suggested to invert them to avoid starvation of instructions fetch unit.
  • Another worth mentioning suggestion is to create memory allocation function that would guarantee allocation of large data structures to different memory banks, which would reduce the DMA latency. For example, if we need a large array and each range will be accessed by a different SPE, we can allocate this array into different memory banks to avoid hotspots inside memory while the SPEs are executing. It is already done by the IBM’s FFT implementation on the Cell processors.

Of course I filtered out some suggests that are of a common sense to any programmer, such as avoiding the 16 bytes memory alignment. I was impressed by their ability to understand and pinpoint some serious problems inside the CBE in less than 6 weeks period.

CSEN 702 Class: Thanks!

In this blog post I’m summarizing the two remaining subsections of the development environment: numerical libraries, and Debugging tools. In my last posting I summarized the first three subsections: programming models, frameworks, and compilers.

Numerical Libraries

Original Contributors are: Jack Dongarra (U. of Tennessee), Bill Gropp (UIUC), Mike Heroux (SNL), Anne Trefethen (Oxford U., UK) Aad van der Steen (NCF, NL)

Numerical libraries are necessary to separate HPC applications developers from a lot of architectural details. However, numerical libraries can be considered as reusable applications. Hence, their technology drivers are the common drivers mentioned in the programming model and systems area. Specifically, numerical libraries are driven by: hybrid architectures, programming models, accuracy, fault detection, energy budget, memory hierarchy and the relevant standards. Authors believe that numerical libraries have three main R&D alternatives:

  • Message Passing Libraries
  • Develop Global Address Space Libraries
  • Message Driven Work Queues

Authors believe that “the existing numerical libraries will need to be rewritten and extended in the light of the emerging architectural changes.” Therefore, the following should be the R&D agenda for the coming 10 years:

  • Develop hybrid and hierarchal-based libraries/applications.
  • Develop auto-tuning mechanisms inside these libraries to get best possible performance.
  • Consider fault oblivious and error tolerant implementations.
  • Link precision and performance with energy saving
  • Build algorithms that adapt with the architectural properties with minimal or no changes.
  • Build algorithms that minimize communication inside these parallel libraries.
  • Enhance current shared memory models as they will continue in parallel programming for many years to come.


Debugging Tools

Original contributors of this subsection are: David Skinner (LBL), Wolfgang Nagel (Dresden, DE).

Massive parallelism inside Exascale systems is making a big challenge for debugging tools. Debugging tools are built to help discovering unintended behavior inside applications and assist in solving it. However, inside applications running on thousands of cores with thousands of concurrent threads it is a very difficult task. Debuggers may have to look into the operating system and possible hardware interactions that may cause failures inside applications. The authors believe that the following are the technology drivers for the debugging tools:

  • Concurrency driven overhead in debugging
  • Scalability of debugger methodologies (data and interfaces)
  • Concurrency scaling of the frequency of external errors/failures
  • Heterogeneity and lightweight operating systems

The authors did not talk about R&D alternatives for debugging tools. However, they stressed on two possible improvement areas: (1) Include debugging tools and APIs at each layer, which can emit necessary debugging information to the upper layers, and (2) Integrate debugging information from all layers and provide different debugging contexts to pinpoint root causes of bugs. The authors then move into their suggested research agenda and propose these research areas for Exascale debugging tools:

  • Finding ways to cluster or group threads for easier monitoring and root cause identification.
  • Interoperate debugging tools with fault-tolerance mechanisms. This should guarantee continuous debugging, which makes it easier to find a bug that is very hard to reproduce.
  • Vertical integration of debug and performance information across software layers.
  • Automatically triggered debugging – Instead of debugging test cases in a separate session, some exascale debugging must be delivered just-in-time as problems unfold.


Next Time

In my next posting I will summarize the third section of this document, the applications. It is an interesting section for end users since it discusses challenges in algorithms, data analysis and visualization, and scientific data management in the exascale computing. It is a pragmatic perspective for potential exascale users.

This posting is part of a series summarizing the roadmap document of the Exascale Software Project:

Welcome to part 4 of my series summarizing the exascale software roadmap document. This document is a produced through a series of meetings by scientists and researchers in different areas of HPC envisioning the software stack for million cores machines. That’s a machine that is due soon in this decade with exascale computing power. In last two blog posts I summarized the Systems Software, which is concerned of operating systems, run-time systems, I/O, and systems management. This blog posting and the next one is discussing exascale project vision about the development environment, which includes interesting topics, mainly: programming models, frameworks, compilers, numerical libraries, and debugging tools. I think this section is of great importance to both computer scientists and researchers from other fields of science. It is concerned of the direct tools to build and implement needed applications or algorithms. So let’s get started.

Programming Models

Original contributors of this section are: Barbara Chapman (U. of Houston), Mitsuhisa Sato, (U. of Tsukuba, JP), Taisuke Boku (U. of Tsukuba, JP), Koh Hotta, (Fujitsu), Matthias zueller (TU Dresden, DE), Xuebin Chi (Chinese Academy of Sciences)

Authors believe that 7 technology drivers will affect the programming models significantly in this decade:

  • Increased number of nodes and explosion in the number of cores in nodes which mandates from programming models to work at different granularity levels.
  • Heterogeneity of processors, which makes a basic task of the programming models to abstract such heterogeneity.
  • Increased number of components increases the likelihood of failures to occur. Programming models should be resilient to such failures.
  • Changing nature and trends in I/O usage push programming models to consider more seriously expected I/O complexities.
  • Applications’ complexity will increase dramatically. Programming models should simplify parallel programming and help developers focus on the application or algorithm implementation rather than architectural related concerns.
  • Increased depth of software stack mandates from the programming models to detect and report failures at the proper abstraction level.


Based on these foreseen drivers, the following R&D alternative are available for the community:

  • Hybrid versus Uniform programming model. Hybrid may provide better performance but very difficult to learn and use. Uniform programming models are easier to program with; however, their abstractions may reduce performance.
  • Domain specific versus general programming models. Domain specific may provide better portability and performance compared to the general models in some application areas.
  • Widely embraced standards versus single implementation. The second option is faster to implement but the first strategy would provide more support for the applications developers.

It is very difficult to decide which of these alternatives to choose. However, it is a fact right now that most of the HPC systems will be built out of heterogeneous architectures to accelerate the compute intensive parts within the applications. This will impose the usage of hybrid programming models such as MPI and OpenMP or MPI and CUDA. According to the authors, they key for a successful programming models development is to link existing models for faster and better productivity. Such integration may give corresponding community more ideas about building a new programming model that provides unified programming interface.


Original contributors of this section are: Michael Heroux and Robert Harrison

Frameworks should provide a common collection of interfaces, tools and capabilities that are reusable across a set of related applications. It is always a challenging task for HPC systems due to their inherit complexity. I think there is some redundancy in this section. The main technology drivers I could get from this section are:

  • New applications will be implemented on top of the exascale systems. Current frameworks should be revisited to satisfy the new possible needs.
  • Scalability and extensibility are very important factors that need reconsideration due to the hybrid systems a variability of applications as well.

According to the authors, we have two options in such case:

  • No Framework. In this case a single application can be developed faster. However, a lot of redundancy will exist if ware adopting that option for all applications running on top of the exascale infrastructure.
  • Clean-Slate Framework. It takes time to develop such frameworks. However, it depends on the other components of the exascale software stack. If a revolutionary option chosen in the other components (e.g. new OS, programming model, etc.), which is less likely to occur, a new framework will be required to link all these components together.

The authors are concluding by suggesting two main critical directions for a proper framework tying all the exascale software components together:

  1. Identify and develop cross-cutting algorithm and software technologies, which is relatively easy to do, based on the experiences of the last few years on the multi- and many-core architectures.
  2. Refactoring for manycore, which is doable by understanding the common requirements of manycore programming that will be true regardless of the final choice in programming models, such as load balancing, fault tolerance, etc.


Original contributors of this section are: Barbara Chapman (U. of Houston), Mitsuhisa Sato, (U. of Tsukuba, JP), Taisuke Boku (U.of Tsukuba, JP), Koh Hotta, (Fujitsu), Matthias Mueller (TU Dresden), Xuebin Chi (Chinese Academy of Sciences)

Compilers are a critical component in implementing the foreseen programming models. The following technology trends might be the main drivers for compilers design and development for the exascale software stack:

  • Machines will have hybrid processors. Compilers are expected to generate code and collaborate with run-time libraries working on different types of processors at the same time.
  • Memory hierarchies will be highly complex; memory will be distributed across the nodes of exascale systems and will be NUMA within the individual nodes, with many levels of cache and possibly scratchpad memory. Compilers will be expected to generate code that exhibits high levels of locality in order to minimize the cost of memory accesses.

Authors of this section are using the same R&D alternatives of the programming models for the compilers. Therefore, they are proposing the following research points for compilers (I’m including important ones):

  • Techniques for the translation of new exascale programming models and languages supporting high productivity and performance, support for hybrid programming models and for programming models that span heterogeneous systems.
  • Powerful optimization frameworks; implementing parallel program analyses and new, architecture-aware optimizations, including power, will be key to the efficient translation of exascale programs.
  • Exascale compilers could benefit from recent experiences with just-in-time compilation and perform online feedback-based optimizations, try out different optimizations, generate multiple code versions or perform more aggressive speculative optimizations.
  • Implement efficient techniques for fault tolerance.
  • Compilers should interact with the development tools run-time environment for automatically instrumenting tools.
  • Compilers may be able to benefit from auto-tuning approaches, may incorporate techniques for learning from prior experiences, exploit knowledge on suitable optimization strategies that is gained from the development and execution environments, and apply novel techniques that complement traditional translation strategies.

Next Time

My next blog post will be handling important two subsections: numerical libraries and debugging tools.


This posting is part of a series summarizing the roadmap document of the Exascale Software Project:


In my last posting, I summarized three areas of the Systems Software chapter from the International Exascale Software Project (IESP). I continue in this posting and summarize for you the remaining two areas: System Management, and External Environment.

System Management

Original contributors of this section are: Robert Wisniewski (IBM) and Bill Kramer (NCSA)

The authors are dividing the systems management into five subareas that need reconsideration for the exascale computing:

  1. Resource control and scheduling, which includes configuring, start-up and reconfiguring the machine, defining limits for resource capacity and quality, provisioning the resources, and workflow management.
  2. Security, which includes authentication and authorization, integrity of the system, data integrity, and detecting anomalous behavior and inappropriate use.
  3. Integration and test, which involves managing and maintaining the health of the system and performing continuous diagnostics.
  4. Logging, reporting, and analyzing information.
  5. External coordination of resources, which is how the machine coordinates with external components.

Considering these areas and their implications on system management, the following will be the main technology drivers for the systems management:

  • All system management tasks, such as integrating new nodes, moving right data to the right place, and responding to security comprises, must be responsive. In other words, these tasks should be autonomous and proactive to reach proper response requirements.
  • Data movement will be constrained rather than processing time at the exascale computing. Hence, resource control and management – and the utilization logs for resources – has to change focus to communications and data movement.
  • Security management will be more complex. Variability of the system will impose building more security components and integrate them in many subsystems.
  • The effect of security policies on performance will be more significant due to expected exascale system’s complexity. Security tools should be redesigned with performance perspective in mind.

Authors are offering two R&D alternatives for the systems management. First one is to use evolutionary method and extend the terascale and petascale tools. This will result, according to the authors, a lot of inefficiencies in the exascale systems. The second alternative involves borrowing some techniques and policies from telecommunication and real-time systems, such as statistical learning.

The authors are then recommending the research agenda till year 2020. They wrote them in bullet format, which is nice to only list below:

Category 1) “Resource control and scheduling” and “External coordination of resources”

  • Need to better characterize and manage non-traditional resources such as power and I/O bandwidth
  • Determine how to manage and control communication resources – provision and control, different for HPC than WAN routing
  • Determine and model real-time aspects of Exascale system management and feedback for resource control
  • Develop techniques for dynamic provision under constant failure of components
  • Coordinated resource discovery and scheduling aligned

Category 2) “Security”

  • Fine grained authentication and authorization by function/resources
  • Security Verification for SW built from diverse components
  • Provide appropriate “Defense in depth” within systems without performance or scalability impact.
  • Develop security focused OS components in X-stack.
  • Assess and improve end-to-end data integrity.
  • Determine guidelines and tradeoffs of security and openness (e.g. grids).

Category 3) “Integration and test” and “Logging, reporting, and analyzing information”

  • Determine key elements for Exascale monitoring
  • Continue mining current and future Petascale failure data to detect patterns and improvements
  • Determine methods for continuous monitoring and testing without affecting system behavior
  • Investigate improves information filters; provide stateful filters for predicting potential incorrect behavior
  • Determine statistical and data models that accurately capture system behavior
  • Determine proactive diagnostic and testing tools

External Environment

This section is to be filled in the document. But the document is setting the scope of the external environment to refer to the essential interfaces to remote computational resources (e.g. data repositories, real time data streams, high performance networks, and computing clouds).

I will keep an eye on the newer versions of this document and update this section if contributors are chosen for this task.

Next Time

We are done finally with the systems software layer. It is complicated layer but very critical for the success of the exascale software project. This layer is tying all system components together and makes them easier to use by the next layer, the programming and execution models.

My next posting will be summarizing the Development Environment section. It will discuss the technology drivers, upcoming challenges for the exascale systems, and recommended research agenda for the components of the development environment.

This posting is part of a series summarizing the roadmap document of the Exascale Software Project:

The first layer that should be considered is the systems software. This posting has interesting points gathered from the International Exascale Software Project (IESP) Roadmap document, specifically the systems software section.

Systems software was identified as one of the paths to the new software stack of million cores machines. Systems software consists of four main areas: (1) Operating Systems, (2) Run-Time Systems, (3) I/O Systems, (4) Systems Management and (4) External Environment.

In this posting I will be summarizing the first three areas: (1) Operating Systems, (2) Run-Time Systems, and (3) I/O Systems

Operating Systems

Original content of this section contributed by: Barney Maccabe (ORNL), Pete Beckman (ANL), Fred Johnson (DOE).

It starts by discussing the technology drivers for operating systems in exascale era:

  1. Resources that operating systems will be responsible to manage effectively will get more complex. For example the increasing number of cores and heterogeneity of these cores will make effective management of shared bus and memory critical factors of system performance.
  2. There will be an increasing emphasis on data-centric computations and that programming models will continue to emphasize the management of distributed memory resources.
  3. Multiple programming models may be used within a single program, which mandates from operating systems to provide common APIs in addition to architecture specific ones.

Given these trends, the authors are suggesting two operating systems R&D alternatives to bridge the gap between rapid changes in hardware platforms and old operating systems:

  1. Develop from scratch operating systems for many-core machines, which will require huge effort and might be impractical given efforts and industry reliance on current operating systems.
  2. Evolving existing operating systems, which are burdened with old design concepts. However, it is easier to adapt this option.

It is likely that operating systems will evolve gradually to adopt the new scope of resources management. Development efforts will start by defining a framework for HPC systems, which should take place in years 2010 and 2011. Contributors believe the following areas should be researched actively:

  • Fault tolerant/masking strategies for collective OS services
  • Strategies and mechanisms for power/energy management
  • Strategies for simulating full-scale systems


Run-Time Systems

Original contributors of this section are: Jesus Labarta (BSC, ES), Rajeev Thakur (ANL), Shinji Sumimoto (Fujitsu)

The authors believe that “The design of tomorrow’s runtime systems will be driven not only by dramatic increases in overall system hierarchy and high variability in the performance and availability of hardware components, but also by the expected diversity application characteristics, the multiplicity of different types of devices, and the large latencies caused by deep memory subsystems.” Such drivers will impose two important run-time systems design considerations: (1) power/energy constraints, and (2) application development cost. In other words, run-time systems can provide fairly accurate picture of the resources utilization, such ability makes it possible for run-time systems to get the best performance/power ratio in such massively parallel systems. Accordingly, there are two R&D alternatives for the run-time systems:

  1. Flat Model run-time Systems, which uses message passing regardless of the target thread location (e.g. within the same node or at another node)
  2. Hierarchal Model Run-Time Systems, which combines shared memory and message passing according to different run-time parameters, such as the message size, frequency of communication, etc.


Based on these alternatives and the technology drivers for the run-time systems, it is recommended to work on four priority research directions:

  • Heterogeneity. Run-time systems should abstract the heterogeneity of architecture and make applications portable to different architectures.
  • Load Balance. “Research in this direction will result in self-tuned runtimes that will counteract at fine granularity unforeseen variability in application load and availability and performance of resources, thus reducing the frequency at which more expensive application-level rebalancing approaches will have to be used.”
  • Flat Run-Times. Run-time systems should be scalable to the expected number of cores while optimizing all run-time services such as message passing, synchronization, etc.
  • Hierarchical/hybrid runtimes. How run-times can be mapped to the semantics of different architectures without losing performance and keeping a unified semantics across different platforms. This may motivate researches to experiment on different hierarchical integrations of runtimes to support models, such as MPI+other threading or task based models, threading models+accelerators, MPI+threading+accelerators, MPI+PGAS, and hierarchical task-based models with very different task granularities at each level.


I/O Systems

The original contributors of this section are: Alok Choudhary (Northwestern U.), Yutaka Ishikawa (U. of Tokyo, JP)

The authors believe that because I/O systems were designed as separate independent components from the compute infrastructure, they have already shown not to be scalable as needed. Therefore, “emerging storage devices such as solid-state disks (SSDs) or Storage Class Memories (SCM) have the potential to significantly alter the I/O architectures, systems, performance and the software system to exploit them. These emerging technologies also have significant potential to optimize power consumption. Resiliency of an application under failures in an exascale system will depend significantly on the I/O systems, its capabilities, capacity and performance because saving the state of the system in the form of checkpoints is likely to continue as one of the approaches.”

Based on these technology changes, the authors see the following possible research areas in I/O systems:

  • Delegation and Customization within I/O Middleware. Doing customization within the user space is a very good option since information about the data semantics and usage pattern can be captured effectively at this level. This should be done not for single process but across maybe all processes utilizing a single system. These middleware layers can utilize such information in intelligent and proactive caching, data reorganization, optimizations, smoothening of I/O accesses from bursty to smooth patterns.
  • Active Storage and Online Analysis. Active storage involves utilizing available compute resources to perform data analysis, organization, redistribution, etc. Online analysis can reduce storage needs through storing meta data about the stored data and possible regenerate it when acquired.
  • Purpose-Driven I/O Software Layers. I/O systems will be aware of how data will be used and accordingly data will be stored and index.
  • Software Systems for Integration of Emerging Storage Devices. Research and development of newer I/O models, and different layers of software systems including file system and middleware would be very important for the exploitation of these devices.
  • Extend Current File Systems.
  • Develop New Approach to Scalable Parallel File Systems.
  • Incorporate I/O into Programming Models and Languages. Integration would make it easier to predict the storage or reading pattern and accordingly build more efficient mechanisms, such as I/O caching, scheduling, pipelining, etc.
  • Wide-Area I/O and integration of external Storage Systems.


Next Time

In my next posting will summarize the other two areas falling under the systems software: Systems Management, and External Environments. Meanwhile, tell me what do you think about these areas as potential research directions for HPC systems working on million cores machines. Do you think that these changes will take place in coming 10 years? Does your research area fall under any of them? Would you like to add more to these directions?

This posting is part of a series summarizing the roadmap document of the Exascale Software Project: