The first layer that should be considered is the systems software. This posting has interesting points gathered from the International Exascale Software Project (IESP) Roadmap document, specifically the systems software section.

Systems software was identified as one of the paths to the new software stack for million-core machines. Systems software consists of five main areas: (1) Operating Systems, (2) Run-Time Systems, (3) I/O Systems, (4) Systems Management, and (5) External Environments.

In this posting I will summarize the first three areas: (1) Operating Systems, (2) Run-Time Systems, and (3) I/O Systems.

Operating Systems

Original content of this section contributed by: Barney Maccabe (ORNL), Pete Beckman (ANL), Fred Johnson (DOE).

The section starts by discussing the technology drivers for operating systems in the exascale era:

  1. The resources that operating systems must manage effectively will become more complex. For example, the growing number of cores and their heterogeneity will make effective management of the shared bus and memory critical to system performance.
  2. There will be an increasing emphasis on data-centric computations, and programming models will continue to emphasize the management of distributed memory resources.
  3. Multiple programming models may be used within a single program, which requires operating systems to provide common APIs in addition to architecture-specific ones.

Given these trends, the authors suggest two operating systems R&D alternatives to bridge the gap between rapidly changing hardware platforms and old operating systems:

  1. Develop operating systems for many-core machines from scratch, which would require a huge effort and might be impractical given existing efforts and the industry's reliance on current operating systems.
  2. Evolve existing operating systems, which are burdened with old design concepts. However, this option is easier to adopt.

It is likely that operating systems will evolve gradually to take on the new scope of resource management. Development efforts will start by defining a framework for HPC systems, which should take place in 2010 and 2011. The contributors believe the following areas should be researched actively:

  • Fault tolerant/masking strategies for collective OS services
  • Strategies and mechanisms for power/energy management
  • Strategies for simulating full-scale systems


Run-Time Systems

Original contributors of this section are: Jesus Labarta (BSC, ES), Rajeev Thakur (ANL), Shinji Sumimoto (Fujitsu)

The authors believe that “The design of tomorrow’s runtime systems will be driven not only by dramatic increases in overall system hierarchy and high variability in the performance and availability of hardware components, but also by the expected diversity application characteristics, the multiplicity of different types of devices, and the large latencies caused by deep memory subsystems.” These drivers impose two important design considerations on run-time systems: (1) power/energy constraints, and (2) application development cost. Because run-time systems can provide a fairly accurate picture of resource utilization, they are well placed to get the best performance/power ratio on such massively parallel systems. Accordingly, there are two R&D alternatives for run-time systems:

  1. Flat Model Run-Time Systems, which use message passing regardless of the target thread location (e.g. within the same node or at another node)
  2. Hierarchical Model Run-Time Systems, which combine shared memory and message passing according to different run-time parameters, such as the message size, frequency of communication, etc.
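
To make the difference between the two models concrete, here is a minimal Python sketch of how each might pick a transport for a single message. Everything in it is an assumption for illustration: the `Endpoint` type, the function names, and the 64 KiB threshold are hypothetical and do not come from the roadmap, which only says that the hierarchical model decides based on run-time parameters such as message size and communication frequency.

```python
from dataclasses import dataclass

# Hypothetical cut-off: messages up to this size are assumed to be
# cheaper to exchange through shared memory inside a node.
SHARED_MEMORY_MAX_BYTES = 64 * 1024


@dataclass(frozen=True)
class Endpoint:
    node: int    # index of the node the thread runs on
    thread: int  # index of the thread within that node


def flat_transport(src: Endpoint, dst: Endpoint, size_bytes: int) -> str:
    """Flat model: always message passing, wherever the target lives."""
    return "message_passing"


def hierarchical_transport(src: Endpoint, dst: Endpoint, size_bytes: int) -> str:
    """Hierarchical model: shared memory for small intra-node messages,
    message passing for everything else (an assumed policy)."""
    same_node = src.node == dst.node
    if same_node and size_bytes <= SHARED_MEMORY_MAX_BYTES:
        return "shared_memory"
    return "message_passing"
```

A real run-time would also weigh communication frequency, as the roadmap notes; the sketch uses only location and size to keep the contrast visible.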


Based on these alternatives and the technology drivers for the run-time systems, it is recommended to work on four priority research directions:

  • Heterogeneity. Run-time systems should abstract the heterogeneity of architecture and make applications portable to different architectures.
  • Load Balance. “Research in this direction will result in self-tuned runtimes that will counteract at fine granularity unforeseen variability in application load and availability and performance of resources, thus reducing the frequency at which more expensive application-level rebalancing approaches will have to be used.”
  • Flat Run-Times. Run-time systems should be scalable to the expected number of cores while optimizing all run-time services such as message passing, synchronization, etc.
  • Hierarchical/hybrid runtimes. The question here is how run-times can be mapped to the semantics of different architectures without losing performance while keeping unified semantics across platforms. This may motivate researchers to experiment with different hierarchical integrations of runtimes to support models such as MPI+other threading or task-based models, threading models+accelerators, MPI+threading+accelerators, MPI+PGAS, and hierarchical task-based models with very different task granularities at each level.
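
The load balance direction above can be illustrated with a small Python sketch of fine-grained rebalancing inside a run-time. It is only an assumed, simplified policy (move one task at a time from the longest worker queue to the shortest); the `rebalance` function and the use of queue length as the load metric are hypothetical and are not taken from the roadmap.

```python
from collections import deque


def rebalance(queues: list[deque]) -> int:
    """Move tasks one at a time from the longest queue to the shortest
    until queue lengths differ by at most one. Returns the number of moves."""
    moved = 0
    while True:
        longest = max(queues, key=len)
        shortest = min(queues, key=len)
        if len(longest) - len(shortest) <= 1:
            return moved
        # Take the most recently queued task from the busiest worker.
        shortest.append(longest.pop())
        moved += 1
```

A production run-time would more likely let idle workers steal tasks on demand and would measure load by cost rather than by task count, but the goal is the same: correct imbalance cheaply inside the run-time so that expensive application-level rebalancing is needed less often.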


I/O Systems

The original contributors of this section are: Alok Choudhary (Northwestern U.), Yutaka Ishikawa (U. of Tokyo, JP)

The authors believe that because I/O systems were designed as components separate from and independent of the compute infrastructure, they have already been shown not to scale as needed. Therefore, “emerging storage devices such as solid-state disks (SSDs) or Storage Class Memories (SCM) have the potential to significantly alter the I/O architectures, systems, performance and the software system to exploit them. These emerging technologies also have significant potential to optimize power consumption. Resiliency of an application under failures in an exascale system will depend significantly on the I/O systems, its capabilities, capacity and performance because saving the state of the system in the form of checkpoints is likely to continue as one of the approaches.”

Based on these technology changes, the authors see the following possible research areas in I/O systems:

  • Delegation and Customization within I/O Middleware. Customizing in user space is a very good option, since information about data semantics and usage patterns can be captured effectively at this level. This should be done not for a single process but possibly across all processes using a single system. The middleware layers can use this information for intelligent and proactive caching, data reorganization, optimizations, and smoothing of bursty I/O accesses into steadier patterns.
  • Active Storage and Online Analysis. Active storage involves utilizing available compute resources to perform data analysis, organization, redistribution, etc. Online analysis can reduce storage needs by storing metadata about the data and possibly regenerating the data when required.
  • Purpose-Driven I/O Software Layers. I/O systems will be aware of how data will be used, and data will be stored and indexed accordingly.
  • Software Systems for Integration of Emerging Storage Devices. Research and development of newer I/O models, and different layers of software systems including file system and middleware would be very important for the exploitation of these devices.
  • Extend Current File Systems.
  • Develop New Approach to Scalable Parallel File Systems.
  • Incorporate I/O into Programming Models and Languages. Integration would make it easier to predict the storage or reading pattern and accordingly build more efficient mechanisms, such as I/O caching, scheduling, pipelining, etc.
  • Wide-Area I/O and integration of external Storage Systems.
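
As an illustration of the first item in this list, here is a minimal Python sketch of a middleware layer that turns bursty, irregular writes into fixed-size chunks before they reach storage. The `WriteAggregator` class, its `sink` callback, and the chunk size are all hypothetical names and values chosen for the example; the roadmap does not prescribe any particular mechanism.

```python
from typing import Callable


class WriteAggregator:
    """Buffers small writes in user space and forwards them to the
    storage layer (the `sink` callable) in fixed-size chunks."""

    def __init__(self, sink: Callable[[bytes], None], chunk_bytes: int = 4096):
        self.sink = sink
        self.chunk_bytes = chunk_bytes
        self.buffer = bytearray()

    def write(self, data: bytes) -> None:
        self.buffer.extend(data)
        # Emit as many full chunks as the buffer currently holds.
        while len(self.buffer) >= self.chunk_bytes:
            self.sink(bytes(self.buffer[:self.chunk_bytes]))
            del self.buffer[:self.chunk_bytes]

    def flush(self) -> None:
        # Forward whatever is left, e.g. at a checkpoint boundary.
        if self.buffer:
            self.sink(bytes(self.buffer))
            self.buffer.clear()
```

In use, an application would call `write` as often as it likes and `flush` at well-defined points; the storage layer then sees a regular stream of `chunk_bytes`-sized requests instead of many small ones.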


Next Time

In my next posting I will summarize the other two areas that fall under systems software: Systems Management and External Environments. Meanwhile, tell me what you think about these areas as potential research directions for HPC systems running on million-core machines. Do you think these changes will take place in the coming 10 years? Does your research area fall under any of them? Would you like to add more to these directions?

This posting is part of a series summarizing the roadmap document of the Exascale Software Project: