In my last posting, I summarized three areas of the Systems Software chapter from the International Exascale Software Project (IESP). In this posting I summarize the remaining two areas: System Management and External Environment.

System Management

The original contributors of this section are Robert Wisniewski (IBM) and Bill Kramer (NCSA).

The authors divide system management into five subareas that need to be reconsidered for exascale computing:

  1. Resource control and scheduling, which includes configuring, starting up, and reconfiguring the machine, defining limits for resource capacity and quality, provisioning the resources, and workflow management.
  2. Security, which includes authentication and authorization, integrity of the system, data integrity, and detecting anomalous behavior and inappropriate use.
  3. Integration and test, which involves managing and maintaining the health of the system and performing continuous diagnostics.
  4. Logging, reporting, and analyzing information.
  5. External coordination of resources, which is how the machine coordinates with external components.

Considering these areas and their implications, the authors identify the following main technology drivers for system management:

  • All system management tasks, such as integrating new nodes, moving the right data to the right place, and responding to security compromises, must be responsive. In other words, these tasks should be autonomous and proactive in order to meet response requirements.
  • At exascale, data movement rather than processing time will be the constrained resource. Hence, resource control and management, along with the utilization logs for resources, have to shift focus to communications and data movement.
  • Security management will be more complex. System variability will require building more security components and integrating them into many subsystems.
  • The effect of security policies on performance will be more significant because of the expected complexity of exascale systems. Security tools should be redesigned with performance in mind.
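
To make the second driver more concrete, here is a minimal sketch of what "changing focus to data movement" could look like in a placement decision. It is my own illustration, not something from the roadmap: the `Node` class, `bytes_to_move`, and `pick_node` are hypothetical names, and the dataset sizes are made up. The idea is simply to rank candidate nodes by how many bytes would have to be transferred, and use idle cores only as a tie-breaker.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical view of a compute node as seen by a scheduler."""
    name: str
    idle_cores: int
    local_datasets: frozenset  # names of datasets already resident on this node


def bytes_to_move(node, needed, sizes):
    """Bytes that must be transferred to `node` before the job can start."""
    return sum(sizes[d] for d in needed if d not in node.local_datasets)


def pick_node(nodes, needed, sizes, cores_required=1):
    """Choose the node with the least data to move; break ties by idle cores."""
    candidates = [n for n in nodes if n.idle_cores >= cores_required]
    if not candidates:
        return None
    return min(
        candidates,
        key=lambda n: (bytes_to_move(n, needed, sizes), -n.idle_cores),
    )


# Illustrative data only: node "a" has more idle cores, node "b" holds the big dataset.
sizes = {"mesh": 40_000_000_000, "params": 1_000_000}
nodes = [
    Node("a", idle_cores=64, local_datasets=frozenset()),
    Node("b", idle_cores=8, local_datasets=frozenset({"mesh"})),
]
chosen = pick_node(nodes, {"mesh", "params"}, sizes)
print(chosen.name)
```

A scheduler that ranks by idle cores would pick node "a" here; ranking by data movement picks node "b", which already holds the large dataset. A real resource manager would also have to account for network topology and contention, which this sketch ignores.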

The authors offer two R&D alternatives for system management. The first is to take an evolutionary approach and extend the terascale and petascale tools. According to the authors, this will result in many inefficiencies in exascale systems. The second alternative involves borrowing techniques and policies, such as statistical learning, from telecommunication and real-time systems.

The authors then recommend a research agenda through the year 2020. They present it in bullet format, so I simply list the items below:

Category 1) “Resource control and scheduling” and “External coordination of resources”

  • Need to better characterize and manage non-traditional resources such as power and I/O bandwidth
  • Determine how to manage and control communication resources – provision and control, different for HPC than WAN routing
  • Determine and model real-time aspects of Exascale system management and feedback for resource control
  • Develop techniques for dynamic provision under constant failure of components
  • Coordinated resource discovery and scheduling aligned

Category 2) “Security”

  • Fine grained authentication and authorization by function/resources
  • Security Verification for SW built from diverse components
  • Provide appropriate “Defense in depth” within systems without performance or scalability impact.
  • Develop security focused OS components in X-stack.
  • Assess and improve end-to-end data integrity.
  • Determine guidelines and tradeoffs of security and openness (e.g. grids).

Category 3) “Integration and test” and “Logging, reporting, and analyzing information”

  • Determine key elements for Exascale monitoring
  • Continue mining current and future Petascale failure data to detect patterns and improvements
  • Determine methods for continuous monitoring and testing without affecting system behavior
  • Investigate improved information filters; provide stateful filters for predicting potential incorrect behavior
  • Determine statistical and data models that accurately capture system behavior
  • Determine proactive diagnostic and testing tools
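
The roadmap does not say what a "stateful filter" should look like, so the following is only a sketch under my own assumptions. A stateless filter judges each log event in isolation; a stateful one remembers recent events and can flag a component whose events cluster in time. The class name `StatefulFilter`, the window length, and the threshold are all hypothetical, and the code assumes events arrive in non-decreasing timestamp order.

```python
from collections import defaultdict, deque


class StatefulFilter:
    """Suppress isolated events; flag a component whose events cluster in time."""

    def __init__(self, window_seconds=60.0, threshold=3):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self._recent = defaultdict(deque)  # component -> timestamps inside the window

    def observe(self, timestamp, component):
        """Record one event; return True if the component now looks suspect."""
        recent = self._recent[component]
        recent.append(timestamp)
        # Drop timestamps that have fallen out of the sliding window.
        while recent and timestamp - recent[0] > self.window_seconds:
            recent.popleft()
        return len(recent) >= self.threshold


# Illustrative events only: (seconds, component) pairs.
events = [(0, "node17"), (10, "node42"), (20, "node17"), (45, "node17"), (200, "node17")]
log_filter = StatefulFilter(window_seconds=60, threshold=3)
flags = [log_filter.observe(t, c) for t, c in events]
print(flags)
```

Only the fourth event is flagged: it is the third event from node17 within 60 seconds. The last event is not flagged because the earlier ones have aged out of the window by then. Predicting incorrect behavior at exascale would need far richer state and models than a per-component counter, but the sketch shows the difference between filtering on one event and filtering on history.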

External Environment

This section has not yet been filled in the document. However, the document sets the scope of the external environment as the essential interfaces to remote computational resources (e.g. data repositories, real time data streams, high performance networks, and computing clouds).

I will keep an eye on the newer versions of this document and update this section if contributors are chosen for this task.

Next Time

We are finally done with the systems software layer. It is a complicated layer, but it is critical for the success of the exascale software project. This layer ties all system components together and makes them easier to use by the next layer, the programming and execution models.

My next posting will summarize the Development Environment section. It will discuss the technology drivers, the upcoming challenges for exascale systems, and the recommended research agenda for the components of the development environment.

This posting is part of a series summarizing the roadmap document of the Exascale Software Project: