Exascale Computing



Welcome to the sixth part summarizing the roadmap document of the International Exascale Software Project (IESP). This part moves higher in the software stack and discusses the challenges, technology drivers, and suggested research agenda for exascale applications. It focuses on algorithms, data analysis and visualization, and scientific data management.

Algorithms

Original authors of this subsection are: Bill Gropp (UIUC), Fred Streitz (LLNL), Mike Heroux (SNL), Anne Trefethen (Oxford U., UK)

The authors see the following as the important drivers for exascale algorithms:

  • Scalability to millions or billions of threads working collaboratively or independently.
  • Fault tolerance and fault resilience. Algorithms should be able to detect faults inside the system and recover from them with the least possible resources, moving away from traditional checkpointing techniques.
  • Matching an algorithm’s compute demands to the available capabilities. The authors mainly focus on power-aware algorithms. This is attainable through accurate performance modeling, utilizing heterogeneous architectures, and awareness of the memory hierarchy and performance constraints of each component inside the exascale system.

The authors then discuss the critical challenges, stressing the importance of developing performance models against which algorithm developments can be evaluated.

I could not find a clear research agenda in this section. However, the authors suggest the following critical challenges for applications in the era of exascale computing:

  • Gap analysis – need to perform a detailed analysis of the applications, particularly with respect to quantitative models of performance and scalability.
  • Scalability, particularly relaxing synchronization constraints (a minimal sketch follows this list)
  • Fault tolerance and resilience, including fault detection and recovery
  • Heterogeneous systems – algorithms that are suitable for systems made of functional units with very different abilities
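Relaxed synchronization is easiest to see in code. Below is a minimal sketch, in C with OpenMP, of an asynchronous Jacobi-style relaxation in which threads sweep their subdomains without a barrier between sweeps. This is my illustration of the idea, not an algorithm from the roadmap, and the convergence behavior of such chaotic iterations is exactly the open research question.

```c
#include <stdio.h>
#include <omp.h>

#define N      1024
#define SWEEPS 100

static double x[N + 2];           /* 1D grid with fixed boundaries */

int main(void) {
    x[0] = 0.0; x[N + 1] = 1.0;
    for (int i = 1; i <= N; i++) x[i] = 0.5;

    #pragma omp parallel
    {
        int nt  = omp_get_num_threads();
        int tid = omp_get_thread_num();
        int lo  = 1 + tid * (N / nt);
        int hi  = (tid == nt - 1) ? N + 1 : lo + N / nt;

        for (int s = 0; s < SWEEPS; s++) {
            for (int i = lo; i < hi; i++)
                x[i] = 0.5 * (x[i - 1] + x[i + 1]);
            /* Deliberately no barrier here: neighboring subdomains may
             * run a sweep ahead or behind, a race tolerated by design. */
        }
    }
    printf("x[N/2] = %f\n", x[N / 2]);
    return 0;
}
```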

Data Analysis and Visualization

Original contributors of this subsection are: Michael E. Papka (ANL), Pete Beckman (ANL), Mark Hereld (ANL), Rick Stevens (ANL), John Taylor (CSIRO, Australia)

According to Wikipedia, visualization is the transformation, selection, or representation of data from simulations or experiments, with an implicit or explicit geometric structure, to allow the exploration, analysis, and understanding of the data. As simulations become more complex and generate more data, visualization is gaining importance in scientific and high-performance computing. The authors accordingly believe the following are the possible alternative R&D options:

  • Develop new analysis and visualization algorithms to fit well within the new large and complex architectures of exascale computing.
  • Identify new mathematical and statistical approaches for data analysis
  • Develop integrated adaptive techniques to enable on-the-fly, learned-pattern performance optimization from fine to coarse grain.
  • Expand the role of supporting visualization environments to include more pro-active software: model and goal aware agents, estimated and fuzzy results, and advanced feature identification.
  • Invest in methods and tools that keep track of the processes and products of exploration and discovery. These will include aids to process navigation, hypothesis tracking, workflows, provenance tracking, and advanced collaboration and sharing tools.
  • Plan deployment of a global system of large-scale, high-resolution (100-megapixel) visualization and data analysis systems, based on an open source architecture, to link universities and research laboratories

Hence, the authors propose the following R&D strategy so that data analysis and visualization develop at the same pace as the new exascale systems:

  • Identification of features of interest in exabytes of data
  • Visualization of streams of exabytes of data from scientific instruments
  • Integrating simulation, analysis and visualization at the exascale

Scientific Data Management

Original contributor of this subsection is: Alok Choudhary (Northwestern U.)

Managing scientific data has been identified by the scientific community as one of the most important emerging needs because of the sheer volume and increasing complexity of data. Effectively generating, managing, and analyzing this information requires a comprehensive, end-to-end approach to data management that encompasses all stages from initial data acquisition to final analysis. Although file systems now scale to store huge numbers of files, the scientific computing community considers this insufficient. Another layer should be built on top of these file systems to provide easy tools to store, retrieve, and analyze the huge data sets generated by scientific analysis algorithms.

Accordingly, the author suggests considering these alternatives and R&D strategies to build effective scientific data management tools and frameworks:

  • Building new data analysis and mining tools for knowledge discovery from massive datasets produced and/or collected.
  • Designing scalable workflow tools with easy-to-use interfaces, which would be very important for exascale systems, both for the performance and productivity of scientists and for effective use of these systems.
  • Investigating new approaches to build database systems for scientific computing that scale in performance, usability, query processing, data modeling, and the ability to incorporate complex data types in scientific applications, and that eliminate the over-constraining usage models that impede scalability in traditional databases.
  • Designing new scalable data formats and high-level libraries without binding these efforts to backward compatibility (see the sketch after this list). Exascale systems will use different I/O architectures and a different processing paradigm; spending significant effort on backward compatibility would put these systems’ viability on the edge.
  • Developing new technology for efficient and scalable searching and filtering of large-scale scientific multivariate datasets with hundreds of searchable attributes, to deliver the most relevant data and results.
  • Supporting wide-area data access, which is becoming an increasingly important part of many scientific workflows. To interact seamlessly with wide-area storage systems, tools must be developed that span various data management techniques across the wide area, integrated with scalable I/O, workflow tools, and query and search techniques.
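To make the data-format bullet concrete, here is a minimal sketch of what today's high-level, self-describing scientific I/O looks like, using the HDF5 C API. The file and dataset names are illustrative, and HDF5 is my example of the kind of library the roadmap wants to rethink, not one the author singles out.

```c
/* Write a 100x100 dataset to a self-describing HDF5 file.
 * Compile with the HDF5 wrapper, e.g.: h5cc write_demo.c */
#include "hdf5.h"

int main(void) {
    hsize_t dims[2] = {100, 100};
    double data[100][100];
    for (int i = 0; i < 100; i++)
        for (int j = 0; j < 100; j++)
            data[i][j] = i + j * 0.01;

    hid_t file  = H5Fcreate("sim_output.h5", H5F_ACC_TRUNC,
                            H5P_DEFAULT, H5P_DEFAULT);
    hid_t space = H5Screate_simple(2, dims, NULL);
    hid_t dset  = H5Dcreate2(file, "/temperature", H5T_NATIVE_DOUBLE,
                             space, H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    H5Dwrite(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, data);

    H5Dclose(dset);
    H5Sclose(space);
    H5Fclose(file);
    return 0;
}
```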


This is the final post summarizing the fourth section of the IESP Roadmap document, which discusses some important crosscutting dimensions of the upcoming exascale systems: areas of concern for all users and engineers of these machines. This section focuses on (1) resilience, (2) power management, (3) performance optimization, and (4) programmability. Although these concerns are critical, it is very difficult to study them independently of the other components of exascale systems. I think they should be integral parts of each software layer in the next generation of HPC systems. They are already considered in current and past HPC systems, but only at a very limited scale; for example, power management is considered only at the OS layer, and performance optimization only at the application level.

Resilience

Original contributors of this subsection are: Franck Cappello (INRIA, FR), Al Geist (ORNL), Sudip Dosanjh (SNL), Marc Snir (UIUC), Bill Gropp (UIUC), Sanjay Kale (UIUC), Bill Kramer (NCSA), Satoshi Matsuoka (TITECH), David Skinner (NERSC)

Before summarizing this section, I looked up some of the authors and found an interesting white paper by the same authors (except Satoshi Matsuoka) discussing software resilience in more depth. I highly recommend reading it; its title is Toward Exascale Resilience.

The main upcoming challenge in building resilient systems for the era of exascale computing is the inapplicability of traditional checkpoint/restart techniques: checkpointing the state of millions of threads would consume considerable time, space, and energy (a minimal sketch of the traditional approach follows the list below). New resilience techniques are required to minimize these overheads. Given this general picture, the authors believe the following will be the main drivers of R&D in resilient exascale computing:

  • The increased number of components (hardware and software) will increase the likelihood of failures, even in short-running tasks.
  • Silent soft errors will become significant and raise the issues of result and end-to-end data correctness.
  • New storage and memory technologies, such as the SSD and phase change memory, will bring with them great opportunities for faster and more efficient state management and check-pointing.
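For reference, this is the traditional application-level checkpoint/restart pattern whose cost the authors argue will not scale. A minimal single-process sketch; the file name, state layout, and intervals are all illustrative:

```c
#include <stdio.h>

#define CKPT_FILE  "state.ckpt"
#define N_ITER     1000000L
#define CKPT_EVERY 1000

int main(void) {
    double state[1024] = {0};
    long start = 0;

    /* Restart: if a checkpoint exists, resume from it. */
    FILE *f = fopen(CKPT_FILE, "rb");
    if (f) {
        if (fread(&start, sizeof start, 1, f) != 1 ||
            fread(state, sizeof state, 1, f) != 1)
            start = 0;                 /* corrupt checkpoint: start over */
        fclose(f);
    }

    for (long i = start; i < N_ITER; i++) {
        state[i % 1024] += 1.0;        /* stand-in for real computation */

        /* Periodic checkpoint. At exascale, every one of millions of
         * threads would drain its state to stable storage here -- the
         * time/space/energy cost the authors want to avoid. */
        if (i % CKPT_EVERY == 0) {
            f = fopen(CKPT_FILE, "wb");
            if (f) {
                fwrite(&i, sizeof i, 1, f);
                fwrite(state, sizeof state, 1, f);
                fclose(f);
            }
        }
    }
    return 0;
}
```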

I recommend reading the authors’ original white paper to learn more about these challenges.

The authors also did a quick gap analysis to pinpoint in more detail the areas of fault tolerance that need rethinking. Among these points:

  • The most common programming model, MPI, does not offer a paradigm for resilient programming.
  • Most present applications and system software are neither fault tolerant nor fault aware, and are not designed to confine errors/faults.
  • Software layers are lacking communication and coordination to handle faults at different levels inside the system.
  • Deeper analysis of the root causes of different faults is mandatory to find efficient solutions.
  • Efficient verification of global results from long executions is missing as well.
  • Standardized metrics against which to measure and compare the resilience of different applications are missing so far.

The authors see many possibilities and much complexity on the path to resilient exascale systems. However, they conclude that research should focus on two main threads:

  • Extend the applicability of rollback recovery toward more local recovery.
  • Develop fault avoidance and fault-oblivious software to limit reliance on rollback recovery.

Power Management

Original contributors of this subsection are: John Shalf (LBNL), Satoshi Matsuoka (TITECH, JP)

Power management for exascale systems means keeping the best attainable performance with minimum power consumption. This includes allocating power to the system components actively involved in application or algorithm execution. According to the authors, existing power management infrastructure has been derived from consumer electronic devices and fundamentally never had large-scale systems in mind. A cross-cutting power management infrastructure is mandatory; its absence will force a reduction in the scale and feasibility of exascale systems. For large HPC systems, power is part of the total cost of ownership, and it will be a critical part of exascale systems management. Accordingly, the authors propose two alternative R&D strategies:

  • Power down components when they are underutilized. For example, the OS can reduce the frequency and operating voltage of a hardware component when it has not been used for a relatively long time (a hedged sketch follows this list).
  • Explicitly manage data movement, which simply means avoiding unnecessary data movement. This should reduce power consumption in networks, hard disks, memory, etc.
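As a concrete, if simplified, example of the first strategy, here is a hedged sketch of lowering cores' frequency through the Linux cpufreq sysfs interface. It assumes a Linux machine exposing /sys/devices/system/cpu/cpuN/cpufreq/ and sufficient privileges; the core numbers are illustrative.

```c
#include <stdio.h>

/* Switch one core to the "powersave" governor, letting the kernel
 * pin it near its lowest operating frequency and voltage. */
static int set_powersave(int cpu) {
    char path[128];
    snprintf(path, sizeof path,
             "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_governor", cpu);
    FILE *f = fopen(path, "w");
    if (!f) return -1;
    fputs("powersave", f);
    fclose(f);
    return 0;
}

int main(void) {
    /* Example policy: idle cores 2 and 3 while the solver runs on 0-1. */
    set_powersave(2);
    set_powersave(3);
    return 0;
}
```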

The authors suggest six main research areas for effective power management inside exascale systems:

  • OS-based power management. The authors believe two changes should be considered: (1) fair management of shared resources among hundreds or thousands of processors on the same machine, and (2) the ability to manage power levels for heterogeneous architectures inside the same machine, such as GPGPUs.
  • System-scale resource management. Standard interfaces need to be developed to allow millions of cores to work in complete synchrony to implement effective power management policies.
  • Algorithms. Power-aware algorithms are simply those that reduce the communication overhead per FLOP. Libraries should be considered to articulate the tradeoffs between communication, power, and FLOPs.
  • Libraries. According to the authors, library designers need to use their domain-specific knowledge of the algorithm to provide power management and policy hints to the power management infrastructure.
  • Compilers. Compilers should make it easier to program for power management by automatically instrumenting code for power management.
  • Applications. Applications should provide power-aware systems and libraries with hints about their power-related policies for the best power optimization.

Given these possible research areas spanning the whole software stack, the authors believe the following should be the key metrics for effectively managing the power consumption of exascale systems:

  • Performance. The ability to predict execution patterns inside applications would help reduce power consumption while attaining the best possible performance.
  • Programmability. Application developers are not expected to do power management explicitly inside their applications; coordination between all layers of the software stack should be possible for power management.
  • Composability. Power management components built by different teams should be able to work in harmony when it comes to power management.
  • Scalability, which requires the integration of power management information into system-wide power management policies.

Performance Optimization

Original contributors of this subsection are: Bernd Mohr (Juelich, DE), Adolfy Hoisie (LANL), Matthias Mueller (TU Dresden, DE), Wolfgang Nagel (Dresden, DE), David Skinner (LBL), Jeffrey Vetter (ORNL)

This is one of my favorite subsections. The expected increase in hardware and software stack complexity makes performance optimization a very complex task, and having millions or billions of threads working on the same problem requires different ways to measure and optimize performance.

The authors consider the following areas important for performance optimization on exascale systems: statistical profiling, automatic or automated analysis techniques, advanced filtering techniques, online monitoring, clustering and analysis, and data mining. They also believe that self-monitoring and self-tuning frameworks, middleware, and runtime schedulers, especially at the node level, are necessary. How a system’s performance is captured under power and reliability constraints needs to change radically, and aggregating and analyzing performance measurements on a running system may incur significant overhead if the new tools are not properly designed. Finally, the authors believe that the complexity of exascale systems puts performance optimization in many configurations beyond humans’ manual ability to monitor and optimize; they see auto-tuning as an important technique here (a toy sketch follows the list below). Hence, they believe research in performance optimization should be directed to these areas:

  • Support for modeling, measurement, and analysis of heterogeneous hardware systems.
  • Support for modeling, measurement and analysis of hybrid programming models (mixing MPI, PGAS, OpenMP and other threading models, accelerator interfaces).
  • Automated / automatic diagnosis / autotuning.
  • Reliable and accurate performance analysis in the presence of noise, system adaptation, and faults, which requires the inclusion of appropriate statistical descriptions.
  • Performance optimization for metrics other than time (e.g., power).
  • Programming models should be designed with performance analysis in mind. Software and runtime systems must expose their model of execution and adaptation, and its corresponding performance through a (standardized) control mechanism in the runtime system.
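To illustrate what auto-tuning means in its simplest form, here is a toy sketch in C that empirically picks the fastest block size for a strided sweep at run time: the same search-then-run pattern used by autotuned libraries such as ATLAS and FFTW (the candidate sizes and kernel here are made up for the example).

```c
#include <stdio.h>
#include <time.h>

#define N (1 << 22)
static double a[N];

/* Time one pass over the array with the given block size. */
static double run(int block) {
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int s = 0; s < N; s += block)
        for (int i = s; i < s + block && i < N; i++)
            a[i] = a[i] * 1.000001 + 0.5;
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
}

int main(void) {
    int candidates[] = {64, 256, 1024, 4096};
    int best = candidates[0];
    double best_t = 1e30;

    /* Search phase: measure each candidate on this machine. */
    for (unsigned k = 0; k < sizeof candidates / sizeof *candidates; k++) {
        double t = run(candidates[k]);
        if (t < best_t) { best_t = t; best = candidates[k]; }
    }
    printf("selected block size: %d (%.3f s)\n", best, best_t);

    run(best);   /* production phase uses the tuned parameter */
    return 0;
}
```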

Programmability

Original contributors of this subsection are: Thomas Sterling (LSU), Hiroshi Nakashima (Kyoto U., JP)

Programmability of exascale systems is another critical factor for their success. It is quite difficult to benchmark programmability and find a baseline against which to set and measure objectives in this area. However, the authors identified the following basic challenges to systems’ programmability:

  • Massive parallelism through millions or billions of concurrent collaborating threads.
  • The huge number of distributed resources and the difficulty of allocation and locality management.
  • Latency hiding by overlapping computations with communications (a minimal MPI sketch follows this list).
  • Hardware Idiosyncrasies. Different models will emerge with significant differences in ISA, memory hierarchy, etc.
  • Portability. Application programs must be portable across machine types, machine scales, and machine generations.
  • Synchronization bottlenecks when millions of threads try to synchronize control or data access.
  • Data structures representation and distribution.
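The latency-hiding bullet above maps directly onto nonblocking MPI: start the halo exchange, compute on data that does not depend on it, then wait. A minimal sketch; the buffer names, sizes, and ring topology are illustrative:

```c
#include <mpi.h>

#define N 1024

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double halo_out[N] = {0}, halo_in[N] = {0}, interior[N] = {0};
    int right = (rank + 1) % size, left = (rank + size - 1) % size;
    MPI_Request reqs[2];

    /* 1. Post the communication. */
    MPI_Irecv(halo_in,  N, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(halo_out, N, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]);

    /* 2. Overlap: do the work that does not need the halo. */
    for (int i = 0; i < N; i++)
        interior[i] = interior[i] * 0.5 + 1.0;

    /* 3. Only now wait, then do the boundary work that needs halo_in. */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

    MPI_Finalize();
    return 0;
}
```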

If you have read the other postings summarizing the rest of this document, you will realize how complicated programmability is. It cuts across all layers of the software stack, from the ISA and operating system up to the applications. Going through the authors’ suggested research agenda, I found that they recommend all the R&D directions proposed by the other authors for their corresponding stack layers/components. I recommend reading the other related postings to appreciate the challenges awaiting researchers in making exascale systems easier to program and utilize.



In this blog post I’m summarizing the two remaining subsections of the development environment: numerical libraries and debugging tools. In my last posting I summarized the first three subsections: programming models, frameworks, and compilers.

Numerical Libraries

Original contributors are: Jack Dongarra (U. of Tennessee), Bill Gropp (UIUC), Mike Heroux (SNL), Anne Trefethen (Oxford U., UK), Aad van der Steen (NCF, NL)

Numerical libraries are necessary to insulate HPC application developers from many architectural details. However, numerical libraries can be considered reusable applications; hence, their technology drivers are the common drivers mentioned in the programming models and systems areas. Specifically, numerical libraries are driven by: hybrid architectures, programming models, accuracy, fault detection, energy budget, memory hierarchy, and the relevant standards. The authors believe numerical libraries have three main R&D alternatives:

  • Message-passing libraries
  • Global address space libraries
  • Message-driven work queues

The authors believe that “the existing numerical libraries will need to be rewritten and extended in the light of the emerging architectural changes.” Therefore, the following should be the R&D agenda for the coming 10 years:

  • Develop hybrid and hierarchical libraries/applications.
  • Develop auto-tuning mechanisms inside these libraries to get the best possible performance.
  • Consider fault-oblivious and error-tolerant implementations.
  • Link precision and performance with energy savings.
  • Build algorithms that adapt to architectural properties with minimal or no changes.
  • Build algorithms that minimize communication inside these parallel libraries.
  • Enhance current shared memory models, as they will continue to be used in parallel programming for many years to come.


Debugging Tools

Original contributors of this subsection are: David Skinner (LBL), Wolfgang Nagel (Dresden, DE).

Massive parallelism inside exascale systems poses a big challenge for debugging tools. Debugging tools are built to help discover unintended behavior inside applications and assist in resolving it; however, inside applications running on thousands of cores with thousands of concurrent threads, this is a very difficult task. Debuggers may have to look into the operating system and the possible hardware interactions that can cause failures inside applications. The authors believe the following are the technology drivers for debugging tools:

  • Concurrency driven overhead in debugging
  • Scalability of debugger methodologies (data and interfaces)
  • Concurrency scaling of the frequency of external errors/failures
  • Heterogeneity and lightweight operating systems

The authors did not discuss R&D alternatives for debugging tools. However, they stressed two possible improvement areas: (1) include debugging tools and APIs at each layer that can emit the necessary debugging information to the upper layers, and (2) integrate debugging information from all layers and provide different debugging contexts to pinpoint the root causes of bugs. The authors then move to their suggested research agenda and propose these research areas for exascale debugging tools:

  • Finding ways to cluster or group threads for easier monitoring and root cause identification.
  • Interoperate debugging tools with fault-tolerance mechanisms. This should guarantee continuous debugging, which makes it easier to find a bug that is very hard to reproduce.
  • Vertical integration of debug and performance information across software layers.
  • Automatically triggered debugging. Instead of debugging test cases in a separate session, some exascale debugging must be delivered just-in-time as problems unfold (a hedged sketch follows this list).
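One way to read the last bullet: instead of reproducing a crash in a debugger later, the application dumps its context the moment the fault unfolds. A hedged single-process sketch using the glibc backtrace API (Linux-specific; real exascale tooling would aggregate this across millions of threads):

```c
#include <execinfo.h>
#include <signal.h>
#include <unistd.h>

static void on_fault(int sig) {
    (void)sig;
    void *frames[64];
    int n = backtrace(frames, 64);
    /* Dump the call stack to stderr without allocating memory. */
    backtrace_symbols_fd(frames, n, STDERR_FILENO);
    _exit(1);
}

int main(void) {
    signal(SIGSEGV, on_fault);   /* debugging triggered just-in-time */

    int *p = (int *)0;
    *p = 42;                     /* deliberate fault for demonstration */
    return 0;
}
```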


Next Time

In my next posting I will summarize the third section of this document, applications. It is an interesting section for end users since it discusses the challenges in algorithms, data analysis and visualization, and scientific data management for exascale computing. It offers a pragmatic perspective for potential exascale users.



I’m only summarizing here some sections from the roadmap document of the International Exascale Software Project. This project has on board well-known scientists and professionals shaping the next software stack for machines with millions of cores and exascale computing power. The document aggregates thoughts and ideas about how the HPC and multi-core community should remodel the software stack. Machines with millions of cores will be a reality in a few years if Moore’s law continues to hold (i.e., doubling the number of cores inside microprocessors every 18 months and selling them 50% cheaper). The HPC community should make serious efforts to be ready for such massively parallel machines. The authors believe this is best achieved by focusing our efforts on understanding the expected architectural developments and the applications that will likely run on these platforms. The roadmap document defines four main paths to the exascale software stack:

  1. Systems software, which includes operating systems, I/O systems, systems management, and external environments.
  2. Development environments, which include programming models, frameworks, compilers, numerical libraries, and debugging tools.
  3. Applications, which include algorithms, data analysis and visualization, and scientific data management.
  4. Crosscutting dimensions, which include important concerns such as power management, performance optimization, and programmability.

The suggested changes along these paths depend on some important trends in HPC platforms:

  1. Technology trends:
    1. Concurrency: processors’ concurrency will be very large and finer-grained.
    2. Reliability: the increased number of transistors and functional units will impose a reliability challenge.
    3. Power consumption: it will increase from about 10 megawatts today to close to 100 megawatts around 2020.
  2. Science trends. Many research and governmental agencies have specified upcoming scientific challenges that HPC can help with as we cross into the exascale era. The major research areas are: climate, high-energy physics, nuclear physics, fusion energy sciences, nuclear energy, biology, materials science and chemistry, and national nuclear security.
  3. Politico-economic trends. HPC is growing 11% every year, with the main demand coming from governments and large research institutions. It will continue this way, since many commercial entities depend on capacity rather than capability computing to expand their infrastructures.


I’m interested in quickly sharing with you some key points from each of these paths. I think this document is more visionary than Berkeley’s earlier document, The Landscape of Parallel Computing Research. My next posting will summarize the first path, systems software.



I’m talking here again about multi-core processors for massively parallel systems working on complex scientific applications, but tackling the area from a different perspective. I would like to think with you about how multi-core processors will look five years from now and consider these questions: What serious problems will these processors suffer from (from a system’s perspective)? Which of the current solutions or anticipated frameworks may help us solve these problems? I’ll be discussing only one of them here. It is very difficult to predict accurately what technology advancements will take place in the coming five years; however, there are general trends that we can track and whose future effects we can reasonably predict.

Anyway, I mentioned before that multi-core processors will have thousands of cores, maybe even tens of thousands (check the latest AMD GPGPU, the Radeon HD 5970). These cores will be very simple and will solve the same problem, but on different data chunks. This of course mandates the existence of shared resources and shared areas that contain input data and store results. The path from a core to data storage and other shared resources will grow more complex and involve more shared resources, such as deeper hierarchies of on-chip and off-chip caches, core interconnects, I/O buses, etc. The anticipated path and hierarchies through which data travels to reach the system’s main memory, or to reach a core’s registers, will have a very important effect on data movement latency. The problem is not only higher latency, which can be hidden by many verified techniques; it is the variance of this latency from one core to another and from one request to another inside the same core. Current software solutions, such as prefetching into multiple buffers, depend on the assumption that the latency to move data from memory is the same for all of a processor’s cores at run time. However, this is not true even in current multi-core processors. For example, inside the Cell Broadband Engine, the DMA (memory) latency differs from one core to another depending on its physical location inside the chip and how far it is from the memory controller. This variance will grow as these processors gain cores and as contention increases on the shared resources inside them. Such variance requires solutions that hide memory latency dynamically at run time inside each core, based on that core’s specific data latency (see the double-buffering sketch below for the static scheme this would replace).
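For contrast, here is the static scheme in question: schematic double buffering in C, where the next chunk is fetched while the current one is processed. The fetch_async/wait_fetch calls stand in for a DMA engine or prefetch API and are hypothetical; the point is that a fixed pipeline depth of two buffers is tuned for one assumed latency.

```c
#include <stddef.h>

#define CHUNK 4096

/* Hypothetical asynchronous-fetch API (e.g., a DMA engine). */
extern void fetch_async(double *dst, size_t chunk_idx);
extern void wait_fetch(double *dst);
extern void compute(const double *buf, size_t n);

void process(size_t nchunks) {
    static double buf[2][CHUNK];

    fetch_async(buf[0], 0);                        /* prime the pipeline */
    for (size_t c = 0; c < nchunks; c++) {
        if (c + 1 < nchunks)
            fetch_async(buf[(c + 1) & 1], c + 1);  /* overlap next fetch */
        wait_fetch(buf[c & 1]);
        /* If this core's fetch latency is higher than assumed, the wait
         * above stalls anyway -- the variance problem described in the
         * text that a fixed two-buffer depth cannot absorb. */
        compute(buf[c & 1], CHUNK);
    }
}
```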
Some hardware-based solutions have tried to address this problem through hyper- or multi-threading. Inside a multi-threaded multi-core processor, once a thread blocks on I/O or data movement, another thread becomes active and resumes execution. Sun Microsystems, through its latest UltraSPARC T2 and T2 Plus, added up to eight threads per core, which yields a large number of virtually concurrent threads on the same chip. However, there are two important drawbacks. First, if memory latency happens to be low, these threads will spend most of their time switching, which ultimately gives semi-serial performance because the threads share the same ALU and FP units; on the other hand, if memory latency is high for all working threads inside the same core, we may end up with idle time because all of them will be waiting for data to come from the system’s memory or an I/O device. Second, this solution adds complexity to the hardware and consumes die space that could be used for a bigger cache or even more single-threaded cores.

Nano-Kernels
OK, what would the solution be then? If we could create a dynamic threading framework that creates and manages threads much like hyper- or multi-threaded architectures do, we might be able to solve the data latency problem smartly for massively parallel multi-core processors. Since each core has its own data latency, why don’t we create small software threads that switch their context into the core’s local cache instead of out to the system’s main memory or second-level cache? The context in this case would be the core’s registers (much like current hardware-based multi-threaded architectures) plus a few control registers affecting the execution of the thread, such as the program counter. Whenever one of these small threads, let’s call them micro-threads, stalls waiting for a data chunk to be copied from the system’s main memory, it goes into sleep mode and another micro-thread is switched to running mode and resumes execution. A very small and very fast kernel, we might call it a nano-kernel, should actively run inside each core to schedule micro-threads and make sure that data movement latency is hidden almost completely inside each core. This idea of micro-threads has two advantages. First, the number of micro-threads is dynamic: it depends on the data movement latency. For example, under large data movement latency we may add more micro-threads per core to keep it busy while other micro-threads wait for their data to arrive in the core’s cache. Second, switching contexts inside each core’s cache makes switching a very cheap and very fast process, i.e., a few nanoseconds. Of course, context saving will consume part of each core’s cache, but hardware-based multi-threaded architectures already pay a comparable cost in silicon. This will also require specific facilities in the ISA; for example, manual cache management and an internal interrupting facility inside each core are mandatory for this idea to work.
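Here is a hedged sketch of that idea on a conventional machine, using POSIX ucontext: a tiny round-robin "nano-kernel" that switches cooperative micro-threads whenever one would stall on a data fetch. The structure and names are my illustration of the concept, not the author's design, and a real implementation would live in the core's local store rather than main memory.

```c
#include <stdio.h>
#include <ucontext.h>

#define NTHREADS 4
#define STACKSZ  16384

static ucontext_t sched_ctx, uthread[NTHREADS];
static char stacks[NTHREADS][STACKSZ];
static int current;

/* A micro-thread yields to the nano-kernel instead of blocking. */
static void await_data(void) {
    swapcontext(&uthread[current], &sched_ctx);
}

static void worker(void) {
    for (int step = 0; step < 3; step++) {
        printf("micro-thread %d: step %d\n", current, step);
        await_data();       /* would wait on a DMA; yield instead */
    }
}

int main(void) {
    for (int i = 0; i < NTHREADS; i++) {
        getcontext(&uthread[i]);
        uthread[i].uc_stack.ss_sp   = stacks[i];
        uthread[i].uc_stack.ss_size = STACKSZ;
        uthread[i].uc_link          = &sched_ctx;
        makecontext(&uthread[i], worker, 0);
    }
    /* Round-robin nano-kernel: resume each micro-thread in turn. */
    for (int round = 0; round < 3; round++)
        for (current = 0; current < NTHREADS; current++)
            swapcontext(&sched_ctx, &uthread[current]);
    return 0;
}
```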
So, if we create nano-kernels doing this optimization inside each core, we can reach new performance ceilings. The approach is scalable, since each core has its own nano-kernel working independently and scheduling micro-threads based on the resources given to that core. So even with tens of thousands of threads, this solution would still work and get the most out of the expected massively parallel multi-core processors.