Workshop on Interfaces for an Exascale OS
7/8 September 2016, Hebrew University of Jerusalem
The SPPEXA project "FFMK" organizes its second informal, invitation-only workshop on interfaces for an exascale operating system. The workshop is planned as an informal meeting of researchers from the systems community and HPC application developers to exchange ideas on interfaces and requirements for an exascale-capable operating system platform. The workshop aims at informing the invited researchers about FFMK's goals and achievements in the first 3.5 years of the project, as well as getting insights from them on APIs and facilities that an operating system and runtime should provide. Focus areas are communication, load management and fault-tolerance mechanisms.
The workshop is co-located with FFMK's internal project meeting at Hebrew University of Jerusalem, which will take place during the two days before the SPPEXA workshop.
FFMK Principal Investigators
- Hermann Härtig
- Technische Universität Dresden
- Alexander Reinefeld
- Zuse Institute Berlin
- Amnon Barak
- Hebrew University of Jerusalem
- Wolfgang E. Nagel
- ZIH, Technische Universität Dresden
Guest Speakers
- Pete Beckman
- Argonne National Lab
- Ron Brightwell
- Sandia Labs
- Balazs Gerofi
- RIKEN
- Torsten Hoefler
- ETH Zurich
- Ivo Kabadshow
- FZ Jülich
- Laxmikant V. Kalé
- University of Illinois at Urbana-Champaign
- Rolf Riesen
- Intel
Workshop Program
The workshop will start with presentations by the FFMK partners on the following topics:
- FFMK status and research agenda for the second phase of the project
- Overview of the L4 microkernel platform and execution model
- Corrected gossip algorithms for fast reliable broadcast, aiming at load balancing and system management
- Migration approaches and load balancing strategies
- Maximizing resource utilization for managed checkpoint activities
The FFMK introduction will be followed by one and a half days of invited talks and open discussions.
- Argo (Pete Beckman, Argonne National Lab)
- Embracing Diversity: OS Support for Integrating High-Performance Computing and Data Analytics (Ron Brightwell, Sandia Labs)
- An Overview of the IHK/McKernel Lightweight Multi-kernel for Extreme Scale HPC (Balazs Gerofi, RIKEN)
- Scheduling-Aware Routing for Supercomputers (Torsten Hoefler, ETH Zurich)
- Towards a task-based Fast Multipole Method (Ivo Kabadshow, FZ Jülich)
- Charm++: Adaptive Runtime Systems at Exascale (Laxmikant V. Kalé, University of Illinois at Urbana-Champaign)
- Outdated Linux/POSIX APIs pose a threat to modern lightweight kernels (Rolf Riesen, Intel)
The Argo project built a collection of OS and runtime components for dynamic, extreme-scale systems: a set of node-OS components built atop Linux, a backplane for out-of-band communication across the system, a lightweight thread layer for massive parallelism, and a global optimization layer for power and performance.
It is unlikely that one operating system or a single software stack will support the emerging and future needs of high-performance computing and high-performance data analytics applications. There are many technical and non-technical reasons why functional partitioning through customized software stacks will persist. Rather than pursuing approaches that constrain the ability to provide a system software environment satisfying a diverse and competing set of requirements, we should pursue methods and interfaces that enable the use and integration of multiple software stacks. This talk will describe the challenges that motivate the need to support multiple concurrent software stacks for enabling application composition, more complex application workflows, and a potentially richer set of usage models for extreme-scale high-performance computing systems. The Hobbes project, led by Sandia National Laboratories, has been exploring operating system infrastructure for supporting multiple concurrent software stacks. This talk will describe this infrastructure and relevant interfaces, and highlight issues that motivate future exploration.
RIKEN Advanced Institute for Computational Science has been appointed by the Japanese government as the main organization leading the development of Japan's next-generation flagship supercomputer, the successor of the K Computer. Part of this effort is to design and develop a system software stack that suits the needs of future extreme-scale computing. In this talk, we first provide a brief overview of RIKEN's system software stack effort, covering various topics including operating systems, I/O and networking. We then narrow the focus to OS research and describe IHK/McKernel, our hybrid operating system framework. IHK/McKernel runs Linux side-by-side with a lightweight kernel on compute nodes, with the primary motivation of providing scalable, consistent performance for large-scale HPC simulations while at the same time retaining a fully Linux-compatible execution environment. We detail the organization of the stack, suggest OS APIs we envision based on runtime and application requirements targeting the post-K machine, and provide preliminary performance results.
The interconnection network has a large influence on total cost, application performance, energy consumption, and overall system efficiency of a supercomputer. Unfortunately, today's routing algorithms do not use this important resource most efficiently. We first demonstrate this by defining the dark fiber metric as a measure of unused resources in networks. To improve utilization, we propose scheduling-aware routing, a new technique that uses the current state of the batch system to determine a new set of network routes and thus increases overall system utilization by up to 17.74%. We also show that our proposed routing increases the throughput of communication benchmarks by up to 17.6% on a practical InfiniBand installation. Our routing method is implemented in the standard InfiniBand tool set and can immediately be used to optimize systems. In fact, we have been using it to improve the utilization of our production petascale supercomputer for more than a year.
Abstract TBA
In Charm++, we are exploring the idea that overdecomposition and migratability provide the necessary ingredients for highly powerful adaptive runtime systems. The programmer decomposes the computation and data into a relatively large number of objects that are assigned to physical resources by the runtime system (RTS). These objects can be migrated to other processors at runtime under the control of the RTS. We have demonstrated how these features can be used to provide dynamic load balancing. More pertinently, they can be used to tolerate faults and to optimize for power/energy/temperature, typically considered the domains of operating systems. I will describe our recent research in extending these capabilities to exascale and to emerging HPC applications. Further, I will describe a whole-machine scheduling/runtime system that can optimize desired metrics for a mix of jobs running on a supercomputer. I will also describe our recent experience in porting Charm++ to Argobots, and my reflections on the relationship and boundaries between OS and runtime.
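Below is a minimal sketch, not code from the talk, of what this overdecomposition/migratability model looks like in Charm++ source: a 1D array of chare objects whose placement is chosen by the RTS, with a PUP (pack/unpack) routine that makes each object migratable. The module, class, and method names (worker, Main, Worker, step) are illustrative assumptions.

```cpp
// worker.ci (Charm++ interface file, compiled with charmc):
//   mainmodule worker {
//     readonly CProxy_Main mainProxy;
//     mainchare Main {
//       entry Main(CkArgMsg *m);
//       entry void done();
//     };
//     array [1D] Worker {
//       entry Worker();
//       entry void step();
//     };
//   };

#include "worker.decl.h"

/* readonly */ CProxy_Main mainProxy;

class Main : public CBase_Main {
  int pending;                        // number of workers still running
public:
  Main(CkArgMsg *msg) : pending(8) {
    delete msg;
    mainProxy = thisProxy;
    CProxy_Worker workers = CProxy_Worker::ckNew(pending);  // RTS chooses placement
    workers.step();                   // broadcast an entry-method invocation
  }
  void done() { if (--pending == 0) CkExit(); }
};

class Worker : public CBase_Worker {
  double state;                       // element-local data that must survive migration
public:
  Worker() : state(0.0) {}
  Worker(CkMigrateMessage *) {}       // constructor used when an element arrives after migration
  void pup(PUP::er &p) {              // serializer: lets the RTS move this object between nodes
    CBase_Worker::pup(p);
    p | state;
  }
  void step() {
    state += thisIndex;               // stand-in for real work
    mainProxy.done();
  }
};

#include "worker.def.h"
```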
Lightweight kernels have allowed applications to scale and perform well on the largest computing systems in the world. Recently, efforts have been undertaken to make these kernels more Linux-compatible. This has the benefits of making these systems more familiar and easier to use, improving tool compatibility, and allowing easier integration into modern workflows. There are also some drawbacks that threaten the scalability and performance of lightweight kernels.
Many Linux and POSIX APIs are a poor match for the requirements of high-end HPC. In this talk we look at why this is, show examples of mismatches, explain how it impacts the design and implementation of mOS, and look at what can be done to improve the situation in the future.
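One commonly cited class of mismatch, offered here as an illustrative assumption rather than an example from the talk, is Linux's lazy memory-management semantics: anonymous mappings are backed by physical pages only when first touched, so page faults and the resulting jitter occur during the compute phase, whereas lightweight kernels typically map memory up front. On Linux the eager behaviour has to be requested explicitly, e.g. with MAP_POPULATE:

```cpp
#include <sys/mman.h>
#include <cstdio>
#include <cstdlib>

int main() {
  const size_t len = size_t(1) << 30;  // 1 GiB of anonymous memory

  // Default Linux behaviour: pages are allocated lazily on first touch, so the
  // cost shows up as page faults (and jitter) while the application computes.
  void* lazy = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

  // What a lightweight kernel typically does implicitly: back the whole region
  // with physical memory at allocation time. On Linux this must be requested.
  void* eager = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);

  if (lazy == MAP_FAILED || eager == MAP_FAILED) {
    perror("mmap");
    return EXIT_FAILURE;
  }
  munmap(lazy, len);
  munmap(eager, len);
  return 0;
}
```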
Previous workshops
Workshop on Interfaces for an Exascale OS
8/9 December 2014, TU Dresden
The SPPEXA project "FFMK" organizes an informal, invitation-only workshop on interfaces for an exascale operating system. The workshop is planned as an informal meeting of researchers from the systems community and HPC application developers to exchange ideas on interfaces and requirements for an exascale-capable operating system platform. The workshop aims at informing the invited researchers about FFMK's goals and achievements in the first two years, as well as getting insights from them on APIs and facilities that an operating system and runtime should provide. Focus areas are communication, load management and fault-tolerance mechanisms.
The workshop is to be co-located with FFMK's internal project meeting at TU Dresden, which will take place during the three days after the planned workshop.
FFMK Principal Investigators
- Hermann Härtig
- Technische Universität Dresden
- Alexander Reinefeld
- Zuse Institute Berlin
- Amnon Barak
- Hebrew University of Jerusalem
- Wolfgang E. Nagel
- ZIH, Technische Universität Dresden
Guest Speakers
- Cyril Bordage
- University of Illinois
- Ron Brightwell
- Sandia Labs
- Michael Bussmann
- Helmholtz-Forschungszentrum Rossendorf
- Torsten Hoefler
- ETH Zurich
- Denis Hünich
- ZIH, TU Dresden
- Ivo Kabadshow
- FZ Jülich
- Frank Mueller
- North Carolina State University
- Vijay Saraswat
- IBM
- Gerhard Wellein
- University of Erlangen-Nuremberg
- Karsten Schwan
- Georgia Tech
Workshop Program
The workshop will start with presentations by the FFMK partners on the following topics:
- FFMK vision, general architecture and major design challenges
- Overview of the L4 microkernel platform
- MPI runtime and InfiniBand driver support
- Scalable and fault-tolerant gossip algorithms for load balancing and health monitoring
- Checkpoint store based on XtreemFS
The FFMK introduction will be followed by one and a half days of invited talks and open discussions.
- Cyril Bordage, University of Illinois
- Ron Brightwell, Sandia Labs
- Michael Bussmann, Helmholtz-Forschungszentrum Rossendorf
- Torsten Hoefler, ETH Zurich
- Denis Hünich, ZIH, TU Dresden
- Ivo Kabadshow, FZ Jülich
- Frank Mueller, North Carolina State University
- Vijay Saraswat, IBM
- Gerhard Wellein, University of Erlangen-Nuremberg
- Karsten Schwan, Georgia Tech
Workshop on System Software for Exascale Computing
11-13 December 2013, Hebrew University of Jerusalem
The SPPEXA project "FFMK" organizes an informal, invitation-only workshop on system software for exascale computing. The workshop brings researchers from the systems community as well as application developers together in one place, allowing them to exchange ideas on how to build an operating system platform for exascale machines. The workshop aims at informing the invited researchers about FFMK's goals and achievements in the first year, as well as getting insights from them about system software for exascale computing.
The workshop is co-located with FFMK's internal project meeting at Hebrew University of Jerusalem, which will take place during the two days before the SPPEXA workshop.
FFMK Principal Investigators
- Hermann Härtig
- Technische Universität Dresden
- Alexander Reinefeld
- Zuse Institute Berlin
- Amnon Barak
- Hebrew University of Jerusalem
- Wolfgang E. Nagel
- ZIH, Technische Universität Dresden
Guest Speakers
- Pete Beckman
- Argonne National Lab
- Marius Hillenbrand
- Karlsruhe Institute of Technology
- Torsten Hoefler
- ETH Zurich
- Laxmikant Kale
- University of Illinois
- Frank Mueller
- North Carolina State University
- Michael Kagan
- Mellanox Technologies
- Joost VandeVondele
- ETH Zurich
- Qingbo Wu
- National University of Defense Technology, China
Workshop Program
The workshop will start with presentations by the FFMK partners on the following topics:
- FFMK vision, general architecture and major design challenges
- Overview of the L4 microkernel platform
- Porting an MPI runtime to the FFMK architecture and integrating InfiniBand driver support
- Scalable and fault-tolerant gossip algorithms for load balancing and health monitoring
- Adapting XtreemFS for the FFMK architecture
- Application characteristics
The FFMK introduction will be followed by one and a half days of invited talks and open discussions.
- Preliminary title: Argo OS (Pete Beckman, Argonne National Lab)
- FusedOS: HPC and Commodity Workloads on Exascale Systems (Marius Hillenbrand, Karlsruhe Institute of Technology)
- Fault Tolerance for Exascale Computing (Frank Mueller, North Carolina State University)
- Preliminary title: Tianhe-2 (Qingbo Wu, National University of Defense Technology, China)
- On process orders, hardware performance models, and close-to-optimal communications (Torsten Hoefler, ETH Zurich)
- Preliminary title: Exascale Networks (Michael Kagan, Mellanox Technologies)
- Charm++: Overdecomposition enables powerful runtime optimization (Laxmikant Kale, University of Illinois)
- CP2K: Electrons at the Petascale (Joost VandeVondele, ETH Zurich)
Abstract TBA
FusedOS is a research operating system for Exascale systems. Its main design goals are: (1) leveraging the specific advantages of supercomputer platforms, and (2) providing compatibility with commodity OS environments (e.g., Linux) at the same time.
Operating systems for high-performance computing have traditionally fallen into two categories: minimalistic, custom OSes designed and built from scratch (so-called lightweight kernels, LWK), or commodity OSes modified for use in HPC and customized to the underlying hardware platform (so-called full-weight kernels, FWK). We believe that neither approach is a viable option for fulfilling FusedOS' goals. Instead, we are convinced that running future workloads on efficient Exascale platforms will demand a combination of both designs.
I will present how we combine both an FWK and an LWK approach in FusedOS. I will discuss our experiences with running HPC applications and commodity workloads in FusedOS. Further, I will show how the design of FusedOS allows HPC and commodity workloads to interact, enabling new paradigms and tools in an HPC environment.
Our prototype runs on IBM Blue Gene/Q supercomputers and supports both Linux workloads and production HPC applications written for the production LWK on Blue Gene/Q. We have recently released this prototype as open source. In the final part of my talk, I will show how our prototype turns a Blue Gene/Q partition into an environment that closely resembles a regular Linux cluster.
Exascale computing is projected to feature billion-core parallelism. At such large processor counts, faults will become more commonplace. Current techniques to tolerate faults focus on reactive recovery schemes and generally rely on a simple checkpoint/restart mechanism. Yet they have a number of shortcomings. (1) They do not scale and require complete job restarts. (2) Projections indicate that the mean time between failures is approaching the overhead required for checkpointing. (3) Existing approaches are application-centric, which increases the burden on application programmers and reduces portability.
To address these problems, we discuss a number of techniques and their level of maturity (or lack thereof). These include (a) scalable network overlays, (b) on-the-fly process recovery, (c) proactive process-level fault tolerance, (d) redundant execution, (e) the effect of SDCs on IEEE floating-point arithmetic, and (f) resilience modeling. In combination, these methods aim to pave the path to exascale computing.
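For illustration, a hypothetical sketch of the "application-centric" checkpoint/restart pattern criticized above: every MPI rank periodically serializes its own state to a per-rank file, and the programmer must decide what to save, when, and where, for each application anew. None of this code is from the talk.

```cpp
#include <mpi.h>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical coordinated, application-level checkpoint: each rank writes its
// own state; the barrier keeps the per-rank checkpoints mutually consistent.
static void write_checkpoint(int rank, int step, const std::vector<double>& state) {
  std::ofstream out("ckpt_rank" + std::to_string(rank) + ".bin", std::ios::binary);
  out.write(reinterpret_cast<const char*>(&step), sizeof(step));
  out.write(reinterpret_cast<const char*>(state.data()),
            static_cast<std::streamsize>(state.size() * sizeof(double)));
}

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  std::vector<double> state(1 << 20, 0.0);   // this rank's share of the simulation data
  for (int step = 0; step < 100; ++step) {
    // ... compute on `state` ...
    if (step % 10 == 0) {
      MPI_Barrier(MPI_COMM_WORLD);           // coordinate: all ranks checkpoint together
      write_checkpoint(rank, step, state);   // the application decides what/when/where
    }
  }

  MPI_Finalize();
  return 0;
}
```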
Abstract TBA
In this talk, we will discuss the influence of process ordering on parallel computers. We will demonstrate a general approach for process reordering and topology-aware communication optimization. From those general communication and mapping models, we will then deep-dive into the Xeon Phi architecture. We will show how to design a performance model for the cache coherence protocol and how to utilize it to design close-to-optimal communication algorithms for Intel's Xeon Phi. Our techniques for topology mapping and communication optimization are an important base for parallel operating system and runtime design.
Abstract TBA
The upcoming move towards the exascale era is characterized by single-digit-nanometer feature sizes and concomitant process variations, and by the rising importance of thermal/power/energy and failure issues. We posit over-decomposition and migratability as key ideas that will empower runtime systems to handle the challenges posed by these issues, especially for science and engineering applications. We have been exploring these concepts in the context of the Charm++ parallel programming system, driven originally by the needs of dynamic applications. I will describe our research on controlling chip temperature, constraining power, and minimizing energy in various contexts. We argue that there is a need for adaptive control at the level of a single job, as well as at the level of the entire parallel machine running multiple jobs, and show how overdecomposition and migratability give us the right tools for facilitating a rich dialogue between these two levels of operation. "Persistence" is another property of CSE applications, especially once they are expressed using an over-decomposed parallel programming model. I will describe the prevalence and utility of persistence, which I view as one of our few "friends" in the otherwise hostile landscape of exascale computing.
Electrons play a crucial role in chemistry, materials science, and physics. The CP2K code enables atomistic simulation including the electronic structure and aims to excel on massively parallel hardware with innovative algorithms. A rapidly growing user base is prominently present on some of Europe's largest computers. However, the challenges posed by rapidly changing computer hardware should not be underestimated. For current hybrid supercomputer architectures, this will be illustrated with an account of the development of a GPU-accelerated sparse matrix library for linear-scaling density functional theory.