FFMK: A Fast and Fault-tolerant Microkernel-based System for Exascale Computing

Project Description

This project addresses three key scalability obstacles of future Exascale systems: vulnerability to system failures caused by transient or permanent errors, performance losses caused by imbalances, and noise caused by unpredictable interactions between HPC applications and the operating system. We address these obstacles by designing, implementing, and evaluating a prototype system that integrates three well-proven technologies.

The resulting system will be a fluid, self-organizing platform for applications that need to scale up to Exascale performance. An important part of the project will be adapting suitable HPC workloads to showcase the new platform. Demonstrating such applications on a prototype implementation is the project's primary objective.

Architecture of the software running on a multi-core node

System Architecture

Architecture of a high-performance distributed checkpointing system

Split XtreemFS Architecture

Dynamic Platform Management

Fault Tolerance

Hardware Assumptions