Failing less or failing better

George Bosilca
Innovative Computing Laboratory
University of Tennessee, Knoxville, USA


Abstract:

The understanding of fault tolerance in parallel computing has evolved in many ways. On the industry side, the need for resilience has been widely recognized: mechanisms to deal with faults are widely available and deeply embedded in all programming paradigms, and users benefit from them more or less automatically. On the academic side, and especially in the High Performance Computing world, resilience is not yet considered a problem, and no viable long-term solution has been adopted by the community or by any mainstream parallel programming paradigm.

In this talk I will make a case for the usefulness of constructs that facilitate the mitigation of faults, show how they can be integrated into any distributed programming paradigm, and assess their impact on several scientific applications. Extensions to MPI will be used as an example; they have proven capable of delivering at least an order of magnitude better performance than non-specialized solutions, by giving MPI applications the capability to detect and deal with faults and by empowering them to design their own resilience model.
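
As a rough illustration of what such constructs can look like, the sketch below shows one possible fault-handling path in an MPI application. It assumes the ULFM-style extensions shipped with Open MPI (the mpi-ext.h header, MPIX_Comm_revoke, MPIX_Comm_shrink, and the MPIX_ERR_PROC_FAILED / MPIX_ERR_REVOKED error classes), which are not named in this abstract and are not necessarily the exact interface discussed in the talk.

/* Sketch only: detect a peer failure, propagate the knowledge, and rebuild
 * a working communicator from the survivors. Assumes Open MPI's ULFM
 * extensions (mpi-ext.h, MPIX_* symbols). */
#include <mpi.h>
#include <mpi-ext.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Return error codes to the application instead of aborting the job. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    int rank, value, rc, error_class;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Any communication may now report the failure of a peer process. */
    value = rank;
    rc = MPI_Allreduce(MPI_IN_PLACE, &value, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    MPI_Error_class(rc, &error_class);

    if (MPIX_ERR_PROC_FAILED == error_class || MPIX_ERR_REVOKED == error_class) {
        /* Make sure every surviving rank learns about the failure ... */
        MPIX_Comm_revoke(MPI_COMM_WORLD);
        /* ... then build a new communicator containing only the survivors. */
        MPI_Comm survivors;
        MPIX_Comm_shrink(MPI_COMM_WORLD, &survivors);
        /* From here the application applies its own resilience model:
         * continue on the shrunken communicator, respawn replacements,
         * or roll back to a checkpoint. */
        MPI_Comm_rank(survivors, &rank);
        printf("rank %d continuing on the shrunken communicator\n", rank);
        MPI_Comm_free(&survivors);
    }

    MPI_Finalize();
    return 0;
}

The point of the sketch is that the failure is reported to the application, which then chooses its own recovery strategy, rather than the runtime imposing a single global behavior such as aborting the job.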