Fault-Tolerance Techniques for High-Performance Computing

โœ Scribed by Thomas Herault, Yves Robert (eds.)


Publisher: Springer International Publishing
Year: 2015
Tongue: English
Leaves: 325
Series: Computer Communications and Networks
Edition: 1
Category: Library

✦ Synopsis


This timely text presents a comprehensive overview of fault-tolerance techniques for high-performance computing (HPC). It opens with a detailed introduction to checkpoint protocols and scheduling algorithms, prediction, replication, and silent error detection and correction, together with application-specific techniques such as algorithm-based fault tolerance (ABFT). Emphasis is placed on analytical performance models. A review of general-purpose techniques follows, including several checkpoint and rollback recovery protocols. Relevant execution scenarios are also evaluated and compared through quantitative models. Features: provides a survey of resilience methods and performance models; examines the sources of errors and faults in large-scale systems; reviews the spectrum of techniques that can be applied to design a fault-tolerant MPI; investigates different approaches to replication; discusses the energy consumption of fault-tolerance methods in extreme-scale systems.

✦ Table of Contents


Front Matter....Pages i-ix
Front Matter....Pages 1-1
Fault Tolerance Techniques for High-Performance Computing....Pages 3-85
Front Matter....Pages 87-87
Errors and Faults....Pages 89-144
Fault-Tolerant MPI....Pages 145-228
Using Replication for Resilience on Exascale Systems....Pages 229-278
Energy-Aware Checkpointing Strategies....Pages 279-317
Back Matter....Pages 319-320

✦ Subjects


System Performance and Evaluation; Performance and Reliability; Numeric Computing


📜 SIMILAR VOLUMES


Fault-Tolerance Techniques for High-Perf
โœ Thomas Herault, Yves Robert (eds.) ๐Ÿ“‚ Library ๐Ÿ“… 2015 ๐Ÿ› Springer International Publishing ๐ŸŒ English

<p>This timely text presents a comprehensive overview of fault tolerance techniques for high-performance computing (HPC). The text opens with a detailed introduction to the concepts of checkpoint protocols and scheduling algorithms, prediction, replication, silent error detection and correction, tog

Fault-Tolerance Techniques for Spacecraf
โœ Mengfei Yang, Gengxin Hua, Yanjun Feng, Jian Gong ๐Ÿ“‚ Library ๐Ÿ“… 2017 ๐Ÿ› John Wiley & Sons ๐ŸŒ English

<i><b>Comprehensive coverage of all aspects of space application oriented fault tolerance techniques </b></i><br /><br />โ€ข Experienced expert author working on fault tolerance for Chinese space program for almost three decades<br />โ€ข Initiatively provides a systematic texts for the cutting-edge faul

Techniques for Optimizing Applications:
โœ Garg R.P. ๐Ÿ“‚ Library ๐ŸŒ English

Prentice Hall, 2001 โ€” 672 p.<br/>This book is a practical guide to performance optimization of computationally intensive programs on Sun UltraSPARC platforms. It is primarily intended for developers of technical or high performance computing (HPC) applications for the Solaris(tm) operating environme

Coupled Data Communication Techniques fo
โœ Dr. Ron Ho, Dr. Robert Drost (auth.), Ron Ho, Robert Drost (eds.) ๐Ÿ“‚ Library ๐Ÿ“… 2010 ๐Ÿ› Springer US ๐ŸŒ English

<p>Designers of next-generation high-performance computer systems face a host of technical challenges. For the past several decades, rising clock frequencies and increased chip integration have fueled the growth of computer performance. Now these trends have slowed: power and complexity constrains f