[ACM Press the 20th international symposium - San Jose, California, USA (2011.06.08-2011.06.11)] Proceedings of the 20th international symposium on High performance distributed computing - HPDC '11 - Algorithm-based recovery for iterative methods without checkpointing
Scribed by Chen, Zizhong
- Book ID: 120594987
- Publisher: ACM Press
- Year: 2011
- Weight: 779 KB
- Category: Article
- ISBN: 1450305520
Synopsis
In today's high-performance computing practice, fail-stop failures are often tolerated by checkpointing. While checkpointing is a very general technique that applies to a wide range of applications, it often introduces considerable overhead, especially when computations reach petascale and beyond. In this paper, we show that, for many iterative methods, if the parallel data partitioning scheme satisfies certain conditions, the iterative methods themselves maintain enough inherent redundant information to accurately recover the lost data without checkpointing. We analyze the block row data partitioning scheme for sparse matrices and derive a sufficient condition for recovering the critical data without checkpointing. When this sufficient condition is satisfied, neither checkpointing nor rollback is necessary for recovery. Furthermore, the fault tolerance time overhead is zero if no failures occur during program execution; overhead is introduced only when an actual failure occurs. Experimental results demonstrate that, when it is applicable, the proposed scheme introduces much less overhead than checkpointing on Kraken, currently the world's eighth-fastest supercomputer.
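The synopsis does not state the sufficient condition itself. As a rough illustration of the kind of inherent redundancy it refers to, the Python sketch below assumes one plausible reading: in a block row partitioned sparse matrix-vector product y = A x, each process receives copies of the x entries that its off-block nonzeros touch, so a failed process's part of x might be rebuilt if every entry it owned is held by some survivor. The helper names (`block_rows`, `remote_copies`, `recoverable`) and the toy tridiagonal matrix are hypothetical and are not taken from the paper.

```python
# Illustrative sketch only: names, matrix, and the condition checked here are
# assumptions for exposition, not the paper's actual algorithm or condition.

def block_rows(n, nprocs):
    """Split row indices 0..n-1 into contiguous blocks, one per process."""
    base, extra = divmod(n, nprocs)
    blocks, start = [], 0
    for p in range(nprocs):
        size = base + (1 if p < extra else 0)
        blocks.append(range(start, start + size))
        start += size
    return blocks

def remote_copies(A, blocks):
    """For each process, collect the x indices it must receive from other
    owners to compute its rows of y = A x.
    A is a sparse matrix stored as {row: {col: value}}."""
    held = []
    for rows in blocks:
        needed = set()
        for i in rows:
            for j in A.get(i, {}):
                if j not in rows:  # column owned by another process
                    needed.add(j)
        held.append(needed)
    return held

def recoverable(failed, blocks, held):
    """True if every x entry owned by `failed` is held by a surviving process."""
    surviving = set()
    for q, cols in enumerate(held):
        if q != failed:
            surviving |= cols
    return all(j in surviving for j in blocks[failed])

# Toy example: 6x6 tridiagonal matrix split across 3 processes.
n = 6
A = {i: {j: 1.0 for j in (i - 1, i, i + 1) if 0 <= j < n} for i in range(n)}
blocks = block_rows(n, 3)
held = remote_copies(A, blocks)
status = [recoverable(p, blocks, held) for p in range(3)]
```

In this toy case only the middle block passes the check: the first and last processes each own a boundary entry of x that no other process ever receives. This echoes the synopsis's caveat that the scheme helps only when the partitioning satisfies the condition.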
SIMILAR VOLUMES
We consider the problem of self-healing in reconfigurable networks (e.g., peer-to-peer and wireless mesh networks) that are under repeated attack by an omniscient adversary and propose a fully distributed algorithm, Xheal, that maintains good expansion and spectral properties of the network, also ke