✦ LIBER ✦

[ACM Press the 20th international symposium - San Jose, California, USA (2011.06.08-2011.06.11)] Proceedings of the 20th international symposium on High performance distributed computing - HPDC '11 - Algorithm-based recovery for iterative methods without checkpointing

✍ Scribed by Chen, Zizhong

Book ID: 120594987
Publisher: ACM Press
Year: 2011
Weight: 779 KB
Category: Article
ISBN: 1450305520
DOI: 10.1145/1996130.1996142

✦ Synopsis

In today's high performance computing practice, fail-stop failures are often tolerated by checkpointing. While checkpointing is a very general technique that can be applied to a wide range of applications, it often introduces considerable overhead, especially when computations reach petascale and beyond. In this paper, we show that, for many iterative methods, if the parallel data partitioning scheme satisfies certain conditions, the iterative methods themselves will maintain enough inherent redundant information for the accurate recovery of the lost data without checkpointing. We analyze the block row data partitioning scheme for sparse matrices and derive a sufficient condition for recovering the critical data without checkpointing. When this sufficient condition is satisfied, neither checkpointing nor rollback is necessary for recovery. Furthermore, the fault tolerance time overhead is zero if no failure occurs during program execution; overhead is introduced only when an actual failure occurs. Experimental results demonstrate that, when it works, the proposed scheme introduces much less overhead than checkpointing on Kraken, currently the world's eighth-fastest supercomputer.

📜 SIMILAR VOLUMES

[ACM Press the 20th international sympos

✍ Al-Kiswany, Samer; Subhraveti, Dinesh; Sarkar, Prasenjit; Ripeanu, Matei 📂 Article 📅 2011 🏛 ACM Press ⚖ 620 KB

[ACM Press the 20th international sympos

✍ Deshpande, Umesh; Wang, Xiaoshuang; Gopalan, Kartik 📂 Article 📅 2011 🏛 ACM Press ⚖ 742 KB

[ACM Press the 20th international symposium - San Jose, California, USA (2011.06.08-2011.06.11)] Proceedings of the 20th international symposium on High performance distributed computing - HPDC '11 - Supporting GPU sharing in cloud environments with a transparent runtime consolidation framework

✍ Ravi, Vignesh T.; Becchi, Michela; Agrawal, Gagan; Chakradhar, Srimat 📂 Article 📅 2011 🏛 ACM Press ⚖ 854 KB

[ACM Press the 30th annual ACM SIGACT-SIGOPS symposium - San Jose, California, USA (2011.06.06-2011.06.08)] Proceedings of the 30th annual ACM SIGACT-SIGOPS symposium on Principles of distributed computing - PODC '11 - Xheal

✍ Pandurangan, Gopal; Trehan, Amitabh 📂 Article 📅 2011 🏛 ACM Press 🌐 English ⚖ 536 KB

We consider the problem of self-healing in reconfigurable networks (e.g., peer-to-peer and wireless mesh networks) that are under repeated attack by an omniscient adversary and propose a fully distributed algorithm, Xheal, that maintains good expansion and spectral properties of the network, also ke