𝔖 Bobbio Scriptorium
✦   LIBER   ✦

[ACM Press 2011 International Conference for High Performance Computing, Networking, Storage and Analysis - Seattle, Washington (2011.11.12-2011.11.18)] Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis on - SC '11 - FTI

โœ Scribed by Bautista-Gomez, Leonardo; Tsuboi, Seiji; Komatitsch, Dimitri; Cappello, Franck; Maruyama, Naoya; Matsuoka, Satoshi


Book ID
121278208
Publisher
ACM Press
Year
2011
Tongue
English
Weight
523 KB
Category
Article
ISBN
145030771X


✦ Synopsis


Large scientific applications deployed on current petascale systems expend a significant amount of their execution time dumping checkpoint files to remote storage. New fault-tolerance techniques will be critical to efficiently exploit post-petascale systems. In this work, we propose a low-overhead, high-frequency multi-level checkpoint technique in which we integrate a highly reliable, topology-aware Reed-Solomon encoding into a three-level checkpoint scheme. We efficiently hide the encoding time using one dedicated fault-tolerance thread per node. We implement our technique in the Fault Tolerance Interface (FTI). We evaluate the correctness of our performance model and conduct a study of the reliability of our library. To demonstrate the performance of FTI, we present a case study of the Mw 9.0 Tohoku, Japan earthquake simulation with SPECFEM3D on TSUBAME2.0. We demonstrate a checkpoint overhead as low as 8% on sustained 0.1 petaflops runs (1152 GPUs) while checkpointing at high frequency.
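
The idea of encoding checkpoints across nodes, so that a lost checkpoint can be rebuilt from the survivors, can be illustrated with a much simpler code than the one named in the synopsis. The sketch below uses a single XOR parity block rather than Reed-Solomon, so it tolerates only one lost checkpoint per group, whereas Reed-Solomon can tolerate several. It is not FTI's API: the function names, the group of four nodes, and the checkpoint contents are all hypothetical and chosen only for illustration.

```python
from functools import reduce


def xor_parity(blocks):
    """Return the byte-wise XOR of equally sized byte blocks."""
    if len({len(b) for b in blocks}) != 1:
        raise ValueError("all blocks must have the same length")
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def recover_missing(surviving_blocks, parity):
    """Rebuild the single lost block from the survivors and the parity block."""
    return xor_parity(list(surviving_blocks) + [parity])


# Hypothetical checkpoints of four nodes in one encoding group (assumption).
checkpoints = [b"node0-st", b"node1-st", b"node2-st", b"node3-st"]
parity = xor_parity(checkpoints)

# Simulate losing node 2 and rebuilding its checkpoint from the rest.
survivors = checkpoints[:2] + checkpoints[3:]
restored = recover_missing(survivors, parity)
```

XOR-ing the parity block with the surviving checkpoints cancels every surviving contribution and leaves the missing one, which is why a single parity block is enough for a single failure.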


📜 SIMILAR VOLUMES


[ACM Press 2011 International Conference