FTI - Proceedings of the 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC '11), Seattle, Washington, November 12-18, 2011. ACM Press.
Authors: Bautista-Gomez, Leonardo; Tsuboi, Seiji; Komatitsch, Dimitri; Cappello, Franck; Maruyama, Naoya; Matsuoka, Satoshi
- Book ID: 121278208
- Publisher: ACM Press
- Year: 2011
- Language: English
- File size: 523 KB
- Category: Article
- ISBN: 145030771X
Synopsis
Large scientific applications deployed on current petascale systems spend a significant amount of their execution time dumping checkpoint files to remote storage. New fault-tolerance techniques will be critical to efficiently exploit post-petascale systems. In this work, we propose a low-overhead, high-frequency, multi-level checkpoint technique in which we integrate a highly reliable, topology-aware Reed-Solomon encoding in a three-level checkpoint scheme. We efficiently hide the encoding time using one dedicated fault-tolerance thread per node. We implement our technique in the Fault Tolerance Interface (FTI). We evaluate the correctness of our performance model and conduct a study of the reliability of our library. To demonstrate the performance of FTI, we present a case study of the Mw 9.0 Tohoku, Japan, earthquake simulation with SPECFEM3D on TSUBAME2.0. We demonstrate a checkpoint overhead as low as 8% on sustained 0.1 petaflops runs (1152 GPUs) while checkpointing at high frequency.
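To give a feel for the erasure-coding idea mentioned in the synopsis, here is a minimal Python sketch. It is an illustration under stated assumptions, not FTI's implementation or API: it deliberately swaps Reed-Solomon for single XOR parity, which can rebuild only one lost checkpoint per group, whereas Reed-Solomon can tolerate several simultaneous losses. The function names and the four-node group are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized byte strings."""
    if not blocks:
        raise ValueError("need at least one block")
    size = len(blocks[0])
    if any(len(b) != size for b in blocks):
        raise ValueError("all blocks must have the same length")
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def encode_group(checkpoints):
    """Return the parity block for one group of node-local checkpoints."""
    return xor_blocks(checkpoints)


def recover(checkpoints, parity):
    """Rebuild the single missing checkpoint (marked None) in a group."""
    missing = [i for i, c in enumerate(checkpoints) if c is None]
    if len(missing) != 1:
        raise ValueError("single parity recovers exactly one lost block")
    survivors = [c for c in checkpoints if c is not None]
    restored = list(checkpoints)
    # XOR of all survivors and the parity yields the lost block.
    restored[missing[0]] = xor_blocks(survivors + [parity])
    return restored


# Hypothetical group of four nodes, each holding an 8-byte local checkpoint.
group = [bytes([n] * 8) for n in (17, 42, 99, 200)]
parity = encode_group(group)

# Simulate the loss of node 2 and rebuild its checkpoint from the rest.
damaged = [group[0], group[1], None, group[3]]
assert recover(damaged, parity) == group
```

In a multi-level scheme of the kind the synopsis describes, an encoded level like this would sit between fast node-local checkpoints and slower checkpoints written to remote storage. How groups are formed and where parity is stored are design choices the synopsis does not detail.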