Fault-Tolerant Matrix Operations for Networks of Workstations Using Diskless Checkpointing
✍ Scribed by James S. Plank; Youngbae Kim; Jack J. Dongarra
- Publisher
- Elsevier Science
- Year
- 1997
- Tongue
- English
- Weight
- 705 KB
- Volume
- 43
- Category
- Article
- ISSN
- 0743-7315
No coin nor oath required. For personal study only.
✦ Synopsis
Networks of workstations (NOWs) offer a cost-effective platform for high-performance, long-running parallel computations. However, these computations must be able to tolerate the changing and often faulty nature of NOW environments. We present high-performance implementations of several fault-tolerant algorithms for distributed scientific computing. The fault-tolerance is based on diskless checkpointing, a paradigm that uses processor redundancy rather than stable storage as the fault-tolerant medium. These algorithms are able to run on clusters of workstations that change over time due to failure, load, or availability. As long as there are at least n processors in the cluster, and failures occur singly, the computation will complete in an efficient manner. We discuss the details of how the algorithms are tuned for fault-tolerance and present the performance results on a PVM network of Sun workstations connected by a fast, switched ethernet.