Fault-Tolerant Matrix Operations for Networks of Workstations Using Diskless Checkpointing

✍ Scribed by James S. Plank; Youngbae Kim; Jack J. Dongarra


Publisher: Elsevier Science
Year: 1997
Language: English
File size: 705 KB
Volume: 43
Category: Article
ISSN: 0743-7315

✦ Synopsis


Networks of workstations (NOWs) offer a cost-effective platform for high-performance, long-running parallel computations. However, these computations must be able to tolerate the changing and often faulty nature of NOW environments. We present high-performance implementations of several fault-tolerant algorithms for distributed scientific computing. Fault tolerance is based on diskless checkpointing, a paradigm that uses processor redundancy rather than stable storage as the fault-tolerant medium. These algorithms can run on clusters of workstations that change over time due to failure, load, or availability. As long as there are at least n processors in the cluster, and failures occur singly, the computation will complete efficiently. We discuss the details of how the algorithms are tuned for fault tolerance and present performance results on a PVM network of Sun workstations connected by a fast, switched Ethernet.
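
To make the idea of processor redundancy concrete, the following is a minimal sketch of one way diskless checkpointing can be realized: each application processor keeps its checkpoint in memory, and one extra processor holds an encoding of all of them. The synopsis does not say which encoding the authors use; bytewise XOR parity is assumed here purely for illustration, and the function names (`encode_parity`, `recover`) and the sample data are hypothetical. Processors are simulated as in-memory byte strings rather than PVM tasks.

```python
from functools import reduce
from typing import List


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length buffers."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode_parity(checkpoints: List[bytes]) -> bytes:
    """Parity block that the extra (redundant) processor would hold."""
    return reduce(xor_bytes, checkpoints)


def recover(survivors: List[bytes], parity: bytes) -> bytes:
    """Rebuild the one lost checkpoint from the survivors plus the parity."""
    return reduce(xor_bytes, survivors, parity)


# Hypothetical in-memory checkpoints of n = 3 application processors.
checkpoints = [
    b"\x01\x02\x03\x04",
    b"\x10\x20\x30\x40",
    b"\xaa\xbb\xcc\xdd",
]
parity = encode_parity(checkpoints)

# Simulate a single failure: processor 1 loses its memory.
failed = 1
survivors = [c for i, c in enumerate(checkpoints) if i != failed]
restored = recover(survivors, parity)
```

Because XOR is its own inverse, combining the parity with every surviving checkpoint leaves exactly the missing one. A single parity block can repair only one loss at a time, which is consistent with the synopsis's condition that failures occur singly.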