
Application Level Fault Tolerance in Heterogeneous Networks of Workstations

✍ Scribed by Adam Beguelin; Erik Seligman; Peter Stephan


Publisher: Elsevier Science
Year: 1997
Language: English
File size: 425 KB
Volume: 43
Category: Article
ISSN: 0743-7315


✦ Synopsis


We have explored methods for checkpointing and restarting processes within the distributed object migration environment (Dome), a C++ library of data parallel objects that are automatically distributed over heterogeneous networks of workstations (NOWs). System level checkpointing methods, although transparent to the user, were rejected because they lack support for heterogeneity. We have implemented application level checkpointing, which places the checkpoint and restart mechanisms within Dome's C++ objects. Application level checkpointing has been implemented with a library-based technique for the programmer and a more transparent preprocessor-based technique. Dome's implementation of checkpointing successfully checkpoints and restarts processes on different numbers of machines and different architectures. Results from executing Dome programs across a NOW with realistic failure rates have been experimentally determined and are compared with theoretical results. The overhead of checkpointing is found to be low, while providing substantial decreases in expected runtime on realistic systems.
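
The idea of placing checkpoint and restart logic inside the objects themselves can be illustrated with a minimal C++ sketch. This is not Dome's API: the class `CheckpointableVector`, its methods, and the text checkpoint format are all hypothetical, chosen only to show why an object-level checkpoint can be restored on a different architecture (the state is written as decimal text rather than as a raw memory image) and on a different number of machines (the checkpoint holds the whole logical vector, which is re-partitioned at restart).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical data-parallel vector that saves and restores its own state.
// Names and checkpoint format are assumptions for illustration, not Dome's.
class CheckpointableVector {
public:
    explicit CheckpointableVector(std::vector<std::int64_t> values = {})
        : values_(std::move(values)) {}

    // One unit of application work: increment every element.
    void step() {
        for (auto& v : values_) v += 1;
    }

    // Write the next iteration number, the element count, and the elements
    // as decimal text, which does not depend on byte order or word size.
    void checkpoint(std::ostream& out, std::size_t next_iteration) const {
        out << next_iteration << ' ' << values_.size();
        for (std::int64_t v : values_) out << ' ' << v;
        out << '\n';
    }

    // Rebuild state from a checkpoint. Returns false and leaves the object
    // unchanged if the checkpoint is truncated or malformed.
    bool restore(std::istream& in, std::size_t& next_iteration) {
        std::size_t iteration = 0, count = 0;
        if (!(in >> iteration >> count)) return false;
        std::vector<std::int64_t> loaded;
        for (std::size_t i = 0; i < count; ++i) {
            std::int64_t v = 0;
            if (!(in >> v)) return false;
            loaded.push_back(v);
        }
        values_ = std::move(loaded);
        next_iteration = iteration;
        return true;
    }

    // Split the logical vector into contiguous blocks, one per worker, so a
    // restart may use a different number of machines than the original run.
    std::vector<std::vector<std::int64_t>> partition(std::size_t workers) const {
        assert(workers > 0);
        std::vector<std::vector<std::int64_t>> blocks(workers);
        for (std::size_t i = 0; i < values_.size(); ++i)
            blocks[i * workers / values_.size()].push_back(values_[i]);
        return blocks;
    }

    const std::vector<std::int64_t>& values() const { return values_; }

private:
    std::vector<std::int64_t> values_;
};
```

In this sketch a program would call `checkpoint` every few iterations and, after a failure, construct a fresh object, call `restore` to recover the data and the iteration at which to resume, and then call `partition` with however many workers are currently available. A system-level checkpoint, by contrast, would capture the process image and so tie the restart to the original architecture.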


📜 SIMILAR VOLUMES


A Real-Time Parallel Application: The D
✍ Stefano Marano; Mario Medugno; Maurizio Longo 📂 Article 📅 1998 🏛 Elsevier Science 🌐 English ⚖ 266 KB

We deal with the detection of gravitational chirp signals among noisy data, where the reception and the detection are piped and run in parallel. We consider the classical theory of signal detection, which yields a detector with a "bank-of-filters" structure. We investigate distributed network comput