Fault tolerance is an approach by which reliability of a computer system can be increased beyond what can be achieved by traditional methods. While hardware-supported fault tolerance has been well-documented, the newer, software-supported fault tolerance techniques have remained scattered throu
Fault-Tolerant Parallel and Distributed Systems
Scribed by Dimiter R. Avresky, David R. Kaeli (auth.)
- Publisher: Springer US
- Year: 1998
- Language: English
- Pages: 395
- Edition: 1
- Category: Library
Synopsis
The most important use of computing in the future will be in the context of the global "digital convergence" where everything becomes digital and everything is inter-networked. The application will be dominated by storage, search, retrieval, analysis, exchange and updating of information in a wide variety of forms. Heavy demands will be placed on systems by many simultaneous requests. And, fundamentally, all this shall be delivered at much higher levels of dependability, integrity and security.
Increasingly, large parallel computing systems and networks are providing unique challenges to industry and academia in dependable computing, especially because of the higher failure rates intrinsic to these systems. The challenge in the last part of this decade is to build a system that is both inexpensive and highly available. A machine cluster built of commodity hardware parts, with each node running an OS instance and a set of applications extended to be fault resilient, can satisfy the new stringent high-availability requirements.
The focus of this book is to present recent techniques and methods for implementing fault-tolerant parallel and distributed computing systems. Section I, Fault-Tolerant Protocols, considers basic techniques for achieving fault-tolerance in communication protocols for distributed systems, including synchronous and asynchronous group communication, static total causal ordering protocols, and fail-aware datagram service that supports communications by time.
Table of Contents
Front Matter....Pages i-xiii
Front Matter....Pages 1-1
Comparing Synchronous and Asynchronous Group Communication....Pages 3-24
Using Static Total Causal Ordering Protocols to Achieve Ordered View Synchrony....Pages 25-54
A Fail-Aware Datagram Service....Pages 55-69
Front Matter....Pages 71-71
Portable Checkpointing for Heterogeneous Architectures....Pages 73-91
A Checkpointing-Recovery Scheme for Domino-Free Distributed Systems....Pages 93-107
Overview of a Fault-Tolerant System....Pages 109-121
An Efficient Recoverable DSM on a Network of Workstations: Design and Implementation....Pages 123-138
Fault-Tolerance Issues of Local Area Multiprocessor (LAMP) Storage Subsystem....Pages 139-153
Fault-Tolerance Issues in RDBMS on SCI-Based Local Area Multiprocessor (LAMP)....Pages 155-169
Front Matter....Pages 171-171
Distributed Safety-Critical Systems....Pages 173-194
Dependability and Other Challenges in the Collision Between Computing and Telecommunication....Pages 195-211
A Unified Approach for the Synthesis of Scalable and Testable Embedded Architectures....Pages 213-230
A Fault-Robust SPMD Architecture for 3D-TV Image Processing....Pages 231-245
Front Matter....Pages 247-247
A Parallel Algorithm for Embedding Complete Binary Trees in Faulty Hypercubes....Pages 249-265
Fault-Tolerant Broadcasting in a k-ary n-Cube....Pages 267-283
Fault Isolation and Diagnosis in Multiprocessor Systems with Point-to-Point Communication Links....Pages 285-300
An Efficient Hardware Fault-Tolerant Technique....Pages 301-314
Reliability Evaluation of a Task under a Hardware Fault-Tolerant Technique....Pages 315-327
Fault Tolerance Measures for m-ary n-Dimensional Hypercubes Based on Forbidden Faulty Sets....Pages 329-340
On-Line Fault Recovery for Wormhole-Routed Two-Dimensional Meshes....Pages 341-356
Front Matter....Pages 247-247
Fault-Tolerant Dynamic Task Scheduling Based on Dataflow Graphs....Pages 357-371
A Novel Replication Technique for Implementing Fault-Tolerant Parallel Software....Pages 373-384
User-Transparent Checkpointing and Restart for Parallel Computers....Pages 385-399
Back Matter....Pages 401-401
Subjects
Processor Architectures
SIMILAR VOLUMES
The ISIS system transforms abstract type specifications into fault-tolerant distributed implementations, while insulating users from the mechanisms whereby fault-tolerance is achieved. This paper discusses the transformations that are used within ISIS, methods for achieving improved performance by c
Fault-Tolerant Parallel Computation presents recent advances in algorithmic ways of introducing fault-tolerance in multiprocessors under the constraint of preserving efficiency. The difficulty associated with combining fault-tolerance and efficiency is that the two have conflicting means