𝔖 Scriptorium
✦   LIBER   ✦


Dependable Computing: Design and Assessment

✍ Scribed by Ravishankar K. Iyer, Zbigniew T. Kalbarczyk, Nithin M. Nakka


Publisher
IEEE Press / Wiley
Year
2024
Tongue
English
Leaves
851
Category
Library


✦ Table of Contents


Cover
Series Page
Title Page
Copyright Page
Dedication Page
Contents
About the Authors
Preface
Acknowledgments
About the Companion Website
Chapter 1 Dependability Concepts and Taxonomy
1.1 Introduction
1.2 Placing Classical Dependability Techniques in Perspective
1.3 Taxonomy of Dependable Computing
1.3.1 Faults, Errors, and Failures
1.4 Fault Classes
1.5 The Fault Cycle and Dependability Measures
1.6 Fault and Error Classification
1.6.1 Hardware Faults
1.6.2 Software Faults and Errors
1.6.2.1 The GUARDIAN90 Operating System
1.6.2.2 IBM MVS (z/OS) and IBM Database Management Systems
1.7 Mean Time Between Failures
1.8 User-perceived System Dependability
1.9 Technology Trends and Failure Behavior
1.10 Issues at the Hardware Level
1.11 Issues at the Platform Level
1.12 What is Unique About this Book?
1.13 Overview of the Book
References
Chapter 2 Classical Dependability Techniques and Modern Computing Systems: Where and How Do They Meet?
2.1 Illustrative Case Studies of Design for Dependability
2.1.1 IBM System S/360
2.1.2 The Tandem Integrity System
2.1.3 Blue Waters
2.2 Cloud Computing: A Rapidly Expanding Computing Paradigm
2.2.1 Layered Architecture of Cloud Computing
2.2.2 Reliability Issues in Cloud Computing
2.3 New Application Domains
2.3.1 Smart Power Grid Application
2.3.2 Business Integrity Assurance Application
2.3.3 Medical Devices and Systems
2.3.3.1 Monitoring of Soldiers for Blast Impact in a Battlefield Scenario
2.3.3.2 Teleoperated Surgical Robots
2.3.4 Wireless Sensor Networks
2.3.5 Mobile Phones
2.3.6 Artificial Intelligence (AI) Systems
2.4 Insights
References
Chapter 3 Hardware Error Detection and Recovery Through Hardware-Implemented Techniques
3.1 Introduction
3.2 Redundancy Techniques
3.2.1 Comparing the Reliability of Simplex and TMR Systems
3.2.2 M-out-of-N Systems
3.2.3 The Effect of a Voter
3.2.4 Time Redundancy
3.3 Watchdog Timers
3.3.1 Example Applications of Watchdog Timers
3.3.2 Limitations of Watchdog Timers
3.4 Information Redundancy
3.4.1 A Brief History of Coding Theory
3.4.2 Outline of the Description of Coding Techniques
3.4.3 Fault Detection Through Encoding
3.4.4 Parity
3.4.5 Cyclic Redundancy Checks
3.4.6 Checksums
3.4.7 Arithmetic Codes
3.4.7.1 AN Codes
3.4.7.2 Berger Codes
3.4.8 Residue-Inverse Residue Codes
3.4.9 Reed-Solomon Codes
3.4.10 Communication Codes and Protocols
3.4.10.1 Convolutional Codes
3.4.10.2 Communication Protocols for Reliable Transmission
3.4.11 Two-Level Integrated Interleaved Codes
3.4.12 RAID: Redundant Array of Inexpensive Disks
3.4.12.1 A Commercial RAID-Based Storage System
3.5 Capability and Consistency Checking
3.5.1 Capability Checking
3.5.2 Consistency Checking
3.6 Insights
References
Chapter 4 Processor Level Error Detection and Recovery
4.1 Introduction
4.2 Logic-level Techniques
4.2.1 Radiation Hardening
4.2.2 Selective Node-Level Engineering
4.2.3 SEU Hardening for Memory Cells
4.2.4 SEU-tolerant Latch
4.2.4.1 Recovery from a Particle Strike
4.2.5 Razor
4.2.5.1 Pipeline Error Recovery
4.2.5.2 Discussion
4.2.6 Built-in Soft-Error Resilience Using Scan Flip-Flop Reuse
4.2.7 Discussion
4.3 Error Protection in the Processors
4.3.1 Reliability Features of Intel P6 Processor Family
4.3.1.1 Machine Check Architecture (MCA)
4.3.1.2 Functional Redundancy Checking (FRC)
4.3.2 Reliability Features in Itanium
4.3.2.1 Protection of On-Chip Memory Arrays
4.3.2.2 Error Containment
4.3.2.3 Data Poisoning
4.3.2.4 Error Promotion
4.3.2.5 Watchdog Timer
4.3.2.6 Error Detection and Correction Logging
4.3.3 POWER7
4.3.4 NonStop Himalaya Systems
4.4 Academic Research on Hardware-level Error Protection
4.4.1 SRTR: Transient Fault Recovery Using Simultaneous Multithreading
4.4.1.1 Discussion
4.4.2 DIVA: A Reliable Substrate for Deep Submicron Microarchitecture Design
4.4.2.1 Discussion
4.4.3 Microprocessor-based Introspection (MBI)
4.4.4 Phoenix: Detection and Recovery from Permanent Processor Design Bugs
4.5 Insights
References
Chapter 5 Hardware Error Detection Through Software-Implemented Techniques
5.1 Introduction
5.2 Duplication-based Software Detection Techniques
5.2.1 Examples of Software-based Duplication Techniques
5.2.1.1 Duplication at the Level of Source Code
5.2.1.2 ED4I
5.3 Control-Flow Checking
5.3.1 The State of the Art
5.3.1.1 Hardware Schemes
5.3.1.2 Software Schemes
5.3.2 Enhanced Control-Flow Checking with Assertions (ECCA)
5.3.2.1 Insertion of ECCA Assertions
5.3.2.2 SET and TEST Assertions in ECCA
5.3.2.3 ECCA Error Detection
5.3.2.4 Experimental Evaluation of ECCA
5.3.3 Preemptive Control Signature (PECOS)
5.3.3.1 PECOS Error Detection
5.3.3.2 Experimental Evaluation of PECOS
5.4 Heartbeats
5.4.1 Timeout Mechanism
5.4.2 Limitations of Traditional Heartbeats
5.4.3 Designing Adaptive, Smart Heartbeats
5.4.4 Evaluation of Smart Heartbeats
5.4.4.1 Experimental Methodology
5.4.4.2 Experimental Results
5.5 Assertions
5.6 Insights
References
Chapter 6 Software Error Detection and Recovery Through Software Analysis
6.1 Introduction
6.2 Diverse Programming
6.2.1 N-Version Programming
6.2.1.1 Applications of N-Version Programming
6.2.2 Recovery Blocks
6.2.2.1 Sequential Recovery Block Scheme
6.2.2.2 Designing an Acceptance Test
6.2.2.3 Distributed Applications: Recovery Block Conversations
6.2.2.4 Advanced Recovery Block Models and Real-Time Systems
6.3 Static Analysis Techniques
6.3.1 ESP: Path-Sensitive Program Verification in Polynomial Time
6.3.2 PR-Miner: Automatically Extracting Implicit Programming Rules and Detecting Violations in Large Software Code
6.3.3 Dynamic Derivation of Program Invariants
6.3.3.1 DAIKON
6.3.4 Statically Derived Application-Specific Detectors
6.3.4.1 Terms and Definitions
6.3.4.2 Steps in Detector Derivation
6.3.4.3 Example of Derived Detectors
6.3.4.4 Software Errors Covered
6.3.4.5 Hardware Errors Covered
6.3.4.6 Performance and Coverage Measurements
6.4 Error Detection Based on Dynamic Program Analysis
6.4.1 Fault Model
6.4.2 Derivation: Analysis and Design
6.4.2.1 Dynamic Derivation of Detectors
6.4.2.2 Detector Tightness and Execution Cost
6.4.2.3 Detector Derivation Algorithm
6.4.3 Experimental Evaluation
6.4.3.1 Application Programs
6.4.3.2 Infrastructure
6.4.3.3 Experimental Procedure
6.4.4 Results
6.4.4.1 Detection Coverage of Derived Detectors
6.4.4.2 False Positives
6.5 Processor-Level Selective Replication
6.5.1 Application Analysis
6.5.2 Overview of Selective Replication
6.5.3 Mechanism of Replication
6.6 Runtime Checking for Residual Software Bugs
6.6.1 Race Condition Checking in Multithreaded Programs
6.6.2 Array Bounds Checking
6.6.3 Runtime Verification
6.7 Data Audit
6.7.1 Static and Dynamic Data Check
6.7.2 Structural Check
6.7.3 Semantic Referential Integrity Check
6.7.4 Optimization Using Runtime Statistics
6.8 Application of Data Audit Techniques
6.8.1 Target System Software and Database Architecture
6.8.2 Audit Subsystem Architecture
6.8.2.1 The Heartbeat Element
6.8.2.2 The Progress Indicator Element
6.8.2.3 Audit Elements
6.8.3 Evaluating the Audit Subsystem
6.9 Insights
References
Chapter 7 Measurement-based Analysis of System Software: Operating System Failure Behavior
7.1 Introduction
7.2 MVS (Multiple Virtual Storage)
7.2.1 MVS Error Detection and Recovery Processing
7.2.2 MVS Error Detection
7.2.3 Recovery Processing
7.2.3.1 Hardware Error Recovery
7.2.3.2 MVS Software Error Recovery
7.2.4 Hardware-related Software Errors
7.2.4.1 Processing of Error Data
7.2.4.2 Analysis of Error Detection
7.2.4.3 Error Classification and Detection
7.2.4.4 Error Detection and Recovery
7.2.4.5 Detection of HW/SW Software Errors
7.2.5 Analysis of Hardware-related Software Errors
7.2.5.1 Recovery from HW/SW Errors
7.2.6 Summary of MVS Analysis
7.3 Experimental Analysis of OS Dependability
7.3.1 What to Measure and Why?
7.4 Behavior of the Linux Operating System in the Presence of Errors
7.4.1 Methodology
7.4.2 Error Injection Environment
7.4.2.1 Approach
7.4.2.2 Error Activation
7.4.2.3 Error Model
7.4.2.4 Outcome Categories
7.4.3 Overview of Experimental Results
7.4.4 Crash Cause Analysis
7.4.4.1 Stack Injection
7.4.4.2 System Register Injection
7.4.4.3 Code Injection
7.4.4.4 Data Injection
7.4.4.5 Summary
7.4.5 Crash Latency (Cycles-to-Crash) Analysis
7.4.6 Crash Severity
7.4.6.1 Lessons Learned
7.4.6.2 Value in Employing Fault Injection
7.4.6.3 Toolset and Benchmark Procedures
7.4.7 Summary
7.5 Evaluation of Process Pairs in Tandem GUARDIAN
7.5.1 Data Integrity
7.5.2 User Applications
7.5.3 Software Fault Tolerance of Process Pairs
7.5.3.1 Measure of Software Fault Tolerance
7.5.3.2 Outages Due to Software
7.5.3.3 Characterization of Software Fault Tolerance
7.5.4 Discussion
7.5.5 First Occurrences Versus Recurrences
7.5.6 Impact of Software Failures on Performance
7.5.7 Summary
7.6 Benchmarking Multiple Operating Systems: A Case Study Using Linux on Pentium, Solaris on SPARC, and AIX on POWER
7.6.1 Introduction of Case Study
7.6.2 Experimental Setup
7.6.2.1 Fault Model
7.6.2.2 Target Systems
7.6.2.3 Experimental Environment
7.6.3 Evaluation Procedure
7.6.3.1 Generation of Injection Targets
7.6.3.2 Execution of Fault Injection Campaigns
7.6.3.3 Collection and Analysis of Data
7.6.4 Results
7.6.4.1 Comparison of Target Platforms' Error Behavior
7.6.4.2 Feedback for Reliability Enhancements
7.6.5 Detailed Discussion and Analysis
7.6.5.1 Text Injection Analysis
7.6.5.2 Stack Injection Analysis
7.6.5.3 Register Injection Analysis
7.6.6 Conclusions
7.7 Dependability Overview of the Cisco Nexus Operating System
7.8 Evaluating Operating Systems: Related Studies
7.9 Insights
References
Chapter 8 Reliable Networked and Distributed Systems
8.1 Introduction
8.2 System Model
8.3 Failure Models
8.4 Agreement Protocols
8.4.1 Byzantine Agreement Problem: Solution
8.4.1.1 Oral Message Algorithm, OM(f)
8.4.2 Interactive Consistency Obtained by Running the Byzantine Agreement Protocol
8.5 Reliable Broadcast
8.5.1 Reliable Broadcast
8.5.2 FIFO (First-In-First-Out) Broadcast
8.5.3 Causal Broadcast
8.5.4 Total Order Broadcast
8.6 Reliable Group Communication
8.6.1 Specification of Group Communication Service
8.6.1.1 Specification of Group Membership Service
8.6.1.2 Specification of Reliable Multicast Service
8.6.2 Example Implementations of Group Communication Systems
8.7 Replication
8.7.1 Replication in Hardware
8.7.2 Replication in Software
8.7.2.1 Replication at the Level of the Operating System
8.7.2.2 Replication at the Level Between the Hardware and the Operating System
8.7.2.3 Replication at the Level Between the Operating System and the User Application
8.7.2.4 Replication at the User-Level
8.7.2.5 CORBA
8.7.3 The Problem of Nondeterminism
8.7.4 Paxos and Read-Write Quorums: A Practical Approach to Achieving Eventual Consistency
8.7.4.1 Paxos
8.7.5 Read-Write Quorums
8.8 Replication of Multithreaded Applications
8.8.1 System Model: Definitions and Assumptions
8.8.2 Specification of the LSA Algorithm
8.8.3 LSA Algorithm Overview
8.8.3.1 Failure Behavior with Error-Free Leader-to-Followers Communication
8.8.3.2 Failure Behavior with Byzantine Errors in Leader-to-Followers Communication
8.8.4 Specification of the PDS Algorithm
8.8.4.1 PDS-1 Algorithm Overview
8.8.4.2 PDS-2 Algorithm
8.8.5 Application-Transparent Replication Framework
8.8.5.1 Using the LSA and PDS Algorithms with Majority Voting
8.8.5.2 LSA and PDS Implementations
8.8.5.3 Virtual Socket Layer
8.8.5.4 Voter/Fanout Process
8.8.6 Performance-Dependability Trade-Offs
8.8.6.1 Performance Evaluation
8.8.6.2 Dependability Evaluation
8.8.6.3 Injections into a Replica Process
8.8.6.4 Lessons Learned
8.8.7 Conclusions
8.9 Atomic Commit
8.9.1 The Two-Phase Commit Protocol
8.9.1.1 Assumptions
8.9.1.2 Basic Algorithm
8.9.1.3 Disadvantages
8.9.1.4 The Detailed Two-Phase Commit Protocol
8.10 Opportunities and Challenges in Resource-Disaggregated Cloud Data Centers
8.10.1 Data Movement
8.10.2 Data Consistency
8.10.3 Fault Tolerance
8.10.4 ML-based Orchestration and Validation
References
Chapter 9 Checkpointing and Rollback Error Recovery
9.1 Introduction
9.2 Hardware-Implemented Cache-Based Checkpointing Schemes
9.2.1 Cache-Aided Rollback Error Recovery (CARER) for Uniprocessors
9.2.2 Multiprocessor Cache-Based Schemes
9.2.3 ReVive: Cost-Effective Architectural Support for Rollback Recovery in Shared-Memory Multiprocessors
9.3 Memory-Based Schemes
9.3.1 Physical Memory-Based Schemes
9.3.2 Virtual Memory-Based Schemes
9.4 Operating-System-Level Checkpointing
9.4.1 libckpt: Transparent Checkpointing Under Unix
9.4.1.1 Incremental Checkpointing
9.4.1.2 Forked Checkpointing
9.4.2 Fine-Grained Rollback and Deterministic Replay for Software Debugging
9.4.2.1 Rollback of Multithreaded Processes
9.4.3 Transparent Application Checkpoint (TAC) Module
9.4.3.1 RMK Framework
9.4.3.2 RMK Pins: System-Level RMK Interface
9.4.3.3 Application-Level RMK Interface
9.4.3.4 RMK Core
9.4.3.5 An Example RMK Module: Transparent Application Checkpoint (TAC)
9.5 Compiler-Assisted Checkpointing
9.5.1 CATCH – Compiler-Assisted Techniques for Checkpointing
9.5.1.1 Potential Checkpoints
9.5.1.2 Sparse Potential Checkpoints
9.5.1.3 Adaptive Checkpointing
9.5.2 Compiler-Assisted Checkpointing Using libckpt
9.5.2.1 Compiler Directives
9.6 Error Detection and Recovery in Distributed Systems
9.6.1 Synchronous Checkpointing
9.6.2 Asynchronous Checkpointing: Message Logging
9.6.3 Sender-Based Message Logging
9.6.3.1 Design and Motivation
9.6.3.2 A Practical Implementation
9.7 Checkpointing Latency Modeling
9.8 Checkpointing in Main Memory Database Systems (MMDB)
9.8.1 Checkpointing of MMDB Control Structures
9.8.1.1 Checkpointing Framework
9.8.1.2 Incremental Checkpointing
9.8.1.3 Delta Checkpointing
9.9 Checkpointing in Distributed Database Systems
9.9.1 Definitions
9.9.2 The Algorithm
9.9.2.1 Failure Recovery
9.10 Multithreaded Checkpointing
9.10.1 Dealing with Nondeterminism
References
Chapter 10 Checkpointing Large-Scale Systems
10.1 Introduction
10.2 Checkpointing Techniques
10.2.1 Checkpoint Coordination Techniques
10.2.2 Shared Memory Systems
10.2.3 I/O Techniques
10.2.4 Recovery Techniques
10.2.4.1 Use of Spares
10.3 Checkpointing in Selected Existing Systems
10.3.1 Blue Gene
10.3.2 Brazos
10.3.3 Winckp
10.3.4 Condor
10.3.5 Libckpt
10.3.6 Classification of Checkpointing Approaches in Existing Systems
10.3.7 Example of Evaluation of Checkpointing Schemes for a Large-Scale System
10.3.8 Determining Optimal Checkpointing Interval
10.4 Modeling Coordinated Checkpointing for Large-Scale Supercomputers
10.4.1 Failure and Recovery
10.4.2 SAN-Based Modeling
10.4.2.1 Modeling Compute and Checkpointing
10.4.2.2 Modeling Correlated Failures
10.4.2.3 Results
10.5 Checkpointing in Large-Scale Systems: A Simulation Study
10.6 Cooperative Checkpointing
10.6.1 Other Terms and Definitions
10.6.2 Cooperative Checkpointing vs. Periodic Checkpointing
References
Chapter 11 Internals of Fault Injection Techniques
11.1 Introduction
11.2 Historical View of Software Fault Injection
11.3 Fault Model Attributes
11.4 Compile-Time Fault Injection
11.4.1 Source Code Mutation
11.4.2 Bytecode Mutation
11.5 Runtime Fault Injection
11.5.1 Time Trigger Faults
11.5.2 Runtime Mutation
11.5.2.1 Mutation of APIs and System Call Parameters
11.5.2.2 Software Probe
11.5.2.3 Network Messaging Faults
11.5.3 Library-Based Faults
11.5.4 Performance/Timing Faults
11.5.5 User-Space Ptrace-Based Faults
11.5.5.1 Fault Injection Using Trap Instruction
11.5.5.2 Fault Injection Using Debug Register
11.5.6 Fault Injection Using GDB
11.5.7 Kernel Space
11.5.7.1 Kernel Fault Injection
11.5.7.2 Driver
11.5.7.3 User Virtual Address
11.5.8 Configurable FPGAs
11.5.9 Security Threats
11.6 Simulation-Based Fault Injection
11.7 Dependability Benchmark Attributes
11.8 Architecture of a Fault Injection Environment: NFTAPE Fault/Error Injection Framework Configured to Evaluate Linux OS
11.8.1 Fault Injection Environment
11.8.2 Approach Overview
11.8.3 Kernel Profiling
11.8.3.1 Workload
11.8.3.2 Profiling
11.8.4 Hardware Monitoring
11.8.5 Control Host Overview
11.8.5.1 Target Generator
11.8.5.2 Injector Manager
11.8.6 Kernel-Level Support
11.8.6.1 Injection Controller
11.8.7 Breakpoint Handler
11.8.8 Crash Handler
11.8.9 Crash Dumper
11.8.10 Component Interactions
11.9 ML-Based Fault Injection: Evaluating Modern Autonomous Vehicles
11.9.1 DriveFI: Bayesian Fault Injection Framework
11.9.1.1 Autonomous Driving System Overview
11.9.1.2 Defining Safety
11.9.1.3 Fault Injection
11.9.1.4 Case Studies
11.9.2 Bayesian Fault Injection
11.9.2.1 Kinematics-Based Model of Safety
11.9.2.2 Machine Learning Model Describing the System's Response Under Faults
11.9.3 The ADS Architecture and Simulation
11.9.3.1 Overview of ADS
11.9.3.2 Simulation Platform
11.9.4 DriveFI Architecture
11.9.4.1 Injecting into Computational Elements: GPU Fault Models
11.9.4.2 Injecting Faults into ADS Module Output Variables
11.9.5 Results
11.9.5.1 GPU-Level Fault Injection
11.9.5.2 Source-Level Fault Injections
11.9.5.3 Results of Bayesian FI-Based Injections
11.9.6 AV-Fuzzer: Fault Injection Framework Based on AI-Driven Fuzzing
11.9.7 Related Work
11.10 Insights and Concluding Remarks
References
Chapter 12 Measurement-Based Analysis of Large-Scale Clusters: Methodology
12.1 Introduction
12.2 Related Research
12.2.1 Failure Data Analysis in Specific Application Domains
12.2.2 Analysis of Data on Security Incidents
12.3 Steps in Field Failure Data Analysis
12.4 Failure Event Monitoring and Logging
12.4.1 Automated Error Logging
12.4.1.1 Syslog
12.4.1.2 Blue Waters Logs
12.4.1.3 IBM z/OS Logs
12.4.1.4 IBM Blue Gene RAS Events
12.4.1.5 Windows Event Logging
12.4.2 Human-Generated Failure Reports
12.4.2.1 Bug Databases and Public User Forums
12.5 Data Processing
12.5.1 Data Filtering
12.5.1.1 Example: Processing of Public Computer-Related Recalls Databases for Safety-Critical Medical Devices
12.5.2 Data Coalescence
12.5.2.1 Time-Based Coalescence
12.5.2.2 Problems with Time-Based Coalescence
12.5.2.3 Example of Time-Based Spatial Coalescence of Failure Data from Blue Gene/L
12.5.2.4 Content-Based Event Coalescence
12.6 Data Analysis
12.6.1 Basic Statistics
12.6.2 Repair Rates
12.6.2.1 Example: Root Cause Analysis from 20 HPC Systems at LANL
12.6.2.2 Example: Analysis of Smartphone Users' Failure Reports
12.6.2.3 Example: Analysis of Failures from LANs of Windows NT Machines
12.7 Estimation of Empirical Distributions
12.7.1 Hazard Rate Estimation
12.7.1.1 Hazard Rate Estimation from VAXclusters
12.7.1.2 Hazard Rate Estimation from a Software-as-a-Service Platform
12.8 Dependency Analysis
12.8.1 Workload/Failure Dependency
12.8.2 Failure Dependency Among Components
12.8.2.1 Steps in Correlation Analysis
12.8.3 Error Interaction Analysis
12.8.3.1 Hardware-Related Software Errors
References
Chapter 13 Measurement-Based Analysis of Large Systems: Case Studies
13.1 Introduction
13.2 Case Study I: Failure Characterization of a Production Software-as-a-Service Cloud Platform
13.2.1 Data Source
13.2.2 Failure Analysis Workflow
13.2.3 Failure Characterization
13.2.3.1 Output of the Coalescence Process
13.2.3.2 Key Factors Impacting Platform Failures
13.2.3.3 Impact of Timeout Errors
13.2.4 Failure Rate Analysis
13.2.4.1 Trend Analysis of the Platform Failure Rate
13.2.4.2 Impact of Platform Software Upgrades
13.2.4.3 Impact of the Workload Volume on the Platform Failure Rate
13.2.4.4 Impact of the Workload Intensity on the Platform Failure Rate
13.2.5 Conclusions
13.3 Case Study II: Analysis of Blue Waters System Failures
13.3.1 Data and Methodology
13.3.1.1 Characterization Methodology
13.3.2 Blue Waters Failure Causes
13.3.2.1 Breakdown of Failures
13.3.2.2 Effectiveness of Failover
13.3.3 Hardware Error Resiliency
13.3.3.1 Rate of Uncorrectable Errors Across Different Node Types
13.3.3.2 Hardware Failure Rates
13.3.3.3 Hardware Failure Trends
13.3.4 Characterization of Systemwide Outages
13.3.5 Conclusions
13.4 Case Study III: Autonomous Vehicles: Analysis of Human-Generated Data
13.4.1 Examples of AV-Related Accidents
13.4.2 AV System Description and Data Collection
13.4.2.1 AV Hierarchical Control Structure
13.4.2.2 Data Sources
13.4.3 Data-Analysis Workflow: Parsing, Filtering, Normalization, and NLP
13.4.4 Statistical Analysis of Failures in AVs
13.4.4.1 Analysis of AV Disengagement Reports
13.4.4.2 Analysis of AV Accident Reports
13.4.5 Discussion
13.4.6 Limitations of this Study
13.4.7 Related Work
13.4.8 Insights and Conclusions
References
Chapter 14 The Future: Dependable and Trustworthy AI Systems
14.1 Introduction
14.2 Building Trustworthy AI Systems
14.2.1 An AI System and Its Key Components
14.2.2 A System Perspective on Trust in AI Systems
14.3 Offline Identification of Deficiencies
14.3.1 Assessment and Validation of a System and Its Design
14.3.1.1 Formal Verification
14.3.1.2 Traditional End-to-End Random Fault Injection
14.3.1.3 Model-driven Fuzzing, Falsification, and Fault Injection
14.3.2 Post-Mortem Analysis to Track the Causes of Incidents Systematically
14.3.2.1 Adversarial Learning: A Red Team Approach
14.3.2.2 Adversarial Learning: A Systematic Approach to Mislead AI Systems
14.3.2.3 Generative Adversarial Networks
14.3.3 Smart Malware with Self-Learning Capabilities
14.4 Online Detection and Mitigation
14.4.1 Formalization
14.4.2 Monitoring
14.4.3 Mitigation
14.5 Trust Model Formulation
14.5.1 An Illustrative Trust Model
14.6 Modeling the Trustworthiness of Critical Applications
14.6.1 Autonomous Vehicles and Transportation
14.6.1.1 Addressing Uncertainty
14.6.2 Large-Scale Computing Infrastructure
14.6.2.1 Model Formulation
14.6.2.2 Addressing Uncertainty
14.6.3 Healthcare AI/ML
14.6.3.1 Model Formulation
14.6.3.2 Addressing Uncertainty
14.7 Conclusion: How Can We Make AI Systems Trustworthy?
References
Index
EULA


📜 SIMILAR VOLUMES


Design of Dependable Computing Systems
✍ Jean-Claude Geffroy, Gilles Motet (auth.) 📂 Library 📅 2002 🏛 Springer Netherlands 🌐 English

This book analyzes the causes of failures in computing systems, their consequences, as well as the existing solutions to manage them. The domain is tackled in a progressive and educational manner with two objectives: 1. The mastering of the basics of the dependability domain at system level, that is…

Computer Design and Computational Defense Systems
✍ Nikos E. Mastorakis 📂 Library 📅 2011 🏛 Nova Science Publishers, Incorporated 🌐 English

This book presents and discusses research in the study of computer science, with a particular focus on computer design and computational defense systems. Topics discussed include memory grid mapping; optimal nozzle design with monotonicity constraints; statistical reliability with applications to de…

Dependable Network Computing
✍ Jean-Claude Laprie (auth.), Dimiter R. Avresky (eds.) 📂 Library 📅 2000 🏛 Springer US 🌐 English

Dependable Network Computing provides insights into various problems facing millions of global users resulting from the 'internet revolution'. It covers real-time problems involving software, servers, and large-scale storage systems with adaptive fault-tolerant routing and dynamic reconf…