Is Parallel Programming Hard, And, If So, What Can You Do About It?
Author: Paul E. McKenney
- Year
- 2014
- Language
- English
- Pages
- 521
- Category
- Library
Free of charge, no registration required. For personal study only.
Table of Contents
1 How To Use This Book
1.1 Roadmap
1.2 Quick Quizzes
1.3 Alternatives to This Book
1.4 Sample Source Code
1.5 Whose Book Is This?
2 Introduction
2.1 Historic Parallel Programming Difficulties
2.2 Parallel Programming Goals
2.2.1 Performance
2.2.2 Productivity
2.2.3 Generality
2.3 Alternatives to Parallel Programming
2.3.1 Multiple Instances of a Sequential Application
2.3.2 Use Existing Parallel Software
2.3.3 Performance Optimization
2.4 What Makes Parallel Programming Hard?
2.4.1 Work Partitioning
2.4.2 Parallel Access Control
2.4.3 Resource Partitioning and Replication
2.4.4 Interacting With Hardware
2.4.5 Composite Capabilities
2.4.6 How Do Languages and Environments Assist With These Tasks?
2.5 Discussion
3 Hardware and its Habits
3.1 Overview
3.1.1 Pipelined CPUs
3.1.2 Memory References
3.1.3 Atomic Operations
3.1.4 Memory Barriers
3.1.5 Cache Misses
3.1.6 I/O Operations
3.2 Overheads
3.2.1 Hardware System Architecture
3.2.2 Costs of Operations
3.3 Hardware Free Lunch?
3.3.1 3D Integration
3.3.2 Novel Materials and Processes
3.3.3 Light, Not Electrons
3.3.4 Special-Purpose Accelerators
3.3.5 Existing Parallel Software
3.4 Software Design Implications
4 Tools of the Trade
4.1 Scripting Languages
4.2 POSIX Multiprocessing
4.2.1 POSIX Process Creation and Destruction
4.2.2 POSIX Thread Creation and Destruction
4.2.3 POSIX Locking
4.2.4 POSIX Reader-Writer Locking
4.3 Atomic Operations
4.4 Linux-Kernel Equivalents to POSIX Operations
4.5 The Right Tool for the Job: How to Choose?
5 Counting
5.1 Why Isn't Concurrent Counting Trivial?
5.2 Statistical Counters
5.2.1 Design
5.2.2 Array-Based Implementation
5.2.3 Eventually Consistent Implementation
5.2.4 Per-Thread-Variable-Based Implementation
5.2.5 Discussion
5.3 Approximate Limit Counters
5.3.1 Design
5.3.2 Simple Limit Counter Implementation
5.3.3 Simple Limit Counter Discussion
5.3.4 Approximate Limit Counter Implementation
5.3.5 Approximate Limit Counter Discussion
5.4 Exact Limit Counters
5.4.1 Atomic Limit Counter Implementation
5.4.2 Atomic Limit Counter Discussion
5.4.3 Signal-Theft Limit Counter Design
5.4.4 Signal-Theft Limit Counter Implementation
5.4.5 Signal-Theft Limit Counter Discussion
5.5 Applying Specialized Parallel Counters
5.6 Parallel Counting Discussion
6 Partitioning and Synchronization Design
6.1 Partitioning Exercises
6.1.1 Dining Philosophers Problem
6.1.2 Double-Ended Queue
6.1.3 Partitioning Example Discussion
6.2 Design Criteria
6.3 Synchronization Granularity
6.3.1 Sequential Program
6.3.2 Code Locking
6.3.3 Data Locking
6.3.4 Data Ownership
6.3.5 Locking Granularity and Performance
6.4 Parallel Fastpath
6.4.1 Reader/Writer Locking
6.4.2 Hierarchical Locking
6.4.3 Resource Allocator Caches
6.5 Beyond Partitioning
6.5.1 Work-Queue Parallel Maze Solver
6.5.2 Alternative Parallel Maze Solver
6.5.3 Performance Comparison I
6.5.4 Alternative Sequential Maze Solver
6.5.5 Performance Comparison II
6.5.6 Future Directions and Conclusions
6.6 Partitioning, Parallelism, and Optimization
7 Locking
7.1 Staying Alive
7.1.1 Deadlock
7.1.2 Livelock and Starvation
7.1.3 Unfairness
7.1.4 Inefficiency
7.2 Types of Locks
7.2.1 Exclusive Locks
7.2.2 Reader-Writer Locks
7.2.3 Beyond Reader-Writer Locks
7.2.4 Scoped Locking
7.3 Locking Implementation Issues
7.3.1 Sample Exclusive-Locking Implementation Based on Atomic Exchange
7.3.2 Other Exclusive-Locking Implementations
7.4 Lock-Based Existence Guarantees
7.5 Locking: Hero or Villain?
7.5.1 Locking For Applications: Hero!
7.5.2 Locking For Parallel Libraries: Just Another Tool
7.5.3 Locking For Parallelizing Sequential Libraries: Villain!
7.6 Summary
8 Data Ownership
8.1 Multiple Processes
8.2 Partial Data Ownership and pthreads
8.3 Function Shipping
8.4 Designated Thread
8.5 Privatization
8.6 Other Uses of Data Ownership
9 Deferred Processing
9.1 Reference Counting
9.1.1 Implementation of Reference-Counting Categories
9.1.2 Hazard Pointers
9.1.3 Linux Primitives Supporting Reference Counting
9.1.4 Counter Optimizations
9.2 Sequence Locks
9.3 Read-Copy Update (RCU)
9.3.1 Introduction to RCU
9.3.2 RCU Fundamentals
9.3.3 RCU Usage
9.3.4 RCU Linux-Kernel API
9.3.5 "Toy" RCU Implementations
9.3.6 RCU Exercises
9.4 Which to Choose?
9.5 What About Updates?
10 Data Structures
10.1 Motivating Application
10.2 Partitionable Data Structures
10.2.1 Hash-Table Design
10.2.2 Hash-Table Implementation
10.2.3 Hash-Table Performance
10.3 Read-Mostly Data Structures
10.3.1 RCU-Protected Hash Table Implementation
10.3.2 RCU-Protected Hash Table Performance
10.3.3 RCU-Protected Hash Table Discussion
10.4 Non-Partitionable Data Structures
10.4.1 Resizable Hash Table Design
10.4.2 Resizable Hash Table Implementation
10.4.3 Resizable Hash Table Discussion
10.4.4 Other Resizable Hash Tables
10.5 Other Data Structures
10.6 Micro-Optimization
10.6.1 Specialization
10.6.2 Bits and Bytes
10.6.3 Hardware Considerations
10.7 Summary
11 Validation
11.1 Introduction
11.1.1 Where Do Bugs Come From?
11.1.2 Required Mindset
11.1.3 When Should Validation Start?
11.1.4 The Open Source Way
11.2 Tracing
11.3 Assertions
11.4 Static Analysis
11.5 Code Review
11.5.1 Inspection
11.5.2 Walkthroughs
11.5.3 Self-Inspection
11.6 Probability and Heisenbugs
11.6.1 Statistics for Discrete Testing
11.6.2 Abusing Statistics for Discrete Testing
11.6.3 Statistics for Continuous Testing
11.6.4 Hunting Heisenbugs
11.7 Performance Estimation
11.7.1 Benchmarking
11.7.2 Profiling
11.7.3 Differential Profiling
11.7.4 Microbenchmarking
11.7.5 Isolation
11.7.6 Detecting Interference
11.8 Summary
12 Formal Verification
12.1 What are Promela and Spin?
12.2 Promela Example: Non-Atomic Increment
12.3 Promela Example: Atomic Increment
12.3.1 Combinatorial Explosion
12.4 How to Use Promela
12.4.1 Promela Peculiarities
12.4.2 Promela Coding Tricks
12.5 Promela Example: Locking
12.6 Promela Example: QRCU
12.6.1 Running the QRCU Example
12.6.2 How Many Readers and Updaters Are Really Needed?
12.6.3 Alternative Approach: Proof of Correctness
12.6.4 Alternative Approach: More Capable Tools
12.6.5 Alternative Approach: Divide and Conquer
12.7 Promela Parable: dynticks and Preemptible RCU
12.7.1 Introduction to Preemptible RCU and dynticks
12.7.2 Validating Preemptible RCU and dynticks
12.7.3 Lessons (Re)Learned
12.8 Simplicity Avoids Formal Verification
12.8.1 State Variables for Simplified Dynticks Interface
12.8.2 Entering and Leaving Dynticks-Idle Mode
12.8.3 NMIs From Dynticks-Idle Mode
12.8.4 Interrupts From Dynticks-Idle Mode
12.8.5 Checking For Dynticks Quiescent States
12.8.6 Discussion
12.9 Formal Verification and Memory Ordering
12.9.1 Anatomy of a Litmus Test
12.9.2 What Does This Litmus Test Mean?
12.9.3 Running a Litmus Test
12.9.4 CPPMEM Discussion
12.10 Summary
13 Putting It All Together
13.1 Counter Conundrums
13.1.1 Counting Updates
13.1.2 Counting Lookups
13.2 RCU Rescues
13.2.1 RCU and Per-Thread-Variable-Based Statistical Counters
13.2.2 RCU and Counters for Removable I/O Devices
13.2.3 Array and Length
13.2.4 Correlated Fields
13.3 Hashing Hassles
13.3.1 Correlated Data Elements
13.3.2 Update-Friendly Hash-Table Traversal
14 Advanced Synchronization
14.1 Avoiding Locks
14.2 Memory Barriers
14.2.1 Memory Ordering and Memory Barriers
14.2.2 If B Follows A, and C Follows B, Why Doesn't C Follow A?
14.2.3 Variables Can Have More Than One Value
14.2.4 What Can You Trust?
14.2.5 Review of Locking Implementations
14.2.6 A Few Simple Rules
14.2.7 Abstract Memory Access Model
14.2.8 Device Operations
14.2.9 Guarantees
14.2.10 What Are Memory Barriers?
14.2.11 Locking Constraints
14.2.12 Memory-Barrier Examples
14.2.13 The Effects of the CPU Cache
14.2.14 Where Are Memory Barriers Needed?
14.3 Non-Blocking Synchronization
14.3.1 Simple NBS
14.3.2 NBS Discussion
15 Ease of Use
15.1 What is Easy?
15.2 Rusty Scale for API Design
15.3 Shaving the Mandelbrot Set
16 Conflicting Visions of the Future
16.1 The Future of CPU Technology Ain't What it Used to Be
16.1.1 Uniprocessor Über Alles
16.1.2 Multithreaded Mania
16.1.3 More of the Same
16.1.4 Crash Dummies Slamming into the Memory Wall
16.2 Transactional Memory
16.2.1 Outside World
16.2.2 Process Modification
16.2.3 Synchronization
16.2.4 Discussion
16.3 Hardware Transactional Memory
16.3.1 HTM Benefits WRT Locking
16.3.2 HTM Weaknesses WRT Locking
16.3.3 HTM Weaknesses WRT Locking When Augmented
16.3.4 Where Does HTM Best Fit In?
16.3.5 Potential Game Changers
16.3.6 Conclusions
16.4 Functional Programming for Parallelism
A Important Questions
A.1 What Does "After" Mean?
A.2 What Time Is It?
B Synchronization Primitives
B.1 Organization and Initialization
B.1.1 smp_init():
B.2 Thread Creation, Destruction, and Control
B.2.1 create_thread()
B.2.2 smp_thread_id()
B.2.3 for_each_thread()
B.2.4 for_each_running_thread()
B.2.5 wait_thread()
B.2.6 wait_all_threads()
B.2.7 Example Usage
B.3 Locking
B.3.1 spin_lock_init()
B.3.2 spin_lock()
B.3.3 spin_trylock()
B.3.4 spin_unlock()
B.3.5 Example Usage
B.4 Per-Thread Variables
B.4.1 DEFINE_PER_THREAD()
B.4.2 DECLARE_PER_THREAD()
B.4.3 per_thread()
B.4.4 __get_thread_var()
B.4.5 init_per_thread()
B.4.6 Usage Example
B.5 Performance
C Why Memory Barriers?
C.1 Cache Structure
C.2 Cache-Coherence Protocols
C.2.1 MESI States
C.2.2 MESI Protocol Messages
C.2.3 MESI State Diagram
C.2.4 MESI Protocol Example
C.3 Stores Result in Unnecessary Stalls
C.3.1 Store Buffers
C.3.2 Store Forwarding
C.3.3 Store Buffers and Memory Barriers
C.4 Store Sequences Result in Unnecessary Stalls
C.4.1 Invalidate Queues
C.4.2 Invalidate Queues and Invalidate Acknowledge
C.4.3 Invalidate Queues and Memory Barriers
C.5 Read and Write Memory Barriers
C.6 Example Memory-Barrier Sequences
C.6.1 Ordering-Hostile Architecture
C.6.2 Example 1
C.6.3 Example 2
C.6.4 Example 3
C.7 Memory-Barrier Instructions For Specific CPUs
C.7.1 Alpha
C.7.2 AMD64
C.7.3 ARMv7-A/R
C.7.4 IA64
C.7.5 PA-RISC
C.7.6 POWER / PowerPC
C.7.7 SPARC RMO, PSO, and TSO
C.7.8 x86
C.7.9 zSeries
C.8 Are Memory Barriers Forever?
C.9 Advice to Hardware Designers
D Read-Copy Update Implementations
D.1 Sleepable RCU Implementation
D.1.1 SRCU Implementation Strategy
D.1.2 SRCU API and Usage
D.1.3 Implementation
D.1.4 SRCU Summary
D.2 Hierarchical RCU Overview
D.2.1 Review of RCU Fundamentals
D.2.2 Brief Overview of Classic RCU Implementation
D.2.3 RCU Desiderata
D.2.4 Towards a More Scalable RCU Implementation
D.2.5 Towards a Greener RCU Implementation
D.2.6 State Machine
D.2.7 Use Cases
D.2.8 Testing
D.2.9 Conclusion
D.3 Hierarchical RCU Code Walkthrough
D.3.1 Data Structures and Kernel Parameters
D.3.2 External Interfaces
D.3.3 Initialization
D.3.4 CPU Hotplug
D.3.5 Miscellaneous Functions
D.3.6 Grace-Period-Detection Functions
D.3.7 Dyntick-Idle Functions
D.3.8 Forcing Quiescent States
D.3.9 CPU-Stall Detection
D.3.10 Possible Flaws and Changes
D.4 Preemptible RCU
D.4.1 Conceptual RCU
D.4.2 Overview of Preemptible RCU Algorithm
D.4.3 Validation of Preemptible RCU
E Read-Copy Update in Linux
E.1 RCU Usage Within Linux
E.2 RCU Evolution
E.2.1 2.6.27 Linux Kernel
E.2.2 2.6.28 Linux Kernel
E.2.3 2.6.29 Linux Kernel
E.2.4 2.6.31 Linux Kernel
E.2.5 2.6.32 Linux Kernel
E.2.6 2.6.33 Linux Kernel
E.2.7 2.6.34 Linux Kernel
E.2.8 2.6.35 Linux Kernel
E.2.9 2.6.36 Linux Kernel
E.2.10 2.6.37 Linux Kernel
E.2.11 2.6.38 Linux Kernel
E.2.12 2.6.39 Linux Kernel
E.2.13 3.0 Linux Kernel
E.2.14 3.1 Linux Kernel
E.2.15 3.2 Linux Kernel
E.2.16 3.3 Linux Kernel
E.2.17 3.4 Linux Kernel
E.2.18 3.5 Linux Kernel
E.2.19 3.6 Linux Kernel
E.2.20 3.7 Linux Kernel
E.2.21 3.8 Linux Kernel
E.2.22 3.9 Linux Kernel
E.2.23 3.10 Linux Kernel
E.2.24 3.11 Linux Kernel
E.2.25 3.12 Linux Kernel
E.2.26 3.13 Linux Kernel
F Answers to Quick Quizzes
F.1 How To Use This Book
F.2 Introduction
F.3 Hardware and its Habits
F.4 Tools of the Trade
F.5 Counting
F.6 Partitioning and Synchronization Design
F.7 Locking
F.8 Data Ownership
F.9 Deferred Processing
F.10 Data Structures
F.11 Validation
F.12 Formal Verification
F.13 Putting It All Together
F.14 Advanced Synchronization
F.15 Ease of Use
F.16 Conflicting Visions of the Future
F.17 Important Questions
F.18 Synchronization Primitives
F.19 Why Memory Barriers?
F.20 Read-Copy Update Implementations
G Glossary and Bibliography
H Credits
H.1 Authors
H.2 Reviewers
H.3 Machine Owners
H.4 Original Publications
H.5 Figure Credits
H.6 Other Support
New releases appear at https://mirrors.edge.kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook.html
<span>This book examines what makes parallel programming hard, and describes design techniques that can help you avoid many parallel-programming pitfalls. It is primarily intended for low-level C/C++ code, but offers valuable lessons for other environments as well.</span>