Is Parallel Programming Hard, And, If So, What Can You Do About It?
Author: Paul E. McKenney
- Year
- 2014
- Language
- English
- Pages
- 521
- Category
- Library
Free of charge, no registration required. For personal study only.
Table of Contents
1 How To Use This Book
1.1 Roadmap
1.2 Quick Quizzes
1.3 Alternatives to This Book
1.4 Sample Source Code
1.5 Whose Book Is This?
2 Introduction
2.1 Historic Parallel Programming Difficulties
2.2 Parallel Programming Goals
2.2.1 Performance
2.2.2 Productivity
2.2.3 Generality
2.3 Alternatives to Parallel Programming
2.3.1 Multiple Instances of a Sequential Application
2.3.2 Use Existing Parallel Software
2.3.3 Performance Optimization
2.4 What Makes Parallel Programming Hard?
2.4.1 Work Partitioning
2.4.2 Parallel Access Control
2.4.3 Resource Partitioning and Replication
2.4.4 Interacting With Hardware
2.4.5 Composite Capabilities
2.4.6 How Do Languages and Environments Assist With These Tasks?
2.5 Discussion
3 Hardware and its Habits
3.1 Overview
3.1.1 Pipelined CPUs
3.1.2 Memory References
3.1.3 Atomic Operations
3.1.4 Memory Barriers
3.1.5 Cache Misses
3.1.6 I/O Operations
3.2 Overheads
3.2.1 Hardware System Architecture
3.2.2 Costs of Operations
3.3 Hardware Free Lunch?
3.3.1 3D Integration
3.3.2 Novel Materials and Processes
3.3.3 Light, Not Electrons
3.3.4 Special-Purpose Accelerators
3.3.5 Existing Parallel Software
3.4 Software Design Implications
4 Tools of the Trade
4.1 Scripting Languages
4.2 POSIX Multiprocessing
4.2.1 POSIX Process Creation and Destruction
4.2.2 POSIX Thread Creation and Destruction
4.2.3 POSIX Locking
4.2.4 POSIX Reader-Writer Locking
4.3 Atomic Operations
4.4 Linux-Kernel Equivalents to POSIX Operations
4.5 The Right Tool for the Job: How to Choose?
5 Counting
5.1 Why Isn't Concurrent Counting Trivial?
5.2 Statistical Counters
5.2.1 Design
5.2.2 Array-Based Implementation
5.2.3 Eventually Consistent Implementation
5.2.4 Per-Thread-Variable-Based Implementation
5.2.5 Discussion
5.3 Approximate Limit Counters
5.3.1 Design
5.3.2 Simple Limit Counter Implementation
5.3.3 Simple Limit Counter Discussion
5.3.4 Approximate Limit Counter Implementation
5.3.5 Approximate Limit Counter Discussion
5.4 Exact Limit Counters
5.4.1 Atomic Limit Counter Implementation
5.4.2 Atomic Limit Counter Discussion
5.4.3 Signal-Theft Limit Counter Design
5.4.4 Signal-Theft Limit Counter Implementation
5.4.5 Signal-Theft Limit Counter Discussion
5.5 Applying Specialized Parallel Counters
5.6 Parallel Counting Discussion
6 Partitioning and Synchronization Design
6.1 Partitioning Exercises
6.1.1 Dining Philosophers Problem
6.1.2 Double-Ended Queue
6.1.3 Partitioning Example Discussion
6.2 Design Criteria
6.3 Synchronization Granularity
6.3.1 Sequential Program
6.3.2 Code Locking
6.3.3 Data Locking
6.3.4 Data Ownership
6.3.5 Locking Granularity and Performance
6.4 Parallel Fastpath
6.4.1 Reader/Writer Locking
6.4.2 Hierarchical Locking
6.4.3 Resource Allocator Caches
6.5 Beyond Partitioning
6.5.1 Work-Queue Parallel Maze Solver
6.5.2 Alternative Parallel Maze Solver
6.5.3 Performance Comparison I
6.5.4 Alternative Sequential Maze Solver
6.5.5 Performance Comparison II
6.5.6 Future Directions and Conclusions
6.6 Partitioning, Parallelism, and Optimization
7 Locking
7.1 Staying Alive
7.1.1 Deadlock
7.1.2 Livelock and Starvation
7.1.3 Unfairness
7.1.4 Inefficiency
7.2 Types of Locks
7.2.1 Exclusive Locks
7.2.2 Reader-Writer Locks
7.2.3 Beyond Reader-Writer Locks
7.2.4 Scoped Locking
7.3 Locking Implementation Issues
7.3.1 Sample Exclusive-Locking Implementation Based on Atomic Exchange
7.3.2 Other Exclusive-Locking Implementations
7.4 Lock-Based Existence Guarantees
7.5 Locking: Hero or Villain?
7.5.1 Locking For Applications: Hero!
7.5.2 Locking For Parallel Libraries: Just Another Tool
7.5.3 Locking For Parallelizing Sequential Libraries: Villain!
7.6 Summary
8 Data Ownership
8.1 Multiple Processes
8.2 Partial Data Ownership and pthreads
8.3 Function Shipping
8.4 Designated Thread
8.5 Privatization
8.6 Other Uses of Data Ownership
9 Deferred Processing
9.1 Reference Counting
9.1.1 Implementation of Reference-Counting Categories
9.1.2 Hazard Pointers
9.1.3 Linux Primitives Supporting Reference Counting
9.1.4 Counter Optimizations
9.2 Sequence Locks
9.3 Read-Copy Update (RCU)
9.3.1 Introduction to RCU
9.3.2 RCU Fundamentals
9.3.3 RCU Usage
9.3.4 RCU Linux-Kernel API
9.3.5 "Toy" RCU Implementations
9.3.6 RCU Exercises
9.4 Which to Choose?
9.5 What About Updates?
10 Data Structures
10.1 Motivating Application
10.2 Partitionable Data Structures
10.2.1 Hash-Table Design
10.2.2 Hash-Table Implementation
10.2.3 Hash-Table Performance
10.3 Read-Mostly Data Structures
10.3.1 RCU-Protected Hash Table Implementation
10.3.2 RCU-Protected Hash Table Performance
10.3.3 RCU-Protected Hash Table Discussion
10.4 Non-Partitionable Data Structures
10.4.1 Resizable Hash Table Design
10.4.2 Resizable Hash Table Implementation
10.4.3 Resizable Hash Table Discussion
10.4.4 Other Resizable Hash Tables
10.5 Other Data Structures
10.6 Micro-Optimization
10.6.1 Specialization
10.6.2 Bits and Bytes
10.6.3 Hardware Considerations
10.7 Summary
11 Validation
11.1 Introduction
11.1.1 Where Do Bugs Come From?
11.1.2 Required Mindset
11.1.3 When Should Validation Start?
11.1.4 The Open Source Way
11.2 Tracing
11.3 Assertions
11.4 Static Analysis
11.5 Code Review
11.5.1 Inspection
11.5.2 Walkthroughs
11.5.3 Self-Inspection
11.6 Probability and Heisenbugs
11.6.1 Statistics for Discrete Testing
11.6.2 Abusing Statistics for Discrete Testing
11.6.3 Statistics for Continuous Testing
11.6.4 Hunting Heisenbugs
11.7 Performance Estimation
11.7.1 Benchmarking
11.7.2 Profiling
11.7.3 Differential Profiling
11.7.4 Microbenchmarking
11.7.5 Isolation
11.7.6 Detecting Interference
11.8 Summary
12 Formal Verification
12.1 What are Promela and Spin?
12.2 Promela Example: Non-Atomic Increment
12.3 Promela Example: Atomic Increment
12.3.1 Combinatorial Explosion
12.4 How to Use Promela
12.4.1 Promela Peculiarities
12.4.2 Promela Coding Tricks
12.5 Promela Example: Locking
12.6 Promela Example: QRCU
12.6.1 Running the QRCU Example
12.6.2 How Many Readers and Updaters Are Really Needed?
12.6.3 Alternative Approach: Proof of Correctness
12.6.4 Alternative Approach: More Capable Tools
12.6.5 Alternative Approach: Divide and Conquer
12.7 Promela Parable: dynticks and Preemptible RCU
12.7.1 Introduction to Preemptible RCU and dynticks
12.7.2 Validating Preemptible RCU and dynticks
12.7.3 Lessons (Re)Learned
12.8 Simplicity Avoids Formal Verification
12.8.1 State Variables for Simplified Dynticks Interface
12.8.2 Entering and Leaving Dynticks-Idle Mode
12.8.3 NMIs From Dynticks-Idle Mode
12.8.4 Interrupts From Dynticks-Idle Mode
12.8.5 Checking For Dynticks Quiescent States
12.8.6 Discussion
12.9 Formal Verification and Memory Ordering
12.9.1 Anatomy of a Litmus Test
12.9.2 What Does This Litmus Test Mean?
12.9.3 Running a Litmus Test
12.9.4 CPPMEM Discussion
12.10 Summary
13 Putting It All Together
13.1 Counter Conundrums
13.1.1 Counting Updates
13.1.2 Counting Lookups
13.2 RCU Rescues
13.2.1 RCU and Per-Thread-Variable-Based Statistical Counters
13.2.2 RCU and Counters for Removable I/O Devices
13.2.3 Array and Length
13.2.4 Correlated Fields
13.3 Hashing Hassles
13.3.1 Correlated Data Elements
13.3.2 Update-Friendly Hash-Table Traversal
14 Advanced Synchronization
14.1 Avoiding Locks
14.2 Memory Barriers
14.2.1 Memory Ordering and Memory Barriers
14.2.2 If B Follows A, and C Follows B, Why Doesn't C Follow A?
14.2.3 Variables Can Have More Than One Value
14.2.4 What Can You Trust?
14.2.5 Review of Locking Implementations
14.2.6 A Few Simple Rules
14.2.7 Abstract Memory Access Model
14.2.8 Device Operations
14.2.9 Guarantees
14.2.10 What Are Memory Barriers?
14.2.11 Locking Constraints
14.2.12 Memory-Barrier Examples
14.2.13 The Effects of the CPU Cache
14.2.14 Where Are Memory Barriers Needed?
14.3 Non-Blocking Synchronization
14.3.1 Simple NBS
14.3.2 NBS Discussion
15 Ease of Use
15.1 What is Easy?
15.2 Rusty Scale for API Design
15.3 Shaving the Mandelbrot Set
16 Conflicting Visions of the Future
16.1 The Future of CPU Technology Ain't What it Used to Be
16.1.1 Uniprocessor Über Alles
16.1.2 Multithreaded Mania
16.1.3 More of the Same
16.1.4 Crash Dummies Slamming into the Memory Wall
16.2 Transactional Memory
16.2.1 Outside World
16.2.2 Process Modification
16.2.3 Synchronization
16.2.4 Discussion
16.3 Hardware Transactional Memory
16.3.1 HTM Benefits WRT Locking
16.3.2 HTM Weaknesses WRT Locking
16.3.3 HTM Weaknesses WRT Locking When Augmented
16.3.4 Where Does HTM Best Fit In?
16.3.5 Potential Game Changers
16.3.6 Conclusions
16.4 Functional Programming for Parallelism
A Important Questions
A.1 What Does "After" Mean?
A.2 What Time Is It?
B Synchronization Primitives
B.1 Organization and Initialization
B.1.1 smp_init():
B.2 Thread Creation, Destruction, and Control
B.2.1 create_thread()
B.2.2 smp_thread_id()
B.2.3 for_each_thread()
B.2.4 for_each_running_thread()
B.2.5 wait_thread()
B.2.6 wait_all_threads()
B.2.7 Example Usage
B.3 Locking
B.3.1 spin_lock_init()
B.3.2 spin_lock()
B.3.3 spin_trylock()
B.3.4 spin_unlock()
B.3.5 Example Usage
B.4 Per-Thread Variables
B.4.1 DEFINE_PER_THREAD()
B.4.2 DECLARE_PER_THREAD()
B.4.3 per_thread()
B.4.4 __get_thread_var()
B.4.5 init_per_thread()
B.4.6 Usage Example
B.5 Performance
C Why Memory Barriers?
C.1 Cache Structure
C.2 Cache-Coherence Protocols
C.2.1 MESI States
C.2.2 MESI Protocol Messages
C.2.3 MESI State Diagram
C.2.4 MESI Protocol Example
C.3 Stores Result in Unnecessary Stalls
C.3.1 Store Buffers
C.3.2 Store Forwarding
C.3.3 Store Buffers and Memory Barriers
C.4 Store Sequences Result in Unnecessary Stalls
C.4.1 Invalidate Queues
C.4.2 Invalidate Queues and Invalidate Acknowledge
C.4.3 Invalidate Queues and Memory Barriers
C.5 Read and Write Memory Barriers
C.6 Example Memory-Barrier Sequences
C.6.1 Ordering-Hostile Architecture
C.6.2 Example 1
C.6.3 Example 2
C.6.4 Example 3
C.7 Memory-Barrier Instructions For Specific CPUs
C.7.1 Alpha
C.7.2 AMD64
C.7.3 ARMv7-A/R
C.7.4 IA64
C.7.5 PA-RISC
C.7.6 POWER / PowerPC
C.7.7 SPARC RMO, PSO, and TSO
C.7.8 x86
C.7.9 zSeries
C.8 Are Memory Barriers Forever?
C.9 Advice to Hardware Designers
D Read-Copy Update Implementations
D.1 Sleepable RCU Implementation
D.1.1 SRCU Implementation Strategy
D.1.2 SRCU API and Usage
D.1.3 Implementation
D.1.4 SRCU Summary
D.2 Hierarchical RCU Overview
D.2.1 Review of RCU Fundamentals
D.2.2 Brief Overview of Classic RCU Implementation
D.2.3 RCU Desiderata
D.2.4 Towards a More Scalable RCU Implementation
D.2.5 Towards a Greener RCU Implementation
D.2.6 State Machine
D.2.7 Use Cases
D.2.8 Testing
D.2.9 Conclusion
D.3 Hierarchical RCU Code Walkthrough
D.3.1 Data Structures and Kernel Parameters
D.3.2 External Interfaces
D.3.3 Initialization
D.3.4 CPU Hotplug
D.3.5 Miscellaneous Functions
D.3.6 Grace-Period-Detection Functions
D.3.7 Dyntick-Idle Functions
D.3.8 Forcing Quiescent States
D.3.9 CPU-Stall Detection
D.3.10 Possible Flaws and Changes
D.4 Preemptible RCU
D.4.1 Conceptual RCU
D.4.2 Overview of Preemptible RCU Algorithm
D.4.3 Validation of Preemptible RCU
E Read-Copy Update in Linux
E.1 RCU Usage Within Linux
E.2 RCU Evolution
E.2.1 2.6.27 Linux Kernel
E.2.2 2.6.28 Linux Kernel
E.2.3 2.6.29 Linux Kernel
E.2.4 2.6.31 Linux Kernel
E.2.5 2.6.32 Linux Kernel
E.2.6 2.6.33 Linux Kernel
E.2.7 2.6.34 Linux Kernel
E.2.8 2.6.35 Linux Kernel
E.2.9 2.6.36 Linux Kernel
E.2.10 2.6.37 Linux Kernel
E.2.11 2.6.38 Linux Kernel
E.2.12 2.6.39 Linux Kernel
E.2.13 3.0 Linux Kernel
E.2.14 3.1 Linux Kernel
E.2.15 3.2 Linux Kernel
E.2.16 3.3 Linux Kernel
E.2.17 3.4 Linux Kernel
E.2.18 3.5 Linux Kernel
E.2.19 3.6 Linux Kernel
E.2.20 3.7 Linux Kernel
E.2.21 3.8 Linux Kernel
E.2.22 3.9 Linux Kernel
E.2.23 3.10 Linux Kernel
E.2.24 3.11 Linux Kernel
E.2.25 3.12 Linux Kernel
E.2.26 3.13 Linux Kernel
F Answers to Quick Quizzes
F.1 How To Use This Book
F.2 Introduction
F.3 Hardware and its Habits
F.4 Tools of the Trade
F.5 Counting
F.6 Partitioning and Synchronization Design
F.7 Locking
F.8 Data Ownership
F.9 Deferred Processing
F.10 Data Structures
F.11 Validation
F.12 Formal Verification
F.13 Putting It All Together
F.14 Advanced Synchronization
F.15 Ease of Use
F.16 Conflicting Visions of the Future
F.17 Important Questions
F.18 Synchronization Primitives
F.19 Why Memory Barriers?
F.20 Read-Copy Update Implementations
G Glossary and Bibliography
H Credits
H.1 Authors
H.2 Reviewers
H.3 Machine Owners
H.4 Original Publications
H.5 Figure Credits
H.6 Other Support
New releases appear at https://mirrors.edge.kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook.html
<span>This book examines what makes parallel programming hard, and describes design techniques that can help you avoid many parallel-programming pitfalls. It is primarily intended for low-level C/C++ code, but offers valuable lessons for other environments as well.</span>