Programming on Parallel Machines: GPU, Multicore, Clusters and More
By Norm Matloff
- Year: 2013
- Language: English
- Pages: 335
- Category: Library
No payment or registration required. For personal study only.
Synopsis
"Why is this book different from all other parallel programming books?"
Suitable for either students or professionals.
Practical viewpoint:
There is very little theoretical analysis of parallel algorithms, such as O() analysis, maximum theoretical speedup, directed acyclic graphs and so on.
Extensive coverage of "wizardry" aspects, i.e. material known to experienced practitioners but generally not in books, such as coverage of loop iteration scheduling, memory effects of storing large arrays and so on.
Appendices cover systems background, crucial in applied work but always just "assumed" to be knowledge possessed by the readers.
Considerable attention is paid to techniques for debugging.
Uses the main parallel platforms---OpenMP, CUDA and MPI---rather than languages that at this stage are largely experimental, such as the elegant-but-not-yet-mainstream Cilk.
Starts with real parallel code right away in Chapter 1, with examples from pthreads, OpenMP and MPI.
Constantly evolving: Like all my open source textbooks, this one is constantly evolving. I continue to add new topics, new examples, more timing analyses, and so on, and of course fix bugs and improve the exposition.
Prerequisites: The student must be reasonably adept in programming, and have math background through linear algebra. (An appendix to the book reviews the parts of the latter needed for this book.)
Table of Contents
Execution Speed......Page 21
Our Focus Here......Page 22
Multiprocessor Topologies......Page 23
Basic Architecture......Page 24
Example: Matrix-Vector Multiply......Page 25
Programmer View......Page 26
Example: Pthreads Prime Numbers Finder......Page 27
Role of the OS......Page 32
Example: Sampling Bucket Sort......Page 33
Programmer View......Page 36
Example: MPI Prime Numbers Finder......Page 37
R snow Package......Page 40
Communication Bottlenecks......Page 45
What People Mean by "Embarrassingly Parallel"......Page 46
Iterative Algorithms......Page 47
Example: Matrix-Vector Multiply......Page 48
Load Balance, Revisited......Page 50
Example: Mutual Web Outlinks......Page 51
Latency and Bandwidth......Page 52
Relative Merits: Performance of Shared-Memory Vs. Message-Passing......Page 53
Issues Particular to Shared-Memory Systems......Page 54
What Is Shared?......Page 55
Interleaving......Page 56
Bank Conflicts and Solutions......Page 57
Example: Code to Implement Padding......Page 59
SMP Systems......Page 60
NUMA Systems......Page 61
Crossbar Interconnects......Page 62
Omega (or Delta) Interconnects......Page 64
Comparative Analysis......Page 65
Why Have Memory in Modules?......Page 66
LOCK Prefix on Intel Processors......Page 67
Fetch-and-Add Instructions......Page 69
Cache Coherency......Page 70
Example: the MESI Cache Coherency Protocol......Page 73
Memory-Access Consistency Policies......Page 75
Optimal Number of Threads......Page 78
Software Distributed Shared Memory......Page 79
Case Study: JIAJIA......Page 82
Barrier Implementation......Page 85
An Attempt to Write a Reusable Version......Page 86
Use of Wait Operations......Page 87
Butterfly Barriers......Page 89
Example: Dijkstra Shortest-Path Algorithm......Page 91
The OpenMP parallel Pragma......Page 94
Scope Issues......Page 95
The OpenMP barrier Pragma......Page 96
Example: Dijkstra with Parallel for Loops......Page 97
Controlling the Partitioning of Work to Threads: the schedule Clause......Page 100
Example: In-Place Matrix Transpose......Page 102
The OpenMP reduction Clause......Page 103
Example: Mandelbrot Set......Page 104
The Task Directive......Page 107
Example: Quicksort......Page 108
The OpenMP atomic Clause......Page 109
Memory Consistency and the flush Pragma......Page 110
Compiling......Page 111
Debugging......Page 112
The Effect of Problem Size......Page 113
Some Fine Tuning......Page 114
OpenMP Internals......Page 117
Example: Root Finding......Page 118
Example: Mutual Outlinks......Page 120
Example: Transforming an Adjacency Matrix......Page 121
Other Examples of OpenMP Code in This Book......Page 124
Overview......Page 127
Example: Calculate Row Sums......Page 128
SIMT Architecture......Page 132
"OS in Hardware"......Page 133
Shared and Global Memory......Page 134
Global-Memory Performance Issues......Page 137
Host/Device Memory Transfer Performance Issues......Page 138
Other Types of Memory......Page 139
Threads Hierarchy......Page 140
What's NOT There......Page 142
Synchronization, Within and Between Blocks......Page 143
Hardware Requirements, Installation, Compilation, Debugging......Page 144
Example: Improving the Row Sums Program......Page 146
Example: Finding the Mean Number of Mutual Outlinks......Page 148
Example: Finding Prime Numbers......Page 149
Example: Finding Cumulative Sums......Page 152
Example: Transforming an Adjacency Matrix......Page 153
Error Checking......Page 156
Short Vectors......Page 157
CUBLAS......Page 158
Example: Row Sums Once Again......Page 159
Other CUDA Examples in This Book......Page 161
Compiling to OpenMP......Page 163
Example: Counting the Number of Unique Values in an Array......Page 164
Example: A Plain-C Wrapper for Thrust sort()......Page 168
Example: Calculating Percentiles in an Array......Page 169
Example: Doubling Every kth Element of an Array......Page 171
Scatter and Gather Operations......Page 173
Example: Matrix Transpose......Page 174
Example: Matrix Transpose Again......Page 175
A Timing Comparison......Page 177
Example: Transforming an Adjacency Matrix......Page 181
Prefix Scan......Page 183
Error Messages......Page 184
Other Examples of Thrust Code in This Book......Page 185
Overview......Page 187
Definitions......Page 188
The Network Is Literally the Weakest Link......Page 190
Scatter/Gather Operations......Page 191
History......Page 193
Performance Issues......Page 194
The Algorithm......Page 195
The MPI Code......Page 196
MPI_Comm_size() and MPI_Comm_rank()......Page 199
MPI_Send()......Page 200
MPI_Recv()......Page 201
Example: Removing 0s from an Array......Page 202
Debugging MPI Code......Page 203
Example: Refined Dijkstra Code......Page 204
MPI_Bcast()......Page 207
MPI_Reduce()/MPI_Allreduce()......Page 208
The MPI_Scatter()......Page 209
Example: Cumulative Sums......Page 210
Example: an MPI Solution to the Mutual Outlinks Problem......Page 212
Creating Communicators......Page 214
Buffering, Etc.......Page 215
Safety......Page 216
Use of MPI from Other Languages......Page 217
Other MPI Examples in This Book......Page 218
Cloud Computing......Page 219
Overview of Operations......Page 220
Example: Word Count......Page 221
Example: Maximum Air Temperature by Year......Page 222
Role of Disk Files......Page 223
Running Hadoop......Page 224
Example: Transforming an Adjacency Graph......Page 225
Example: Identifying Outliers......Page 228
Debugging Hadoop Streaming Programs......Page 231
It's a Lot More Than Just Programming......Page 232
Example: Permutations......Page 233
General Strategies for Parallel Scan Computation......Page 234
Example: Parallel Prefix, Run-Length Decoding in OpenMP......Page 237
Example: Run-Length Decompression in Thrust......Page 239
Partitioned Matrices......Page 241
Message-Passing Case......Page 243
Fox's Algorithm......Page 244
Example: Matrix Multiply in OpenMP......Page 245
Example: Matrix Multiply in CUDA......Page 246
Example: Graph Connectedness......Page 249
Example: Matrix Inversion......Page 250
Solving Systems of Linear Equations......Page 251
Gaussian Elimination......Page 252
Example: Gaussian Elimination in CUDA......Page 253
The Jacobi Algorithm......Page 254
Example: OpenMP Implementation of the Jacobi Algorithm......Page 255
The Power Method......Page 256
Parallel Computation......Page 257
Sparse Matrices......Page 258
Libraries......Page 259
The Separation Process......Page 261
Example: OpenMP Quicksort......Page 263
Hyperquicksort......Page 264
Message Passing Mergesort on a Tree Topology......Page 265
Bitonic Mergesort......Page 266
The Much-Maligned Bubble Sort......Page 268
Example: CUDA Implementation of Odd/Even Transposition Sort......Page 269
Bucket Sort with Sampling......Page 271
Enumeration Sort......Page 275
One-Dimensional Fourier Series......Page 277
Discrete Fourier Transforms......Page 281
One-Dimensional Data......Page 282
Inversion......Page 283
Two-Dimensional Data......Page 284
The Fast Fourier Transform......Page 285
Parallelizing Computation of the Two-Dimensional Transform......Page 286
FFTW......Page 287
Example: Audio Smoothing in R......Page 288
Edge Detection......Page 289
Keeping the Pixel Intensities in the Proper Range......Page 290
Vector Space Issues (optional section)......Page 291
Bandwidth: How to Read the San Francisco Chronicle Business Page (optional section)......Page 293
What Is It?......Page 295
The Market Basket Problem......Page 296
Serial Algorithms......Page 297
Probability Density Estimation......Page 298
Kernel-Based Density Estimation......Page 299
Histogram Computation for Images......Page 302
Clustering......Page 303
Example: k-Means Clustering in R......Page 305
Principal Component Analysis (PCA)......Page 306
Monte Carlo Simulation......Page 307
Many Processes, Taking Turns......Page 309
Make Sure You Understand the Goals......Page 311
How It Works......Page 312
Storage......Page 313
Memory Allocation......Page 314
Terminology and Notation......Page 317
Matrix Addition and Multiplication......Page 318
Linear Independence......Page 319
Eigenvalues and Eigenvectors......Page 320
Correspondences......Page 323
First Sample Programming Session......Page 324
Second Sample Programming Session......Page 327
Third Sample Programming Session......Page 329
The Reduce() Function......Page 330
S3 Classes......Page 331
Handy Utilities......Page 332
Graphics......Page 333
Online Help......Page 334
Complex Numbers......Page 335
SIMILAR VOLUMES
Innovations in hardware architecture, like hyper-threading or multicore processors, mean that parallel computing resources are available for inexpensive desktop computers. In only a few years, many standard software products will be based on concepts of parallel programming implemented on such hardw
This textbook covers the new development in processor architecture and parallel hardware. It provides detailed descriptions of parallel programming techniques that are necessary for developing efficient programs for multicore processors as well as for parallel cluster systems and supercomputers. The