High Performance Parallelism Pearls: Multicore and Many-core Programming Approaches, Volume 2
Edited by Jeffers, Jim; Reinders, James
- Publisher
- Morgan Kaufmann Publishers
- Year
- 2015
- Language
- English
- Pages
- 574
- Category
- Library
Synopsis
High Performance Parallelism Pearls Volume 2 offers another set of examples that demonstrate how to leverage parallelism. Like Volume 1, the techniques included here explain how to use processors and coprocessors with the same programming, illustrating the most effective ways to combine Intel Xeon Phi coprocessors with Intel Xeon and other multicore processors. The book includes examples of successful programming efforts drawn from industries and domains such as biomedicine, genetics, finance, manufacturing, imaging, and more. Each chapter in this edited work includes detailed explanations of the programming techniques used, while showing high-performance results on both Intel Xeon Phi coprocessors and multicore processors. Dozens of new examples and case studies illustrate not just the features of Xeon-powered systems, but also how to leverage parallelism across these heterogeneous systems.
Table of Contents
Front Cover......Page 1
High Performance Parallelism Pearls: Multicore and Many-core Programming Approaches......Page 4
Copyright......Page 5
Contents......Page 6
Contributors......Page 18
Acknowledgments......Page 44
Making a bet on many-core......Page 46
HPC journey and revelation......Page 47
This book is timely and important......Page 48
Inspired by 61 cores: A new era in programming......Page 50
Applications and techniques......Page 52
OpenMP and nested parallelism......Page 53
Tuning prefetching......Page 54
The future of many-core......Page 55
For more information......Page 56
Numerical weather prediction: Background and motivation......Page 58
WSM6 in the NIM......Page 59
Shared-memory parallelism and controlling horizontal vector length......Page 61
Array alignment......Page 63
Compile-time constants for loop and array bounds......Page 64
Performance improvements......Page 70
Summary......Page 73
For more information......Page 74
The motivation and background......Page 76
Benchmark setup......Page 77
Code optimization......Page 81
Removal of the vertical dimension from temporary variables for a reduced memory footprint......Page 82
Collapse i- and j-loops into smaller cells for smaller footprint per thread......Page 83
Addition of vector alignment directives......Page 86
Summary of the code optimizations......Page 88
VTune performance metrics......Page 89
Summary......Page 90
For more information......Page 91
Pairwise sequence alignment......Page 94
Parallelization on a single coprocessor......Page 95
Multi-threading using OpenMP......Page 96
Vectorization using SIMD intrinsics......Page 97
Parallelization across multiple coprocessors using MPI......Page 99
Performance results......Page 100
Summary......Page 103
For more information......Page 104
Chapter 5: Accelerated Structural Bioinformatics for Drug Discovery......Page 106
Parallelism enables proteome-scale structural bioinformatics......Page 107
Benchmarking dataset......Page 108
Code profiling......Page 109
Porting eFindSite for coprocessor offload......Page 111
Parallel version for a multicore processor......Page 116
Task-level scheduling for processor and coprocessor......Page 117
Summary......Page 121
For more information......Page 123
Chapter 6: Amber PME Molecular Dynamics Optimization......Page 124
Theory of MD......Page 125
Acceleration of neighbor list building using the coprocessor......Page 127
Acceleration of direct space sum using the coprocessor......Page 128
Exclusion list optimization......Page 130
Reduce data transfer and computation in offload code......Page 132
PME direct space sum and neighbor list work......Page 133
PME reciprocal space sum work......Page 135
Results......Page 136
Conclusions......Page 138
For more information......Page 139
The opportunity......Page 142
Packet processing architecture......Page 144
The symmetric communication interface......Page 145
Memory registration......Page 146
Mapping remote memory via scif_mmap()......Page 149
Optimization #1: The right API for the job......Page 150
Optimization #2: Benefit from write combining (WC) memory type......Page 153
Optimization #4: "Shadow" pointers for efficient FIFO management......Page 156
Optimization #5: Tickless kernels......Page 158
Optimization #7: Miscellaneous optimizations......Page 159
Results......Page 160
Conclusions......Page 161
For more information......Page 162
Introduction......Page 164
Pricing equation for American option......Page 165
Initial C/C++ implementation......Page 166
Floating point numeric operation control......Page 167
Reuse as much as possible and reinvent as little as possible......Page 170
Subexpression evaluation......Page 171
Define and use vector data......Page 173
Vector arithmetic operations......Page 174
Branch statements......Page 175
Calling the vector version and the scalar version of the program......Page 177
Comparing vector and scalar version......Page 179
C/C++ vector extension versus #pragma SIMD......Page 180
Thread parallelization......Page 181
Memory allocation in NUMA system......Page 183
Scale from multicore to many-core......Page 184
For more information......Page 187
Chapter 9: Wilson Dslash Kernel from Lattice QCD Optimization......Page 190
The Wilson-Dslash kernel......Page 191
Performance expectations......Page 194
Additional tricks: compression......Page 195
First implementation and performance......Page 197
Running the naive code on Intel Xeon Phi coprocessor......Page 200
Evaluation of the naive code......Page 201
Data layout for vectorization......Page 202
3.5D blocking......Page 204
Load balancing......Page 205
SMT threading......Page 206
Lattice traversal......Page 207
Code generation with QphiX-Codegen......Page 208
Implementing the instructions......Page 210
Generating Dslash......Page 211
Prefetching......Page 213
Generating the code......Page 214
Performance results for QPhiX......Page 215
Other benefits......Page 217
The end of the road?......Page 219
For more information......Page 220
Analyzing the CMB with Modal......Page 222
Splitting the loop into parallel tasks......Page 225
Introducing nested parallelism......Page 229
Nested OpenMP parallel regions......Page 230
OpenMP 4.0 teams......Page 231
Manual nesting......Page 232
Inner loop optimization......Page 233
Results......Page 235
Comparison of nested parallelism approaches......Page 237
For more information......Page 240
Chapter 11: Visual Search Optimization......Page 242
Image acquisition and processing......Page 243
Scale-space extrema detection......Page 245
Keypoint matching......Page 246
Applications......Page 247
A study of parallelism in the visual search application......Page 248
Database (DB) level parallelism......Page 249
Flann library parallelism......Page 250
Database threads scaling......Page 253
KD-tree scaling with dbthreads......Page 255
For more information......Page 259
Chapter 12: Radio Frequency Ray Tracing......Page 262
Background......Page 263
StingRay system architecture......Page 264
Optimization examples......Page 268
Parallel RF simulation with OpenMP......Page 269
Parallel RF visualization with ispc......Page 275
Summary......Page 277
For more information......Page 278
The Uintah computational framework......Page 280
Cross-compiling the UCF......Page 281
Exploring thread affinity patterns......Page 285
Thread placement with PThreads......Page 286
Implementing scatter affinity with PThreads......Page 287
Simulation configuration......Page 288
Coprocessor-side results......Page 289
Further analysis......Page 291
Summary......Page 292
For more information......Page 293
Background......Page 294
Design of pyMIC......Page 295
The high-level interface......Page 296
The low-level interface......Page 299
Example: singular value decomposition......Page 300
Overview......Page 303
DFT algorithm......Page 304
Offloading......Page 306
Runtime code generation......Page 311
Offloading......Page 312
Performance......Page 313
Performance of pyMIC......Page 315
GPAW......Page 316
PyFR......Page 318
For more information......Page 319
The challenge of heterogeneous computing......Page 322
Matrix multiply......Page 323
Tiling for task concurrency......Page 324
Pipelining within a stream......Page 326
Stream concurrency within a computing domain......Page 327
Small matrix performance......Page 328
Trade-offs in degree of tiling and number of streams......Page 331
Tiled hStreams algorithm......Page 333
The hStreams library and framework......Page 335
Features......Page 337
How it works......Page 339
Related work......Page 340
Cholesky factorization......Page 341
Performance......Page 345
LU factorization......Page 347
Recap......Page 348
Summary......Page 349
For more information......Page 350
Tiled hStreams matrix multiplier example source......Page 352
Motivation......Page 356
When to use MPI interprocess shared memory......Page 357
1-D ring: from MPI messaging to shared memory......Page 358
Modifying MPPTEST halo exchange to include MPI SHM......Page 362
Evaluation environment and results......Page 364
Summary......Page 369
For more information......Page 370
Coarse-grained versus fine-grained parallelism......Page 372
Fine-grained OpenMP code......Page 374
Partial coarse-grained OpenMP code......Page 375
Fully coarse-grained OpenMP code......Page 376
Performance results with the stencil code......Page 380
Parallelism in numerical weather prediction models......Page 383
Summary......Page 384
For More Information......Page 385
Science: better approximate solutions......Page 386
About the reference application......Page 388
Parallelism in ES applications......Page 389
OpenMP 4.0 Affinity and hot Teams of Intel OpenMP runtime......Page 392
Hot teams motivation......Page 393
Setting up experiments......Page 394
MPI versus OpenMP......Page 395
DGEMM experiments......Page 397
User code experiments......Page 398
Summary: try multilevel parallelism in your applications......Page 403
For more information......Page 404
The GPU-HEOM application......Page 406
Building expectations......Page 407
OpenCL in a nutshell......Page 408
The Hexciton kernel OpenCL implementation......Page 409
Memory layout optimization......Page 411
Prefetching......Page 414
OpenCL performance results......Page 415
Performance portability in OpenCL......Page 416
Work-groups......Page 418
Local memory......Page 419
OpenCL performance portability results......Page 421
Substituting the OpenCL runtime......Page 422
C++ SIMD libraries......Page 423
Further optimizing the OpenMP 4.0 kernel......Page 424
OpenMP benchmark results......Page 425
Summary......Page 427
For more information......Page 428
Five benchmarks......Page 430
Experimental setup and time measurements......Page 431
HotSpot benchmark optimization......Page 432
Avoiding an array copy......Page 434
Applying blocking and reducing the number of instructions......Page 435
Vectorization of the inner block code......Page 437
HotSpot optimization conclusions......Page 443
LUD benchmark......Page 445
CFD benchmark......Page 446
Summary......Page 448
For more information......Page 450
The importance of prefetching for performance......Page 452
Prefetching on Intel Xeon Phi coprocessors......Page 453
Hardware prefetching......Page 454
Stream Triad......Page 455
Tuning prefetching......Page 456
Prefetching metrics for Stream Triad......Page 458
Useful coprocessor hardware event counters for more in-depth analysis......Page 460
Compiler prefetch distance tuning for Stream Triad......Page 461
Compiler prefetch distance tuning for Smith-Waterman......Page 462
Using intrinsic prefetches for hard-to-predict access patterns in SHOC MD......Page 463
Tuning hardware prefetching for Smith-Waterman on a processor......Page 467
Tuning hardware prefetching for SHOC MD on a processor......Page 468
For more information......Page 469
Chapter 22: SIMD Functions Via OpenMP......Page 472
SIMD vectorization overview......Page 473
Loop vectorization......Page 474
SIMD-enabled functions......Page 478
Optimizing for an architecture with compiler options and the processor(...) clause......Page 480
Supporting multiple processor types......Page 483
Vector functions in C++......Page 484
uniform(this) clause......Page 485
Summary......Page 489
For more information......Page 491
The importance of vectorization......Page 492
About DL_MESO LBE......Page 493
Intel vectorization advisor and the underlying technology......Page 494
A life cycle for experimentation......Page 495
Optimizing the compute equilibrium loop (lbpSUB:744)......Page 497
Advisor survey......Page 498
Fixing the problem......Page 499
The fGetOneMassSite function......Page 501
Advisor survey......Page 502
Fixing the problem......Page 503
The loop in function fPropagationSwap......Page 504
Dependency analysis......Page 505
Exploring possible vectorization strategies......Page 506
Projected gain and efficiency......Page 507
Memory access patterns analysis......Page 508
A final MAP analysis......Page 510
Fixing the problem......Page 511
Summary......Page 512
For more information......Page 513
Chapter 24: Portable Explicit Vectorization Intrinsics......Page 514
Why vectorization?......Page 515
Portable vectorization with OpenVec......Page 517
The vector type......Page 518
Memory allocation and alignment......Page 519
Built-in functions......Page 520
If else vectorization......Page 523
To mask or not to mask?......Page 525
Math reductions......Page 526
Compiling OpenVec code......Page 528
Real-world example......Page 529
Performance results......Page 531
Developing toward the future......Page 533
Summary......Page 535
For more information......Page 536
Motivation to act: Importance of power optimization......Page 538
Thread mapping matters: OpenMP affinity......Page 539
Methodology......Page 540
Data collection......Page 543
Analysis......Page 544
Data center: Interpretation via waterfall power data charts......Page 549
NWPerf......Page 550
Installing and configuring NWPerf and CView......Page 551
Interpreting the waterfall charts......Page 554
Summary......Page 560
For more information......Page 561
Author Index......Page 562
Subject Index......Page 564