Optimizing HPC Applications with Intel Cluster Tools

โœ Scribed by Supalov, Alexander;Semin, Andrey;Klemm, Michael;Dahnken, Christopher


Year: 2014
Language: English
Pages: 291
Publisher: Apress
Category: Library


✦ Synopsis


Optimizing HPC Applications with Intel® Cluster Tools takes the reader on a tour of the fast-growing area of high performance computing and the optimization of hybrid programs. These programs typically combine distributed memory and shared memory programming models and use the Message Passing Interface (MPI) and OpenMP for multi-threading to achieve the ultimate goal of high performance at low power consumption on enterprise-class workstations and compute clusters.

The book focuses on optimization for clusters consisting of the Intel® Xeon processor, but the optimization methodologies also apply to the Intel® Xeon Phi coprocessor and heterogeneous clusters mixing both architectures. Besides the tutorial and reference content, the authors address and refute many myths and misconceptions surrounding the topic. The text is augmented and enriched by descriptions of real-life situations.

✦ What You'll Learn

Practical, hands-on examples show how to make clusters and workstations based on Intel® Xeon processors and Intel® Xeon Phi coprocessors "sing" in Linux environments

How to master the synergy of Intel® Parallel Studio XE 2015 Cluster Edition, which includes Intel® Composer XE, Intel® MPI Library, Intel® Trace Analyzer and Collector, Intel® VTune Amplifier XE, and many other useful tools

How to achieve immediate and tangible optimization results while refining your understanding of software design principles

✦ Who This Book Is For

Software professionals will use this book to design, develop, and optimize their parallel programs on Intel platforms. Students of computer science and engineering will value the book as a comprehensive reader, suitable to many optimization courses offered around the world. The novice reader will enjoy a thorough grounding in the exciting world of parallel computing.

Table of Contents

Foreword by Bronis de Supinski, CTO, Livermore Computing, LLNL

Introduction

Chapter 1: No Time to Read This Book?

Chapter 2: Overview of Platform Architectures

Chapter 3: Top-Down Software Optimization

Chapter 4: Addressing System Bottlenecks

Chapter 5: Addressing Application Bottlenecks: Distributed Memory

Chapter 6: Addressing Application Bottlenecks: Shared Memory

Chapter 7: Addressing Application Bottlenecks: Microarchitecture

Chapter 8: Application Design Considerations

✦ Detailed Contents


Contents at a Glance......Page 3
Contents......Page 280
About the Authors......Page 286
About the Technical Reviewers......Page 288
Acknowledgments......Page 289
Foreword......Page 290
Introduction......Page 4
Using Intel MPI Library......Page 11
Using Intel Composer XE......Page 12
Gather Built-in Statistics......Page 13
Optimize Thread Placement......Page 15
Analyze Optimization and Vectorization Reports......Page 16
References......Page 20
Latency, Throughput, Energy, and Power......Page 21
Peak Performance as the Ultimate Limit......Page 24
Scalability and Maximum Parallel Speedup......Page 25
Bottlenecks and a Bit of Queuing Theory......Page 27
Roofline Model......Page 28
Increasing Single-Threaded Performance: Where You Can and Cannot Help......Page 30
Process More Data with SIMD Parallelism......Page 31
Use More Independent Threads on the Same Node......Page 35
Don't Limit Yourself to a Single Server......Page 36
HPC Hardware Architecture Overview......Page 37
A Multicore Workstation or a Server Compute Node......Page 38
Coprocessor for Highly Parallel Applications......Page 42
Group of Similar Nodes Form an HPC Cluster......Page 43
Other Important Components of HPC Systems......Page 45
Summary......Page 46
References......Page 47
The Three Levels and Their Impact on Performance......Page 48
System Level......Page 51
Application Level......Page 52
Working Against the Memory Wall......Page 53
Distributed Memory Parallelization......Page 54
Shared Memory Parallelization......Page 55
Other Existing Approaches and Methods......Page 56
Addressing Pipelines and Execution......Page 57
Closed-Loop Methodology......Page 58
Iterating the Optimization Process......Page 59
References......Page 61
Classifying System-Level Bottlenecks......Page 63
Identifying Issues Related to System Condition......Page 64
Characterizing Problems Caused by System Configuration......Page 67
Understanding System-Level Performance Limits......Page 71
Checking General Compute Subsystem Performance......Page 72
Testing Memory Subsystem Performance......Page 75
Testing I/O Subsystem Performance......Page 78
Characterizing Application System-Level Issues......Page 81
Selecting Performance Characterization Tools......Page 82
Monitoring the I/O Utilization......Page 84
Analyzing Memory Bandwidth......Page 89
Summary......Page 92
References......Page 93
Algorithm for Optimizing MPI Performance......Page 94
Gauging Default Intranode Communication Performance......Page 95
Gauging Default Internode Communication Performance......Page 100
Discovering Default Process Layout and Pinning Details......Page 104
Gauging Physical Core Performance......Page 107
Example 1: Initial HPL Performance Investigation......Page 109
Learning Application Behavior......Page 114
Example 2: MiniFE Performance Investigation......Page 115
Choosing Representative Workload(s)......Page 118
Example 2 (cont.): MiniFE Performance Investigation......Page 119
Example 2 (cont.): MiniFE Performance Investigation......Page 121
Example 2 (cont.): MiniFE Performance Investigation......Page 122
Example 2 (cont.): MiniFE Performance Investigation......Page 125
Example 2 (cont.): MiniFE Performance Investigation......Page 129
Dealing with Load Imbalance......Page 130
Addressing Load Imbalance......Page 131
Example 2 (cont.): MiniFE Performance Investigation......Page 132
Example 3: MiniMD Performance Investigation......Page 134
Optimizing MPI Performance......Page 138
Addressing MPI Performance Issues......Page 139
Understanding Communication Paths......Page 140
Using IP over IB......Page 141
Detecting and Classifying Improper Process Layout and Pinning Issues......Page 142
Controlling Process Layout......Page 143
Controlling the Detailed Process Layout......Page 144
Controlling the Process Pinning......Page 145
Example 4: MiniMD Performance Investigation on Xeon Phi......Page 147
Example 5: MiniGhost Performance Investigation......Page 152
Tuning Intel MPI for the Platform......Page 154
Tuning Point-to-Point Settings......Page 155
Bypassing Shared Memory for Intranode Communication......Page 156
Choosing the Best Collective Algorithms......Page 157
Disabling the Dynamic Connection Mode......Page 160
Fine-Tuning the Message-Passing Progress Engine......Page 161
What Else?......Page 162
Example 5 (cont.): MiniGhost Performance Investigation......Page 163
Avoiding Superfluous Synchronization......Page 164
Using Collective Operations......Page 165
Betting on the Computation/Communication Overlap......Page 166
Replacing Blocking Collective Operations by MPI-3 Nonblocking Ones......Page 167
Using Accelerated MPI File I/O......Page 168
Example 5 (cont.): MiniGhost Performance Investigation......Page 169
Automatically Checking MPI Program Correctness......Page 172
Comparing Application Traces......Page 173
Collecting and Analyzing Hardware Counter Information in ITAC......Page 175
Summary......Page 176
References......Page 177
Profiling Your Application......Page 179
Using VTune Amplifier XE for Hotspots Profiling......Page 180
Hotspots for the HPCG Benchmark......Page 181
Compiler-Assisted Loop/Function Profiling......Page 184
Sequential Code and Detecting Load Imbalances......Page 186
Thread Synchronization and Locking......Page 188
Dealing with Memory Locality and NUMA Effects......Page 192
Controlling OpenMP Thread Placement......Page 197
Thread Placement in Hybrid Applications......Page 202
Summary......Page 205
References......Page 206
Overview of a Modern Processor Pipeline......Page 207
Pipelined Execution......Page 209
Data Conflicts......Page 210
Control Conflicts......Page 211
Out-of-order vs. In-order Execution......Page 212
Speculative Execution: Branch Prediction......Page 213
Memory Subsystem......Page 215
Putting It All Together: A Final Look at the Sandy Bridge Pipeline......Page 216
A Top-down Method for Categorizing the Pipeline Performance......Page 217
Intel Composer XE Usage for Microarchitecture Optimizations......Page 218
Using Optimization and Vectorization Reports to Read the Compiler's Mind......Page 219
The AVX Instruction Set......Page 222
Data Dependences......Page 224
Data Aliasing......Page 226
Array Notations......Page 227
vector......Page 229
simd......Page 230
What Are Intrinsics?......Page 231
First Steps: Loading and Storing......Page 233
Arithmetic......Page 234
Data Rearrangement......Page 235
Dealing with Disambiguation......Page 238
Profile-Guided Optimization......Page 240
unroll/nounroll......Page 241
inline, noinline, forceinline......Page 242
When Optimization Leads to Wrong Results......Page 243
Using a Standard Library Method......Page 245
Using a Manual Implementation in C......Page 248
Vectorization with Directives......Page 249
Analyzing Pipeline Performance with Intel VTune Amplifier XE......Page 244
References......Page 251
Abstraction and Generalization of the Platform Architecture......Page 253
Types of Abstractions......Page 254
Raw Hardware vs. Virtualized Hardware in the Cloud......Page 257
Questions about Application Design......Page 258
Designing for Performance and Scaling......Page 259
Designing for Flexibility and Performance Portability......Page 260
Data Layout......Page 261
Structured Approach to Express Parallelism......Page 265
Understanding Bounds and Projecting Bottlenecks......Page 266
Data Storage or Transfer vs. Recalculation......Page 267
Total Productivity Assessment......Page 268
References......Page 269
Index......Page 271


✦ Similar Volumes


Optimizing HPC Applications with Intel®…
By Alexander Supalov, Andrey Semin, Michael Klemm, Christopher Dahnken (auth.) · Library · 2014 · Apress · English

Optimizing HPC Applications with Intel® Cluster Tools takes the reader on a tour of the fast-growing area of high performance computing and the optimization of hybrid programs. These programs typically combine distributed memory and shared memory programming models an…

Optimizing HPC Applications with Intel C…
By Alexander Supalov, Andrey Semin, Michael Klemm, Christopher Dahnken · Library · 2014 · Apress · English

Optimizing HPC Applications with Intel Cluster Tools takes the reader on a tour of the fast-growing area of high performance computing and the optimization of hybrid programs. These programs typically combine distributed memory and shared memory programming models and use the Message Passing Interfa…

Optimizing HPC Applications with Intel C…
By Christopher Dahnken, Michael Klemm, Andrey Semin, Alexander Supalov · Library · 2014 · Apress (Imprint: Apress) · English

Optimizing HPC Applications with Intel® Cluster Tools takes the reader on a tour of the fast-growing area of high performance computing and the optimization of hybrid programs. These programs typically combine distributed memory and shared memory programming models and use t…

Optimizing HPC Applications with Intel C…
By Christopher Dahnken, Michael Klemm, Andrey Semin, Alexander Supalov · Library · 2014 · Apress · English

Optimizing HPC Applications with Intel® Cluster Tools takes the reader on a tour of the fast-growing area of high performance computing and the optimization of hybrid programs. These programs typically combine distributed memory and shared memory programming models and use the Message Passing Interf…
