𝔖 Scriptorium
✦   LIBER   ✦

Performance Analysis of Parallel Applications for HPC

โœ Scribed by Jidong Zhai, Yuyang Jin, Wenguang Chen, Weimin Zheng


Publisher: Springer
Year: 2023
Tongue: English
Leaves: 259
Edition: 1st ed. 2023
Category: Library

✦ Synopsis


This book presents a hybrid static-dynamic approach for efficient performance analysis of parallel applications on HPC systems. Performance analysis is essential to finding performance bottlenecks and understanding the performance behaviors of parallel applications on HPC systems. However, current performance analysis techniques usually incur significant overhead. Our book introduces a series of approaches for lightweight performance analysis.

We combine static and dynamic analysis to reduce the overhead of performance analysis. Based on this hybrid static-dynamic approach, we then propose several innovative techniques for various performance analysis scenarios, including communication analysis, memory analysis, noise analysis, computation analysis, and scalability analysis. Through these specific performance analysis techniques, we convey to readers the idea of using static analysis to support dynamic analysis.

To gain the most from the book, readers should have a basic grasp of parallel computing, computer architecture, and compilation techniques.


✦ Table of Contents


Preface
Acknowledgments
Contents
Acronyms
1 Background and Overview
1.1 Background of Performance Analysis
1.2 Hybrid Static-Dynamic Approaches
1.3 Overview of Book Structure
References
Part I Performance Analysis Methods: Communication Analysis
2 Fast Communication Trace Collection
2.1 Introduction
2.2 Related Work
2.3 Design Overview
2.4 Live-Propagation Slicing Algorithm
2.4.1 Slicing Criterion
2.4.2 Dependence of MPI Programs
2.4.3 Intra-procedural Analysis
2.4.4 Inter-procedural Analysis
2.4.5 Discussions
2.5 Implementation
2.5.1 Compilation Framework
2.5.2 Runtime Environment
2.6 Evaluation
2.6.1 Methodology
2.6.2 Validation
2.6.3 Performance
2.6.3.1 Memory Consumption
2.6.3.2 Execution Time
2.7 Applications
2.7.1 Optimize Process Placement of MPI Programs
2.7.2 Sensitivity Analysis of Communication Patterns to Input Parameters
2.8 Limitations and Discussions
2.9 Conclusions
References
3 Structure-Based Communication Trace Compression
3.1 Introduction
3.2 Overview
3.3 Extracting Communication Structure
3.3.1 Intra-procedural Analysis Algorithm
3.3.2 Inter-procedural Analysis Algorithm
3.4 Runtime Communication Trace Compression
3.4.1 Intra-process Communication Trace Compression
3.4.2 Inter-process Communication Trace Compression
3.5 Decompression and Performance Analysis
3.6 Implementation
3.7 Evaluation
3.7.1 Methodology
3.7.2 Communication Trace Size
3.7.3 Trace Compression Overhead
3.7.3.1 Intra-process Overhead
3.7.3.2 Inter-process Overhead
3.7.3.3 Compilation Overhead of Cypress
3.7.4 Case Study
3.7.4.1 Analyzing Communication Patterns
3.7.4.2 Performance Prediction
3.8 Related Work
3.9 Conclusions
References
Part II Performance Analysis Methods: Memory Analysis
4 Informed Memory Access Monitoring
4.1 Introduction
4.2 Overview
4.2.1 Spindle Framework
4.2.2 Sample Input/Output: Memory Trace Collector
4.3 Static Analysis
4.3.1 Intra-procedural Analysis
4.3.1.1 Extracting Program Control Structure
4.3.1.2 Building Memory Dependence Trees
4.3.2 Inter-procedural Analysis
4.3.3 Special Cases and Complications
4.4 Spindle-Based Runtime Monitoring
4.4.1 Runtime Information Collection
4.4.2 Spindle-Based Tool Developing
4.4.2.1 Memory Bug Detector (S-Detector)
4.4.2.2 Memory Trace Collector (S-Tracer)
4.5 Evaluation
4.5.1 Experiment Setup
4.5.2 Spindle Compilation Overhead
4.5.3 S-Detector for Memory Bug Detection
4.5.4 S-Tracer for Memory Trace Collection
4.6 Related Work
4.7 Conclusion and Future Work
References
Part III Performance Analysis Methods: Scalability Analysis
5 Graph Analysis for Scalability Analysis
5.1 Introduction
5.2 Design Overview
5.3 Graph Generation
5.3.1 Static Program Structure Graph Construction
5.3.2 Sampling-Based Profiling
5.3.2.1 Associate Vertices with Performance Data
5.3.2.2 Graph-Guided Communication Dependence
5.3.2.3 Indirect Function Calls
5.3.3 Program Performance Graph
5.4 Scaling Loss Detection
5.4.1 Location-Aware Problematic Vertex Detection
5.4.2 Backtracking Root Cause Detection
5.5 Implementation and Usage
5.6 Evaluation
5.6.1 Experimental Setup
5.6.2 PSG Analysis
5.6.3 Performance Overhead
5.6.4 Case Studies with Real Applications
5.6.4.1 Zeus-MP
5.6.4.2 SST
5.6.4.3 Nekbone
5.7 Related Work
5.8 Conclusion
References
6 Performance Prediction for Scalability Analysis
6.1 Introduction
6.1.1 Motivation
6.1.2 Our Approach and Contributions
6.2 Base Prediction Framework
6.3 Definitions
6.3.1 Communication Sequence
6.3.2 Sequential Computation Vector
6.4 Sequential Computation Time
6.4.1 Deterministic Replay
6.4.2 Acquire Sequential Computation Time
6.4.3 Concurrent Replay
6.5 Representative Replay
6.5.1 Challenges for Large-Scale Applications
6.5.2 Computation Similarity
6.5.3 Select Representative Processes
6.6 Convolute Computation and Communication Performance
6.7 Implementation
6.8 Evaluation
6.8.1 Methodology
6.8.2 Sequential Computation Time
6.8.2.1 The Number of Representative Replay Groups
6.8.2.2 Validation of Sequential Computation Time
6.8.2.3 Analysis of Sequential Computation Time
6.8.3 Performance Prediction for HPC Platforms
6.8.4 Performance Prediction for Amazon Cloud Platform
6.8.5 Message Log Size and Replay Overhead
6.8.6 Performance of SIM-MPI Simulator
6.9 Discussions
6.10 Related Work
6.11 Conclusion
References
Part IV Performance Analysis Methods: Noise Analysis
7 Lightweight Noise Detection
7.1 Introduction
7.2 vSensor Design
7.3 Fixed-Workload V-Sensors
7.3.1 Fixed-Workload V-Sensor Definition
7.3.2 Analysis of Intra-procedure
7.3.3 Analysis of Inter-procedure
7.3.4 Multiple-Process Analysis
7.3.5 Whole Program Analysis
7.4 Regular-Workload V-Sensors
7.4.1 Instruction Sequences
7.4.2 Regular Workload Definition
7.5 Program Instrumentation
7.5.1 V-Sensor Selection
7.5.2 Inserting External V-Sensors
7.5.3 Analyzing External V-Sensors
7.6 Runtime Performance Variance Detection
7.6.1 Smoothing Data
7.6.2 Normalizing Performance
7.6.3 History Comparison
7.6.4 Multiple-Process Analysis
7.6.5 Performance Variance Report
7.7 Experiment
7.7.1 Experimental Setup
7.7.2 Overall Analysis of Fixed V-Sensors
7.7.3 Analysis of Regular V-Sensors
7.7.4 External Analysis of V-Sensors
7.7.5 V-Sensor Distribution
7.7.6 Injecting Noise
7.7.7 Case Studies
7.8 Related Work
7.9 Conclusion
References
8 Production-Run Noise Detection
8.1 Introduction
8.2 Overview
8.3 Performance Variance Detection
8.3.1 Fixed-Workload Fragments
8.3.2 State Transition Graph
8.3.3 Performance Data Collection
8.3.4 Identifying Fixed-Workload Fragments
8.3.5 Performance Variance Detection
8.4 Performance Variance Diagnosis
8.4.1 Variance Breakdown Model
8.4.2 Quantifying Time of Factors
8.4.3 Progressive Variance Diagnosis
8.5 Implementation
8.6 Evaluation
8.6.1 Evaluation Setup
8.6.2 Overhead and Detection Coverage
8.6.3 Verification of Fixed Workload Identification
8.6.4 Comparing with Profiling Tools
8.6.5 Case Studies
8.6.5.1 Detection of a Hardware Bug
8.6.5.2 Detection of Memory Problem
8.6.5.3 Detection of IO Performance Variance
8.7 Related Work
8.8 Conclusion
References
Part V Performance Analysis Framework
9 Domain-Specific Framework for Performance Analysis
9.1 Introduction
9.2 Overview
9.2.1 PerFlow Framework
9.2.2 Example: A Communication Analysis Task
9.3 Graph-Based Performance Abstraction
9.3.1 Definition of PAG
9.3.2 Hybrid Static-Dynamic Analysis
9.3.3 Performance Data Embedding
9.3.4 Views of PAG
9.4 PerFlow Programming Abstraction
9.4.1 PerFlowGraph
9.4.2 PerFlowGraph Element
9.4.3 Building Performance Analysis Pass
9.4.3.1 Low-Level API Design
9.4.3.2 Example Cases
9.4.4 Performance Analysis Paradigm
9.4.5 Usage of PerFlow
9.5 Evaluation
9.5.1 Experimental Setup
9.5.2 Overhead and PAG
9.5.3 Case Study A: ZEUS-MP
9.5.4 Case Study B: LAMMPS
9.5.5 Case Study C: Vite
9.6 Related Work
9.7 Conclusion
References
10 Conclusion and Future Work


📜 SIMILAR VOLUMES


Performance Analysis of Parallel Applica
โœ Jidong Zhai, Yuyang Jin, Wenguang Chen, Weimin Zheng ๐Ÿ“‚ Library ๐Ÿ“… 2023 ๐Ÿ› Springer ๐ŸŒ English

<p><span>This book presents a hybrid static-dynamic approach for efficient performance analysis of parallel applications on HPC systems. Performance analysis is essential to finding performance bottlenecks and understanding the performance behaviors of parallel applications on HPC systems. However,

Performance Modelling Techniques for Par
โœ D. A. Grove; P.D. Coddington ๐Ÿ“‚ Library ๐Ÿ“… 2009 ๐Ÿ› Nova Science Publishers, Incorporated ๐ŸŒ English

Ever since the invention of the computer, users have demanded more and more computational power to tackle increasingly complex problems. A common means of increasing the amount of computational power available for solving a problem is to use parallel computing. Unfortunately, however, creating effic

Performance Analysis and Optimization of
โœ Qinchuan Li, Chao Yang, Lingmin Xu, Wei Ye ๐Ÿ“‚ Library ๐Ÿ“… 2023 ๐Ÿ› Springer-HUST ๐ŸŒ English

<span>This book investigates the performance analysis and optimization design of parallel manipulators in detail. It discusses performance evaluation indices for workspace, kinematic, stiffness, and dynamic performance, single- and multi-objective optimization design methods, and ways to improve opt

Analysis, transformation and optimizatio
โœ Prihozhy A. A. ๐Ÿ“‚ Library ๐Ÿ“… 2019 ๐Ÿ› ะญะ‘ะก ะ›ะฐะฝัŒ ๐ŸŒ Russian

This book studies hardware and software specifications at the algorithmic level from the standpoint of measuring and extracting the potential parallelism hidden in them. It investigates the possibilities of using this parallelism for the synthesis and optimization of high-performance software and hardware imp