High Performance Computing: 36th International Conference, ISC High Performance 2021, Virtual Event, June 24 – July 2, 2021, Proceedings (Lecture Notes in Computer Science, 12728)
✍ Scribed by Bradford L. Chamberlain, Ana-Lucia Varbanescu, Hatem Ltaief, and Piotr Luszczek (editors)
- Publisher: Springer
- Year: 2021
- Tongue: English
- Leaves: 485
- Category: Library
✦ Synopsis
This book constitutes the refereed proceedings of the 36th International Conference on High Performance Computing, ISC High Performance 2021, held virtually during June 24 – July 2, 2021.
The 24 full papers presented were carefully reviewed and selected from 74 submissions. The papers cover a broad range of topics such as architecture, networks, and storage; machine learning, AI, and emerging technologies; HPC algorithms and applications; performance modeling, evaluation, and analysis; and programming environments and systems software.
✦ Table of Contents
Preface
Organization
Contents
Architecture, Networks, and Storage
Microarchitecture of a Configurable High-Radix Router for the Post-Moore Era
1 Introduction
2 Pisces Router Microarchitecture
2.1 The Configurable Communication Stack with Enhanced Link Error Tolerance
2.2 Multi-port Shared DAMQ with Data Prefetch
2.3 The Internal Switch Based on Aggregated Tiles
2.4 Packet Exception Process and Congestion Control
3 Performance Evaluation
4 Conclusion
References
BluesMPI: Efficient MPI Non-blocking Alltoall Offloading Designs on Modern BlueField Smart NICs
1 Introduction
1.1 Challenges
1.2 Motivation and Characterization
1.3 Contributions
1.4 Overview of BlueField Smart NICs
1.5 Experimental Setup
2 BluesMPI Designs
2.1 BluesMPI Non-blocking Alltoall Collective Offload Framework
2.2 Proposed Nonblocking Alltoall Designs in BluesMPI
3 Results
3.1 Performance Characterization of BluesMPI Framework
3.2 Performance of MPI Collective Operations
3.3 Application Evaluations
4 Related Work
5 Conclusion and Future Work
References
Lessons Learned from Accelerating Quicksilver on Programmable Integrated Unified Memory Architecture (PIUMA) and How That's Different from CPU
1 Introduction
2 Background
2.1 Mercury and Quicksilver
2.2 PIUMA
3 Quicksilver
3.1 High-Level Algorithm
3.2 A Deeper Analysis of cycle_tracking
4 CPU Optimizations
4.1 Engineering Optimizations
4.2 Algorithmic Optimizations
5 Quicksilver on PIUMA
5.1 Initial Porting Effort
5.2 Comparing PIUMA to Xeon
5.3 PIUMA Optimized Version
5.4 Exploring Memory Allocation Options on PIUMA
5.5 A Closer Look at Strong Scaling
5.6 Hitting the Scaling Limit on PIUMA
6 Related Work
7 Conclusion
References
A Hierarchical Task Scheduler for Heterogeneous Computing
1 Introduction
2 Background and Related Work
3 RANGER Architecture and Implementation
3.1 Baseline Accelerator Architecture
3.2 RANGER Architecture and Memory-Mapped IO Interface
3.3 Top-Level Scheduler
3.4 Low-Level Scheduler
3.5 Implementation Details of Accelerator Kernels
4 Experimental Evaluation
4.1 Application Benchmarks
4.2 Scalability Study
4.3 Overhead of the Local Schedulers
5 Conclusion
References
Machine Learning, AI, and Emerging Technologies
Auto-Precision Scaling for Distributed Deep Learning
1 Introduction
2 Related Work
3 APS: Auto-Precision-Scaling
3.1 The Limitation of the Loss Scaling Algorithm
3.2 Layer-Wise Precision for Scaling the Gradients
3.3 Technical Details for APS
4 Experiments
4.1 Training on Small-Scale Distributed Systems
4.2 Training on Large-Scale Distributed Systems
4.3 Performance Analysis
5 CPD: Customized-Precision Deep Learning
6 Conclusion
References
FPGA Acceleration of Number Theoretic Transform
1 Introduction
2 Related Work
3 Background
3.1 Fully Homomorphic Encryption (FHE)
3.2 Number Theoretic Transform (NTT)
3.3 Modular Reduction
3.4 Challenges in Accelerating NTT
4 Accelerator Design
4.1 Design Methodology
4.2 NTT Core
4.3 Permutation Network
5 Experiments and Results
5.1 Experimental Setup
5.2 Performance Evaluation
5.3 Resource Utilization
5.4 Evaluation of NTT Core and Streaming Permutation Network
5.5 Comparison with Prior Work
6 Conclusion
References
Designing a ROCm-Aware MPI Library for AMD GPUs: Early Experiences
1 Introduction
1.1 Contributions
2 Background
2.1 Radeon Open Compute (ROCm)
2.2 ROCm Remote Direct Memory Access (RDMA)
2.3 Inter-Process Communication (IPC)
2.4 Message Passing Interface (MPI)
2.5 Protocols for High-Performance Communication in MPI
3 Designing and Implementation of ROCm-Aware MPI
3.1 Overview of Technologies Offered by NVIDIA and AMD for GPU Based Communication
3.2 Designing Unified Device Abstraction Interface for Accelerator-Aware MPI
3.3 PeerDirect
3.4 CPU-Driven GPU Mapped Memory Copy Based Design
3.5 ROCm IPC Based Design
4 Performance Evaluation
4.1 Experimental Setup
4.2 Micro-Benchmark Evaluation
4.3 Application-Level Evaluation
5 Related Work
6 Conclusion
References
A Tunable Implementation of Quality-of-Service Classes for HPC Networks
1 Introduction
2 Background and Related Work
2.1 Communication Characteristics and Performance Targets
2.2 Managing Contention for Shared Channels on HPC Networks
2.3 QoS Solutions for HPC
3 Design of a Tunable QoS Solution
3.1 Flexible Traffic Shaping Using Two Rate Limits
3.2 Defining QoS Classes for HPC Traffic
4 Evaluation of QoS Solution
4.1 CODES Simulation Toolkit
4.2 Network Setup
4.3 Workload Setup
4.4 Bandwidth Shaping for Dynamic Workloads
4.5 Supporting Specially Defined QoS Classes
5 Discussion
5.1 Tuning Class Configurations to Match Workload Requirements
5.2 Production Deployment
6 Conclusions
References
Scalability of Streaming Anomaly Detection in an Unbounded Key Space Using Migrating Threads
1 Introduction
2 Background
2.1 Firehose Streaming Benchmark
2.2 Migrating Thread Architecture
3 Firehose on Migrating Threads
3.1 Datum Conversion and Assignment: Producers
3.2 Anomaly Detection: Consumers
3.3 Maintaining Hash Map Size: LRU List
4 Conventional Implementation Using MPI
5 Communication Overhead
6 Experimental Setup
6.1 Program Execution
6.2 Dataset Generation and Placement
6.3 Scaling Tests
7 Evaluation
7.1 Throughput Scalability
7.2 Overlapped Datum Conversion and Analysis
8 Conclusion
References
HTA: A Scalable High-Throughput Accelerator for Irregular HPC Workloads
1 Introduction
2 HTA - Background, Rationale, and Design
2.1 Partitioned Memory Controller
2.2 Interconnect
2.3 Packaging
2.4 HTA Architecture
3 Methodology
3.1 System Comparisons
3.2 Simulations
4 Evaluation
4.1 Evaluation of Partitioned Memory Controller
4.2 Evaluation of HTA
4.3 Comparison with Multi-GPU Systems
5 Related Work
6 Conclusion
References
Proctor: A Semi-Supervised Performance Anomaly Diagnosis Framework for Production HPC Systems
1 Introduction
2 Related Work and Background
2.1 Anomaly Detection and Autoencoders
2.2 Machine Learning for HPC Monitoring Analytics
3 Our Proposed Framework: PROCTOR
3.1 Feature Extraction
3.2 Unsupervised Pretraining
3.3 Supervised Training
3.4 Detection and Diagnosis at Runtime
4 Experimental Methodology
4.1 HPC Systems and Applications
4.2 Monitoring Framework
4.3 Synthetic Anomalies
4.4 Baselines
4.5 Implementation Details
5 Evaluation
5.1 Performance Metrics
5.2 Data Set Preparation
5.3 Anomaly Detection Results
5.4 Anomaly Diagnosis Results
5.5 Impact of Previously Unseen Anomalies
6 Conclusion
References
HPC Algorithms and Applications
COSTA: Communication-Optimal Shuffle and Transpose Algorithm with Process Relabeling
1 Introduction
2 Preliminaries and Notation
3 Communication Cost Function
3.1 Communication Graph
4 Communication-Optimal Process Relabeling (COPR)
4.1 The Formal Definition
4.2 COPR as Linear Assignment Problem
4.3 COPR Algorithm
5 COSTA: Comm-Optimal Shuffle and Transpose Alg.
6 Implementation Details
7 Performance Results
7.1 COSTA vs. ScaLAPACK
7.2 Process Relabeling
7.3 Real-World Application: RPA Simulations
8 Conclusion
References
Enabling AI-Accelerated Multiscale Modeling of Thrombogenesis at Millisecond and Molecular Resolutions on Supercomputers
1 Introduction
2 Related Work
3 The Methods
3.1 The Multiscale Model
3.2 AI-MTS
3.3 Numerical Experiments
3.4 The Measures
3.5 The Supercomputers
4 The in Silico Experiment Results
4.1 Platelet Dynamics
4.2 Blood Flow
5 Performance Analysis
6 Discussions and Future Work
References
Evaluation of the NEC Vector Engine for Legacy CFD Codes
1 Introduction
1.1 NEC Vector Architecture
1.2 Comparison with Reference Architectures
1.3 Benchmark Studies
2 FDL3DI
2.1 Problem Description
2.2 Initial Performance Observation
2.3 Optimization Process for FDL3DI
2.4 Performance Analysis Using the NEC Toolchain
2.5 Optimization Techniques
3 FDL3DI Performance with Optimization
3.1 Roofline Analysis of FDL3DI
4 Conclusions/Future Work
References
Distributed Sparse Block Grids on GPUs
1 Introduction
2 Single-GPU Sparse Block Grids
3 Multi-GPU Distributed Sparse Block Grids
3.1 Packing and Serialization
3.2 Unpacking and Deserialization
4 Implementation in OpenFPM
4.1 Optimizing CPU Performance
5 Benchmark Results
5.1 Single-GPU Performance
5.2 Multi-GPU Performance
6 Related Work
7 Conclusions
References
iPUG: Accelerating Breadth-First Graph Traversals Using Manycore Graphcore IPUs
1 Introduction
2 IPU Hardware
2.1 Architecture
2.2 Programming Model
3 Background
3.1 Related Work
3.2 Graph Algorithms in the Language of Linear Algebra
4 BFS Implementation on IPU
4.1 Parallel BFS
4.2 Parallel Top-Down
4.3 Mapping Data and Compute
4.4 Challenges of IPU Graph Implementations
4.5 Optimizations
5 Experimental Setup
6 Experimental Results
6.1 Performance Comparison Experiment
6.2 Graph 500 Scaling Experiment
7 Discussion
8 Conclusion
References
Performance Modeling, Evaluation, and Analysis
Optimizing GPU-Enhanced HPC System and Cloud Procurements for Scientific Workloads
1 Introduction
2 Cost Model Based HPC Procurement
2.1 Methodology
2.2 Demonstration on a Proxy Scientific Workload
3 Prior Work
4 Demonstration and Results
4.1 Description of Proxy Workload
4.2 Scaling and Cost Optimization of Proxy Workload
5 Conclusion
References
A Performance Analysis of Modern Parallel Programming Models Using a Compute-Bound Application
1 Introduction
2 Background
2.1 High-Performance Molecular Docking
2.2 Modern Parallel Programming Models
2.3 Performance Portability
3 Evaluation Methodology
3.1 A BUDE Mini-App
3.2 Performance Analysis
4 Results and Performance Analysis
4.1 CPUs
4.2 GPUs
5 Towards Portable High-Performance Code
6 Future Work
7 Reproducibility
8 Conclusion
References
Analytic Modeling of Idle Waves in Parallel Programs: Communication, Cluster Topology, and Noise Impact
1 Introduction
1.1 Idle Waves in Barrier-Free Bulk-Synchronous Parallel Programs
1.2 Related Work
1.3 Contribution
2 Test Bed and Experimental Methods
3 Idle Wave Propagation Velocity for Scalable Code
3.1 Execution Characteristics
3.2 Categorization of Communication Characteristics
3.3 Analytical Model of Idle Wave Propagation
3.4 Experimental Validation
4 Idle Waves Interacting with MPI Collectives
5 Idle Wave Decay
5.1 Topological Decay
5.2 Noise-Induced Decay
6 Summary and Future Work
References
Performance of the Supercomputer Fugaku for Breadth-First Search in Graph500 Benchmark
1 Introduction
2 The Supercomputer Fugaku
3 Hybrid-BFS for Large-Scale System
3.1 Algorithm for Shared Memory System
3.2 Algorithm for Distributed Memory System
4 Improvement to Hybrid-BFS
4.1 Bitmap-Based Representation for Adjacency Matrix
4.2 Sorting of Vertex Number
4.3 Yoo's Distribution of Adjacency Matrix
4.4 Load Balancing in Top-Down Approach
4.5 Communication in Bottom-Up Approach
5 Performance Optimization for Fugaku
5.1 Graph500 Benchmark
5.2 Setting Parameters
5.3 Optimization of the Number of Processes per Node
5.4 Use of Eager Method
5.5 Power Management
5.6 Six-Dimensional Process Mapping
6 Performance Evaluation on Fugaku
6.1 Performance on Whole Fugaku System
6.2 Comparison with Other Systems
7 Conclusion and Future Work
References
Under the Hood of SYCL – An Initial Performance Analysis with An Unstructured-Mesh CFD Application
1 Introduction
2 Parallelizing Unstructured-Mesh Applications
3 SYCL Parallelizations with OP2
3.1 Coloring
3.2 Atomics
4 Performance
4.1 CPU Results
4.2 NVIDIA and AMD GPU Results
4.3 Intel Iris XE MAX Performance
5 Bottleneck Analysis
6 Conclusion
References
Characterizing Containerized HPC Applications Performance at Petascale on CPU and GPU Architectures
1 Introduction
1.1 Contributions
2 Related Work
3 Background
3.1 Container Technologies
3.2 Microbenchmarks and Applications
4 Performance Evaluation
4.1 Experimental Setup
4.2 Micro-benchmark Evaluation
4.3 Application Level Evaluation
4.4 IO Benchmark and Application
4.5 Capacity Workload Performance
4.6 Outcomes
5 Discussion
5.1 Containerization in the Linux Kernel
5.2 Container Portability
5.3 Recommendations
6 Conclusion
References
Ubiquitous Performance Analysis
1 Introduction
2 State of the Art
3 Ubiquitous Performance Analysis
3.1 Overview
3.2 Code Instrumentation
3.3 ConfigManager: A Measurement Control API in Caliper
3.4 Adiak: A Library for Recording Program Metadata
3.5 SPOT: A Web Interface for Ubiquitous Performance Analysis
3.6 Ubiquitous Data Collection
4 Example: LULESH
4.1 Region Instrumentation with Caliper
4.2 Metadata Collection with Adiak
4.3 Integrating the Caliper ConfigManager API
4.4 Data Analysis and Visualization in SPOT
5 Overhead Evaluation
6 Case Study: Marbl
7 Conclusion
References
Programming Environments and Systems Software
Artemis: Automatic Runtime Tuning of Parallel Execution Parameters Using Machine Learning
1 Introduction
2 Background
3 Artemis: Design and Implementation
3.1 Design
3.2 Training and Optimization
3.3 Validation and Retraining
3.4 Extending RAJA OpenMP Execution
3.5 Enhancing Kokkos CUDA Execution
3.6 Training Measurement
3.7 Training Model Analysis and Optimization
4 Experimentation Setup
4.1 Comparators
4.2 Applications
4.3 Hardware and Software Platforms
4.4 Statistical Evaluation
5 Evaluation
5.1 Instrumentation Overhead
5.2 Model Training and Evaluation Overhead
5.3 Speedup on Cleverleaf
5.4 Effectiveness of Cleverleaf Policy Selection
5.5 Strong Scaling with Different Node Counts
5.6 Speedup on LULESH
5.7 Speedup on Kokkos Kernels SpMV
6 Related Work
7 Conclusion and Future Work
References
Correction to: Performance of the Supercomputer Fugaku for Breadth-First Search in Graph500 Benchmark
Correction to: Chapter “Performance of the Supercomputer Fugaku for Breadth-First Search in Graph500 Benchmark” in: B. L. Chamberlain et al. (Eds.): High Performance Computing, LNCS 12728, https://doi.org/10.1007/978-3-030-78713-4_20
Author Index