Evaluation of Collective I/O Implementations on Parallel Architectures
By Phillip M. Dickens; Rajeev Thakur
- Publisher
- Elsevier Science
- Year
- 2001
- Language
- English
- Weight
- 269 KB
- Volume
- 61
- Category
- Article
- ISSN
- 0743-7315
Synopsis
In this paper, we evaluate the impact on performance of various implementation techniques for collective I/O operations, and we do so across four important parallel architectures. We show that a naive implementation of collective I/O does not result in significant performance gains for any of the architectures, but that an optimized implementation does provide excellent performance across all of the platforms under study. Furthermore, we demonstrate that there exists a single implementation strategy that provides the best performance for all four computational platforms. Next, we evaluate implementation techniques for thread-based collective I/O operations. We show that the most obvious implementation technique, which is to spawn a thread to execute the whole collective I/O operation in the background, frequently provides the worst performance, often performing much worse than just executing the collective I/O routine entirely in the foreground. To improve performance, we explore an alternate approach where part of the collective I/O operation is performed in the background, and part is performed in the foreground. We demonstrate that this implementation technique can provide significant performance gains, offering up to a 50% improvement over implementations that do not attempt to overlap collective I/O and computation.
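The "optimized implementation" contrasted with the naive one here is typically a two-phase scheme: processes first exchange data so that each one holds a contiguous region of the file, and each then issues a single large write instead of many small noncontiguous ones. A minimal single-process Python sketch of that redistribution step under an assumed round-robin data distribution (the function name, data layout, and in-memory "file" are illustrative, not taken from the paper):

```python
# Two-phase collective I/O sketch: P processes each hold every P-th
# element of a file (an interleaved, round-robin distribution).
# Phase 1 redistributes the data so each process owns one contiguous
# block; phase 2 then needs only one large write per process.

def two_phase_write(per_process_data, out):
    """per_process_data[p][i] holds file element p + i*P (interleaved)."""
    P = len(per_process_data)              # number of simulated processes
    n = len(per_process_data[0])           # elements held per process
    block = (P * n) // P                   # contiguous block per aggregator

    # Phase 1: "communication" phase -- collect the contiguous block
    # that aggregator p is responsible for. Here this is plain index
    # arithmetic; on a real machine it is an all-to-all exchange.
    blocks = []
    for p in range(P):
        blk = []
        for off in range(p * block, (p + 1) * block):
            src, idx = off % P, off // P   # owner and local index of offset
            blk.append(per_process_data[src][idx])
        blocks.append(blk)

    # Phase 2: I/O phase -- each aggregator issues one large,
    # contiguous write instead of n small strided writes.
    for p in range(P):
        out[p * block:(p + 1) * block] = blocks[p]
    return out

# Three "processes", four elements each, interleaved by rank.
data = [[0, 3, 6, 9], [1, 4, 7, 10], [2, 5, 8, 11]]
result = two_phase_write(data, [None] * 12)
```

After the exchange, `result` is the file in order (0 through 11), written in three contiguous blocks rather than twelve scattered elements; the thread-based variant the abstract describes would run phase 1 in a background thread while the application computes, keeping only phase 2 in the foreground.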
SIMILAR VOLUMES
We present the design, implementation, and evaluation of a runtime system based on collective I/O techniques for irregular applications. The design is motivated by the requirements of a large number of science and engineering applications including teraflops applications, where the data must be reor…
Two different numerical solutions of the two-component kinetic collection equation were implemented on parallel computers. The parallelization approach included domain decomposition and MPI commands for communications. Four different parallel codes were tested. A dynamic decomposition based on an oc…