This paper addresses optimal mapping of parallel programs composed of a chain of data parallel tasks onto the processors of a parallel system. The input to the programs is a stream of data sets, each of which is processed in order by the chain of tasks. This computation structure, also referred to a
Models and Scheduling Algorithms for Mixed Data and Task Parallel Programs
โ Scribed by Soumen Chakrabarti; James Demmel; Katherine Yelick
- Publisher
- Elsevier Science
- Year
- 1997
- Tongue
- English
- Weight
- 400 KB
- Volume
- 47
- Category
- Article
- ISSN
- 0743-7315
No coin nor oath required. For personal study only.
โฆ Synopsis
An increasing number of scientific programs exhibit two forms of parallelism, often in a nested fashion. At the outer level, the application comprises coarse-grained task parallelism, with dependencies between tasks reflected by an acyclic graph. At the inner level, each node of the graph is a data-parallel operation on arrays. Designers of languages, compilers, and runtime systems are building mechanisms to support such applications by providing processor groups and array remapping capabilities. In this paper we explore how to supplement these mechanisms with policy. What properties of an application, its data size, and the parallel machine determine the maximum potential gains from using both kinds of parallelism? It turns out that large gains can be expected only for specific task graph structures. For such applications, what are practical and effective ways to allocate processors to the nodes of the task graph? In principle one could solve the NP-complete problem of finding the best possible allocation of arbitrary processor subsets to nodes in the task graph. Instead of this, our analysis and simulations show that a simple switched scheduling paradigm, which alternates between pure task and pure data parallelism, provides nearly optimal performance for the task graphs considered here. Furthermore, our scheme is much simpler to implement, has less overhead than the optimal allocation, and would be attractive even if the optimal allocation was free to compute. To evaluate switching in real applications, we implemented a switching task scheduler in the parallel numerical library ScaLAPACK and used it in a nonsymmetric eigenvalue program. Even for fairly large input sizes, the efficiency improves by factors of 1.5 on the Intel Paragon and 2.5 on the IBM SP-2. The remapping and scheduling overhead is negligible, between 0.5 and 5%.
๐ SIMILAR VOLUMES
This paper presents BCL, a border-based coordination language focused on the solution of numerical applications. Our approach provides a simple parallelism model. Coordination and computational aspects are clearly separated. The former are established using the coordination language and the latter a
## Abstract Parallel proximalโpoint algorithms for mixed finite element models of flow in the subsurface are presented. The applied methodology corresponds to operator splitting and nonโoverlapping domain decomposition methods, combined with resolvent or proximation characterizations and proximalโp