The Performance Implications of Locality Information Usage in Shared-Memory Multiprocessors
✍ Authors: Frank Bellosa; Martin Steckermeier
- Publisher
- Elsevier Science
- Year
- 1996
- Language
- English
- File size
- 326 KB
- Volume
- 37
- Category
- Article
- ISSN
- 0743-7315
✦ Synopsis
The latest processor generations (e.g., HP PA-8000, MIPS R10000, or UltraSPARC) include a monitoring unit. A processor monitor can count events such as read/write cache misses and processor stall cycles due to load and store operations. This information is usually used only for offline profiling. However, why should we ignore this information when optimizing the behavior of multithreaded applications at runtime?
The parallelism expressed by "UNIX-like" heavyweight processes and shared-memory segments is coarse-grained and too inefficient for general-purpose parallel programming, because all operations on processes such as creation, deletion, and context switching invoke complex kernel activities and imply costs associated with cache and TLB misses due to address space changes.
Contemporary operating systems offer middleweight kernel-level threads decoupling address space and execution entities. Multiple kernel threads mapped to multiple processors can speed up a parallel application. However, kernel threads only offer a medium-grained programming model, because thread management inside the kernel (e.g., creating, blocking, and resuming threads) implies expensive maintenance operations on kernel structures.
By moving thread management and synchronization to user level, the cost of thread management operations can be drastically reduced, to within an order of magnitude of a procedure call [1]. In general, lightweight user-level threads, managed by a runtime library, are executed by kernel threads, which in turn are mapped onto the available physical processors by the kernel. Problems with this two-level scheduling arise from the interference of scheduling policies on different levels of control without any coordination or communication. Solutions to these problems are discussed in [2,8,9].
Efficient user-level threads, which make frequent context switches affordable, are predestined for fine-grained parallel applications. This type of application draws most profit from the use of locality information if the lifetime of cachelines exceeds scheduling cycles.
In this paper we propose a nonpreemptive user-level threads package with an application interface that informs the runtime system about repeatedly used memory regions. These hints trigger prefetch operations at each scheduling decision in order to hide memory latency. We outline