
Riposte. In: Proceedings of the 21st International Conference on Parallel Architectures and Compilation Techniques (PACT '12), Minneapolis, Minnesota, USA, 2012.09.19-2012.09.23. ACM Press.

Scribed by Talbot, Justin; DeVito, Zachary; Hanrahan, Pat


Book ID: 120610835
Publisher: ACM Press
Year: 2012
Weight: 406 KB
Category: Article
ISBN: 1450311822


Synopsis


There is a growing utilization gap between modern hardware and modern programming languages for data analysis. Due to power and other constraints, recent processor design has sought improved performance through increased SIMD and multi-core parallelism. At the same time, high-level, dynamically typed languages for data analysis have become popular. These languages emphasize ease of use and high productivity, but have, in general, low performance and limited support for exploiting hardware parallelism.

In this paper, we describe Riposte, a new runtime for the R language, which bridges this gap. Riposte uses tracing, a technique commonly used to accelerate scalar code, to dynamically discover and extract sequences of vector operations from arbitrary R code. Once the traces are extracted, we can fuse them to eliminate unnecessary memory traffic, compile them to use hardware SIMD units, and schedule them to run across multiple cores, allowing us to fully utilize the available parallelism on modern shared-memory machines. Our evaluation shows that Riposte can run vector R code near the speed of hand-optimized C, 5-50x faster than the open source implementation of R, and can also scale linearly to 32 cores for some tasks. Across 12 different workloads we achieve an overall average speedup of over 150x without explicit programmer parallelization.
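The idea of recording vector operations and fusing them can be illustrated with a small sketch. This is not Riposte's implementation: the `Trace` class and its `leaf`, `force`, and `total` methods are hypothetical names invented for illustration, the code is plain interpreted Python, and it performs no SIMD compilation or multi-core scheduling. It only shows how deferring elementwise operations lets a whole expression run in one pass without allocating an intermediate vector per operation.

```python
import operator


class Trace:
    """Deferred elementwise expression over equal-length vectors (hypothetical)."""

    def __init__(self, op, operands, length):
        self.op = op              # None marks a leaf holding data or a scalar
        self.operands = operands  # list or scalar for a leaf, else two child nodes
        self.length = length

    @classmethod
    def leaf(cls, values):
        values = list(values)
        return cls(None, values, len(values))

    def _record(self, other, op):
        # Record the operation instead of executing it.
        if not isinstance(other, Trace):
            other = Trace(None, other, self.length)  # scalar constant leaf
        if other.length != self.length:
            raise ValueError("vector lengths differ")
        return Trace(op, [self, other], self.length)

    def __add__(self, other):
        return self._record(other, operator.add)

    def __sub__(self, other):
        return self._record(other, operator.sub)

    def __mul__(self, other):
        return self._record(other, operator.mul)

    def element(self, i):
        """Evaluate the recorded expression at a single index."""
        if self.op is None:
            data = self.operands
            return data[i] if isinstance(data, list) else data
        left, right = self.operands
        return self.op(left.element(i), right.element(i))

    def force(self):
        """One fused loop: only the final result vector is allocated."""
        return [self.element(i) for i in range(self.length)]

    def total(self):
        """A reduction fused into the same pass, with no result vector at all."""
        return sum(self.element(i) for i in range(self.length))


x = Trace.leaf([1.0, 2.0, 3.0])
y = Trace.leaf([4.0, 5.0, 6.0])
expr = (x * y + 1.0) - x   # only records the operations; nothing runs yet
print(expr.force())        # [4.0, 9.0, 16.0]
print(expr.total())        # 29.0
```

An eager evaluator would allocate one temporary vector for `x * y`, another for the addition, and a third for the subtraction. Here each element flows through all three operations before the next element is touched, which is the memory-traffic saving the synopsis attributes to fusing traces.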


SIMILAR VOLUMES


[ACM Press the 21st international confer
Ros, Alberto; Kaxiras, Stefanos | Article | 2012 | ACM Press | 608 KB

Much of the complexity and overhead (directory, state bits, invalidations) of a typical directory coherence implementation stems from the effort to make it "invisible" even to the strongest memory consistency model. In this paper, we show that a much simpler, directory-less/broadcast-less, multicore