
[ACM Press 2011 International Conference for High Performance Computing, Networking, Storage and Analysis - Seattle, Washington (2011.11.12-2011.11.18)] Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis on - SC '11 - Fast implementation of DGEMM on Fermi GPU

โœ Scribed by Tan, Guangming; Li, Linchuan; Triechle, Sean; Phillips, Everett; Bao, Yungang; Sun, Ninghui


Book ID
121735989
Publisher
ACM Press
Year
2011
Tongue
English
Weight
341 KB
Category
Article
ISBN
145030771X


Synopsis


In this paper we present a thorough account of tuning double-precision matrix-matrix multiplication (DGEMM) on the Fermi GPU architecture. We choose an optimal algorithm with blocking in both shared memory and registers to satisfy the constraints of the Fermi memory hierarchy. Our optimization strategy is further guided by a performance model based on micro-architecture benchmarks. Our optimizations include software pipelining, use of vector memory operations, and instruction scheduling. Our best CUDA algorithm achieves performance comparable to the latest CUBLAS library. We further improve upon this with an implementation in the native machine language, leading to a 20% increase in performance. That is, the achieved peak performance (efficiency) is improved from 302 Gflop/s (58%) to 362 Gflop/s (70%).
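The two-level blocking mentioned in the synopsis can be illustrated with a minimal CPU sketch in C. This is not the authors' CUDA or native-code kernel: the function name `dgemm_blocked` and the tile sizes `TILE` and `MICRO` are hypothetical, chosen only for illustration. The outer tile stands in for a block held in shared memory and the small accumulator array stands in for a register block. Software pipelining, vector memory operations, and instruction scheduling are not shown.

```c
#include <stddef.h>

#define TILE  32  /* outer tile edge: stands in for a shared-memory block (assumed size) */
#define MICRO 4   /* inner tile edge: stands in for a register block (assumed size) */

/* Computes C = C + A * B for row-major n x n matrices.
 * Assumes n is a multiple of TILE, and TILE is a multiple of MICRO. */
void dgemm_blocked(size_t n, const double *A, const double *B, double *C)
{
    for (size_t i0 = 0; i0 < n; i0 += TILE)
    for (size_t j0 = 0; j0 < n; j0 += TILE)
    for (size_t k0 = 0; k0 < n; k0 += TILE)
        /* Inside one outer tile, walk MICRO x MICRO sub-blocks of C. */
        for (size_t i = i0; i < i0 + TILE; i += MICRO)
        for (size_t j = j0; j < j0 + TILE; j += MICRO) {
            /* Small accumulator block that a compiler can keep in registers. */
            double acc[MICRO][MICRO] = {{0.0}};

            for (size_t k = k0; k < k0 + TILE; ++k)
                for (size_t ii = 0; ii < MICRO; ++ii) {
                    double a = A[(i + ii) * n + k];
                    for (size_t jj = 0; jj < MICRO; ++jj)
                        acc[ii][jj] += a * B[k * n + (j + jj)];
                }

            /* Write the accumulated partial products back to C once per k-tile. */
            for (size_t ii = 0; ii < MICRO; ++ii)
                for (size_t jj = 0; jj < MICRO; ++jj)
                    C[(i + ii) * n + (j + jj)] += acc[ii][jj];
        }
}
```

The point of the structure is data reuse: each element of A loaded in the inner loop is used for MICRO multiply-adds, and each outer tile is revisited many times before the loops move on, which is the same motivation the synopsis gives for blocking against the Fermi memory hierarchy.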


SIMILAR VOLUMES


[ACM Press 2011 International Conference
โœ Bautista-Gomez, Leonardo; Tsuboi, Seiji; Komatitsch, Dimitri; Cappello, Franck; ๐Ÿ“‚ Article ๐Ÿ“… 2011 ๐Ÿ› ACM Press ๐ŸŒ English โš– 523 KB

Large scientific applications deployed on current petascale systems expend a significant amount of their execution time dumping checkpoint files to remote storage. New fault tolerant techniques will be critical to efficiently exploit post-petascale systems. In this work, we propose a low-overhead hi

[ACM Press 2011 International Conference