JZUS - Journal of Zhejiang University SCIENCE

ENGINEERING Information Technology & Electronic Engineering

Accepted manuscript available online (unedited version)

CWLP: coordinated warp scheduling and locality-protected cache allocation on GPUs

Author(s): Yang Zhang, Zuo-cheng Xing, Cang Liu, Chuan Tang
Affiliation(s): National Laboratory for Parallel and Distributed Processing, National University of Defense Technology, Changsha 410073, China
Corresponding email(s): zhangyang@nudt.edu.cn
Key Words: Locality, Graphics processing unit (GPU), Cache allocation, Warp scheduling


Yang Zhang, Zuo-cheng Xing, Cang Liu, Chuan Tang. CWLP: coordinated warp scheduling and locality-protected cache allocation on GPUs[J]. Frontiers of Information Technology & Electronic Engineering, in press. https://doi.org/10.1631/FITEE.1700059

@article{title="CWLP: coordinated warp scheduling and locality-protected cache allocation on GPUs",
author="Yang Zhang, Zuo-cheng Xing, Cang Liu, Chuan Tang",
journal="Frontiers of Information Technology & Electronic Engineering",
year="in press",
publisher="Zhejiang University Press & Springer",
doi="https://doi.org/10.1631/FITEE.1700059"
}

%0 Journal Article
%T CWLP: coordinated warp scheduling and locality-protected cache allocation on GPUs
%A Yang Zhang
%A Zuo-cheng Xing
%A Cang Liu
%A Chuan Tang
%J Frontiers of Information Technology & Electronic Engineering
%P 206-220
%@ 2095-9184
%D in press
%I Zhejiang University Press & Springer
%R 10.1631/FITEE.1700059

TY - JOUR
T1 - CWLP: coordinated warp scheduling and locality-protected cache allocation on GPUs
A1 - Yang Zhang
A1 - Zuo-cheng Xing
A1 - Cang Liu
A1 - Chuan Tang
JO - Frontiers of Information Technology & Electronic Engineering
SP - 206
EP - 220
SN - 2095-9184
Y1 - in press
PB - Zhejiang University Press & Springer
DO - 10.1631/FITEE.1700059
ER -


Abstract: As we approach the exascale era in supercomputing, designing a balanced computer system with powerful computing ability and low power requirements has become increasingly important. The graphics processing unit (GPU) is an accelerator widely used in most recent supercomputers. It runs a large number of threads to hide long latency with high energy efficiency. In contrast to their powerful computing ability, GPUs have only a few megabytes of fast on-chip memory storage per streaming multiprocessor (SM). The GPU cache is inefficient because of a mismatch between the throughput-oriented execution model and the cache hierarchy design. At the same time, current GPUs fail to handle burst-mode long-access latency because of their poor warp scheduling method. Thus, the benefits of the GPU's high computing ability are reduced dramatically by poor cache management and warp scheduling methods, which limit system performance and energy efficiency. In this paper, we put forward a coordinated warp scheduling and locality-protected (CWLP) cache allocation scheme to make full use of data locality and hide latency. We first present a locality-protected cache allocation method based on the instruction program counter (LPC) to improve cache performance. Specifically, we use a PC-based locality detector to collect the reuse information of each cache line and employ a prioritised cache allocation unit (PCAU), which coordinates the data reuse information with the time-stamp information to evict the lines with the least reuse possibility. Moreover, the warp scheduler uses the locality information to build an intelligent warp reordering scheme that captures locality and hides latency. Simulation results show that CWLP provides a speedup of up to 19.8% and an average improvement of 8.8% over the baseline methods.
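To make the allocation step concrete, the following is a minimal Python sketch of a reuse-aware victim choice. It is not the paper's PCAU design: the names `CacheLine`, `ReuseTable`, and `choose_victim`, the saturating-counter update rule, the counter limit, and the tie-breaking order are all assumptions made for illustration. It only shows one way in which PC-indexed reuse information and time stamps could be combined so that the line least likely to be reused is evicted first.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class CacheLine:
    tag: int          # address tag of the cached block
    pc: int           # PC of the load instruction that filled the line
    last_access: int  # time stamp of the most recent access


class ReuseTable:
    """Hypothetical PC-indexed saturating counters standing in for a locality detector."""

    def __init__(self, max_count: int = 3) -> None:
        self.max_count = max_count            # assumed counter limit
        self.counters: Dict[int, int] = {}    # PC -> predicted reuse level

    def record_hit(self, pc: int) -> None:
        # A hit on a line filled by `pc` suggests that this PC's data is reused.
        self.counters[pc] = min(self.max_count, self.counters.get(pc, 0) + 1)

    def record_dead_eviction(self, pc: int) -> None:
        # A line evicted without any hit suggests little reuse for this PC.
        self.counters[pc] = max(0, self.counters.get(pc, 0) - 1)

    def reuse(self, pc: int) -> int:
        # PCs that were never observed are treated as having no known reuse.
        return self.counters.get(pc, 0)


def choose_victim(cache_set: List[CacheLine], table: ReuseTable) -> CacheLine:
    # Lowest predicted reuse first; among equals, the oldest time stamp (LRU order).
    return min(cache_set, key=lambda line: (table.reuse(line.pc), line.last_access))


# Example: loads from PC 0x40 have shown reuse, loads from PC 0x80 have not.
table = ReuseTable()
table.record_hit(0x40)
table.record_hit(0x40)
lines = [
    CacheLine(tag=1, pc=0x40, last_access=5),
    CacheLine(tag=2, pc=0x80, last_access=9),
    CacheLine(tag=3, pc=0x80, last_access=7),
]
victim = choose_victim(lines, table)  # selects tag 3: no predicted reuse, older than tag 2
```

In this sketch the line with tag 1 survives even though it has the oldest time stamp, which is the behaviour a pure LRU policy would not give; how the real scheme weights reuse against recency is not specified in the abstract.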

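CWLP: a coordinated warp scheduling and locality-protected cache allocation strategy on GPUs

Summary: As we approach the era of exascale supercomputers, a balanced computer system with powerful computing ability and low energy consumption is becoming increasingly important. GPUs are accelerators widely used in recently deployed supercomputers. They use massive multithreading to hide long memory-access latency while offering high energy efficiency. Relative to their powerful computing ability, GPUs have only a few megabytes of on-chip resources per streaming multiprocessor. The throughput-oriented execution model does not match the cache hierarchy design, so GPU caches operate inefficiently. Because on-chip memory is severely limited, poor cache performance sharply reduces the GPU's computing ability and limits system performance and energy efficiency. We propose a coordinated warp scheduling and locality-protected cache allocation strategy (CWLP) to make full use of data locality and hide latency. First, we design a locality protection method based on the instruction PC (LPC) to improve GPU performance. A PC-based collector gathers the reuse information of each cache block. After the dynamic reuse information of the cache blocks has been obtained, an intelligent cache allocation unit (PCAU) combines the reuse information with the LRU (least recently used) replacement policy to find and evict the cache block with the least locality. In addition, the warp scheduler uses the locality information to implement an intelligent reordering strategy that captures locality and hides latency. Experimental results show that CWLP provides a speedup of up to 19.8% and an average performance improvement of 8.8% over the baseline strategies.

Key words: Locality; GPU; Cache allocation; Warp scheduling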


