
CLC number: TP368.1
On-line Access: 2024-08-27
Received: 2023-10-17
Revision Accepted: 2024-05-08
Crosschecked: 2018-02-14
Cited: 0
Clicked: 8301
Yang Zhang, Zuo-cheng Xing, Cang Liu, Chuan Tang. CWLP: coordinated warp scheduling and locality-protected cache allocation on GPUs[J]. Frontiers of Information Technology & Electronic Engineering,in press.https://doi.org/10.1631/FITEE.1700059 @article{title="CWLP: coordinated warp scheduling and locality-protected cache allocation on GPUs", %0 Journal Article TY - JOUR
CWLP:一种在GPU中协同的线程束调度和局部性保护的高速缓存分配策略关键词组: Darkslateblue:Affiliate; Royal Blue:Author; Turquoise:Article
Reference[1]Bakhoda A, Yuan G, Fung W, et al., 2009. Analyzing CUDA workloads using a detailed GPU simulator. ISPASS IEEE Int Symp on Performance Analysis of Systems and Software, p.163-174. ![]() [2]Che S, Boyer M, Meng J, et al., 2009. Rodinia: a benchmark suite for heterogeneous computing. IISWC IEEE Int Symp on Workload Characterization, p.44-54. ![]() [3]Chen J, Tao X, Yang Z, et al., 2013. Guided region-based GPU scheduling: utilizing multi-thread parallelism to hide memory latency. IEEE 27th Int Symp on Parallel & Distributed Processing, p.441-451. ![]() [4]Chen X, Chang L, Rodrigues C, et al., 2014. Adaptive cache management for energy-efficient GPU computing. Proc 47th Annual IEEE/ACM Int Symp on Microarchitecture, p.343-355. ![]() [5]Dally W, Labonte F, Das A, et al., 2003. Merrimac: supercomputing with streams. Proc ACM/IEEE Conf on Supercomputing, Article 35. ![]() [6]Drew Y, 2008. A closer look at GPUs. Commun ACM, 51(10):50-57. ![]() [7]Fang W, He B, Luo Q, et al., 2011. Mars: accelerating mapreduce with graphics processors. IEEE Trans Parall Distr Syst, 22(4):608-620. ![]() [8]Gebhart M, Johnson D, Tarjan D, et al., 2011. Energy-efficient mechanisms for managing thread context in throughput processors. Proc 38th Annual Int Symp Computer Architecture, p.235-246. ![]() [9]Gupta S, Xiang P, Zhou H, 2013. Analyzing locality of memory references in GPU architectures. Proc ACM SIGPLAN Workshop on Memory Systems Performance and Correctness, Article 12. ![]() [10]Harris M, 2014. Maxwell: the Most Advanced CUDA GPU Ever Made. https://devblogs.nvidia.com/parallelforall/linebreak maxwell-most-advanced-cuda-gpu-ever-made ![]() [11]Jia W, Shaw K, Martonosi M, 2014. MRPB: memory request prioritization for massively parallel processors. IEEE 20th Int Symp on High Performance Computer Architecture, p.272-283. ![]() [12]Jog A, Kayiran O, Nachiappan C, et al., 2013. OWL: cooperative thread array aware scheduling techniques for improving GPGPU performance. ACM SIGARCH Comput Arch News, 41(1):395-406. ![]() [13]Lee M, Song S, Moon J, et al., 2014. Improving GPGPU resource utilization through alternative thread block scheduling. IEEE 20th Int Symp on High Performance Computer Architecture, p.260-271. ![]() [14]Lee S, Arunkumar A, Wu C, 2015. CAWA: coordinated warp scheduling and cache prioritization for critical warp acceleration of GPGPU workloads. Proc 42nd Annual Int Symp on Computer Architecture, p.515-527. ![]() [15]Narasiman V, Shebanow M, Lee CJ, et al., 2011. Improving GPU performance via large warps and two-level warp scheduling. Proc 44th Annual IEEE/ACM Int Symp on Microarchitecture, p.308-317. ![]() [16]Nugteren C, van den Braak G, Corporaal H, et al., 2014. A detailed GPU cache model based on reuse distance theory. IEEE 20th Int Symp on High Performance Computer Architecture, p.37-48. ![]() [17]NVIDIA, 2009. NVIDIA&x2019;s next generation CUDA compute architecture: FERMI. v1.1. http://www.nvidia.com/linebreak content/PDF/fermi_white_papers/NVIDIA_Fermi_linebreak Compute_Architecture_Whitepaper.pdf newpage ![]() [18]NVIDIA, 2015. NVIDIA CUDA C Programming Guide v7.5. http://developer.nvidia.com/nvidia-gpu-computing-linebreak documentation ![]() [19]Rhu M, Sullivan M, Leng J, et al., 2013. A locality-aware memory hierarchy for energy-efficient GPU architectures. Proc 46th Annual IEEE/ACM Int Symp on Microarchitecture, p.86-98. ![]() [20]Rogers T, O&x2019;Connor M, Aamodt T, 2012. Cache-conscious wavefront scheduling. Proc 45th Annual IEEE/ACM Int Symp on Microarchitecture, p.72-83. ![]() [21]Rogers T, O&x2019;Connor M, Aamodt T, 2013. Divergence-aware warp scheduling. Proc 46th Annual IEEE/ACM Int Symp on Microarchitecture, p.99-110. ![]() [22]Sethia A, Jamshidi D, Mahlke S, 2015. Mascar: speeding up GPU warps by reducing memory pitstops. IEEE 21st Int Symp on High Performance Computer Architecture, p.174-185. ![]() [23]Xie X, Liang Y, Sun G, et al., 2013. An efficient compiler framework for cache bypassing on GPUs. IEEE/ACM Int Conf on Computer-Aided Design, p.516-523. ![]() [24]Xie X, Liang Y, Wang Y, et al., 2015. Coordinated static and dynamic cache bypassing for GPUs. IEEE 21st Int Symp on High Performance Computer Architecture, p.76-88. ![]() [25]Xie X, Liang Y, Li X, et al., 2017. Enabling coordinated register allocation and thread-level parallelism optimization for GPUs. IEEE/ACM Int Symp on Microarchitecture, p.395-406. ![]() [26]Zhang Y, Xing Z, Zhou L, et al., 2017. Locality protected dynamic cache allocation scheme on GPUs. IEEE Trustcom/BigDataSE/ISPA, p.1524-1530. ![]() [27]Zheng Z, 2014. Research on Key Technologies for Cache Power and Performance Optimization on Many-Core Heterogeneous Architecture. PhD Thesis, National University of Defense Technology, Changsha, China (in Chinese). ![]() Journal of Zhejiang University-SCIENCE, 38 Zheda Road, Hangzhou
310027, China
Tel: +86-571-87952783; E-mail: cjzhang@zju.edu.cn Copyright © 2000 - 2026 Journal of Zhejiang University-SCIENCE | ||||||||||||||


ORCID:
Open peer comments: Debate/Discuss/Question/Opinion
<1>