CLC number: TP391.9
On-line Access: 2023-07-24
Received: 2022-10-11
Revision Accepted: 2023-07-24
Crosschecked: 2023-01-04
Juan FANG, Sheng LIN, Huijing YANG, Yixiang XU, Xing SU. A perceptual and predictive batch-processing memory scheduling strategy for a CPU-GPU heterogeneous system[J]. Frontiers of Information Technology & Electronic Engineering, 2023, 24(7): 994-1006.
@article{FITEE.2200449,
title="A perceptual and predictive batch-processing memory scheduling strategy for a CPU-GPU heterogeneous system",
author="Juan FANG and Sheng LIN and Huijing YANG and Yixiang XU and Xing SU",
journal="Frontiers of Information Technology & Electronic Engineering",
volume="24",
number="7",
pages="994-1006",
year="2023",
publisher="Zhejiang University Press & Springer",
doi="10.1631/FITEE.2200449"
}
%0 Journal Article
%T A perceptual and predictive batch-processing memory scheduling strategy for a CPU-GPU heterogeneous system
%A Juan FANG
%A Sheng LIN
%A Huijing YANG
%A Yixiang XU
%A Xing SU
%J Frontiers of Information Technology & Electronic Engineering
%V 24
%N 7
%P 994-1006
%@ 2095-9184
%D 2023
%I Zhejiang University Press & Springer
%R 10.1631/FITEE.2200449
TY  - JOUR
T1  - A perceptual and predictive batch-processing memory scheduling strategy for a CPU-GPU heterogeneous system
A1  - Juan FANG
A1  - Sheng LIN
A1  - Huijing YANG
A1  - Yixiang XU
A1  - Xing SU
JO  - Frontiers of Information Technology & Electronic Engineering
VL  - 24
IS  - 7
SP  - 994
EP  - 1006
SN  - 2095-9184
Y1  - 2023
PB  - Zhejiang University Press & Springer
DO  - 10.1631/FITEE.2200449
ER  -
Abstract: When multiple central processing unit (CPU) cores and integrated graphics processing units (GPUs) share off-chip main memory, CPU and GPU applications compete for this critical memory resource. The resulting contention degrades overall system performance. We describe the competition for shared-memory resources in a CPU-GPU heterogeneous multi-core architecture and propose a shared-memory request scheduling strategy based on perceptual and predictive batch processing. By sensing the state of the CPU and GPU memory requests in the request buffer, the proposed strategy estimates the GPU's latency tolerance and reduces mutual interference between the CPU and GPU by processing CPU or GPU memory requests in batches. According to the simulation results, the scheduling strategy improves CPU performance by 8.53% and reduces mutual interference by 10.38% with low hardware complexity.
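The batching idea in the abstract can be illustrated with a toy model. The sketch below is not the authors' algorithm: the `Request` type, the `gpu_is_latency_tolerant` heuristic, the `threshold` and `batch_size` parameters, and the rule "serve CPU requests first while the GPU is judged tolerant" are all assumptions made for illustration. It shows only the general shape of sensing the request buffer and then draining requests from a single source as one batch.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    source: str  # "cpu" or "gpu" (hypothetical tag on each buffered request)
    addr: int


def gpu_is_latency_tolerant(buffer, threshold=0.5):
    """Assumed heuristic: the GPU counts as latency tolerant while its share
    of the buffered requests stays below `threshold`."""
    if not buffer:
        return True
    gpu_share = sum(r.source == "gpu" for r in buffer) / len(buffer)
    return gpu_share < threshold


def next_batch(buffer, batch_size=4, threshold=0.5):
    """Remove and return up to `batch_size` requests from a single source.

    CPU requests are preferred while the GPU is judged tolerant; otherwise a
    GPU batch is issued. If the preferred source has nothing pending, the
    other source is served instead. Arrival order is preserved.
    """
    if not buffer:
        return []
    preferred = "cpu" if gpu_is_latency_tolerant(buffer, threshold) else "gpu"
    if not any(r.source == preferred for r in buffer):
        preferred = "gpu" if preferred == "cpu" else "cpu"
    batch, kept = [], deque()
    while buffer:
        req = buffer.popleft()
        if req.source == preferred and len(batch) < batch_size:
            batch.append(req)
        else:
            kept.append(req)
    buffer.extend(kept)  # requests not issued stay queued in arrival order
    return batch
```

Calling `next_batch` repeatedly on a `deque` of `Request` objects drains the buffer one same-source batch at a time. A real memory controller would also account for DRAM bank and row state, which this sketch ignores.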