CLC number: TP319
On-line Access: 2020-09-09
Received: 2019-05-24
Revision Accepted: 2020-05-31
Crosschecked: 2020-08-10
Ran Zheng, Yuan-dong Liu, Hai Jin, 2020. Optimizing non-coalesced memory access for irregular applications with GPU computing. Frontiers of Information Technology & Electronic Engineering, in press. https://doi.org/10.1631/FITEE.1900262
Title: Optimizing non-coalesced memory access for irregular applications with GPU computing
Affiliations: 1. National Engineering Research Center for Big Data Technology and System, Huazhong University of Science and Technology, Wuhan 430074, China; 2. Services Computing Technology and System Lab, Huazhong University of Science and Technology, Wuhan 430074, China; 3. Cluster and Grid Computing Lab, Huazhong University of Science and Technology, Wuhan 430074, China; 4. School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China
Abstract: General-purpose graphics processing units (GPGPUs) can greatly improve the computational performance of regular applications. However, the irregular memory access patterns found in many applications severely limit the performance advantage of GPUs. In recent years, several studies have proposed solutions for removing static irregular memory accesses; eliminating dynamic irregular memory accesses with software, however, remains a serious challenge. This paper proposes a pure software solution that eliminates dynamic irregular memory accesses, especially indirect memory accesses, without hardware extensions or offline profiling. Data reorganization and index redirection are proposed to reduce the number of memory accesses and thereby improve GPU kernel performance. To make data reorganization efficient, the reorganization work is offloaded to the GPU, which lowers the overhead of reorganizing and transferring data. The reorganization overhead is further hidden by executing the data-reorganization and data-processing kernels in concurrent compute unified device architecture (CUDA) streams. With these optimizations, memory data transfer of the GPU kernels is reduced by 16.7%-50% compared with the cuSPARSE-based benchmarks, and kernel performance on an NVIDIA Tesla P4 GPU improves by 9.64%-34.9%.
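To illustrate the two ideas in the abstract, the sketch below shows a GPU-side data-reorganization (gather) kernel that packs indirectly addressed elements into a contiguous buffer so the processing kernel reads them with coalesced, unit-stride accesses, and uses two CUDA streams plus events so reorganizing chunk k+1 overlaps processing of chunk k. This is a minimal hypothetical example, not the authors' implementation: the kernel names, the element-wise computation, the chunking scheme, and the synthetic index pattern are all assumptions for illustration.

// nvcc -o overlap overlap.cu
#include <cuda_runtime.h>
#include <cstdio>

// Reorganization kernel: gather x[idx[i]] into contiguous xr[i].
// The irregular read happens once here, instead of inside the hot kernel.
__global__ void reorganize(const float* x, const int* idx, float* xr, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        xr[i] = x[idx[i]];
}

// Processing kernel: after reorganization every access is unit-stride
// (coalesced). The computation is a stand-in for the real workload.
__global__ void process(const float* xr, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = 2.0f * xr[i];
}

int main() {
    const int n = 1 << 20, chunks = 4, c = n / chunks;
    float *x, *xr, *y; int *idx;
    cudaMallocManaged(&x,   n * sizeof(float));
    cudaMallocManaged(&xr,  n * sizeof(float));
    cudaMallocManaged(&y,   n * sizeof(float));
    cudaMallocManaged(&idx, n * sizeof(int));
    for (int i = 0; i < n; ++i) {
        x[i] = (float)i;
        idx[i] = (i * 7919) % n;   // synthetic irregular index pattern
    }

    // Separate streams for reorganization and processing; events order
    // each chunk's gather before its processing while letting the gather
    // of the next chunk run concurrently with the current computation.
    cudaStream_t sReorg, sProc;
    cudaStreamCreate(&sReorg);
    cudaStreamCreate(&sProc);
    cudaEvent_t done[chunks];

    const int T = 256, B = (c + T - 1) / T;
    for (int k = 0; k < chunks; ++k) {
        cudaEventCreate(&done[k]);
        reorganize<<<B, T, 0, sReorg>>>(x, idx + k * c, xr + k * c, c);
        cudaEventRecord(done[k], sReorg);
        cudaStreamWaitEvent(sProc, done[k], 0);  // chunk k gathered -> process it
        process<<<B, T, 0, sProc>>>(xr + k * c, y + k * c, c);
    }
    cudaDeviceSynchronize();
    printf("y[0]=%f y[n-1]=%f\n", y[0], y[n - 1]);

    for (int k = 0; k < chunks; ++k) cudaEventDestroy(done[k]);
    cudaStreamDestroy(sReorg);
    cudaStreamDestroy(sProc);
    cudaFree(x); cudaFree(xr); cudaFree(y); cudaFree(idx);
    return 0;
}

Under this chunked schedule, only the first chunk's gather sits on the critical path; the remaining reorganization cost is hidden behind computation, which is the effect the abstract attributes to running the two kernels in concurrent CUDA streams.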