CLC number: TP319
On-line Access: 2020-09-09
Received: 2019-05-24
Revision Accepted: 2020-05-31
Crosschecked: 2020-08-10
Ran Zheng, Yuan-dong Liu, Hai Jin, 2020. Optimizing non-coalesced memory access for irregular applications with GPU computing. Frontiers of Information Technology & Electronic Engineering, in press. https://doi.org/10.1631/FITEE.1900262
Title: Optimizing non-coalesced memory access for irregular applications with GPU computing
Affiliations: 1. National Engineering Research Center for Big Data Technology and System, Huazhong University of Science and Technology, Wuhan 430074, China; 2. Services Computing Technology and System Lab, Huazhong University of Science and Technology, Wuhan 430074, China; 3. Cluster and Grid Computing Lab, Huazhong University of Science and Technology, Wuhan 430074, China; 4. School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China
Abstract: General-purpose graphics processing units (GPGPUs) can greatly improve the computational performance of regular applications. However, the irregular memory access patterns found in many applications severely limit the performance advantage of GPUs. In recent years, several studies have proposed solutions for removing static irregular memory accesses; eliminating dynamic irregular memory accesses with software, however, remains a serious challenge. This paper proposes a pure software solution that eliminates dynamic irregular memory accesses, especially indirect memory accesses, without hardware extensions or offline profiling. Data reorganization and index redirection are proposed to reduce the number of memory accesses and thereby improve GPU kernel performance. To make data reorganization efficient, the reorganization work is offloaded to the GPU, which lowers the overhead of reorganizing and transferring data. The reorganization overhead is further hidden by executing the data-reorganization and data-processing kernels in concurrent compute unified device architecture (CUDA) streams. With these optimizations, memory data transfer of the GPU kernels is reduced by 16.7%-50% compared with the cuSPARSE-based benchmarks, and kernel performance on an NVIDIA Tesla P4 GPU improves by 9.64%-34.9%.
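To illustrate the two ideas in the abstract, the sketch below shows a GPU-side data-reorganization (gather) kernel that packs indirectly addressed elements into a contiguous buffer so the processing kernel reads them with coalesced, unit-stride accesses, and uses two CUDA streams plus events so reorganizing chunk k+1 overlaps processing of chunk k. This is a minimal hypothetical example, not the authors' implementation: the kernel names, the element-wise computation, the chunking scheme, and the synthetic index pattern are all assumptions for illustration.

// nvcc -o overlap overlap.cu
#include <cuda_runtime.h>
#include <cstdio>

// Reorganization kernel: gather x[idx[i]] into contiguous xr[i].
// The irregular read happens once here, instead of inside the hot kernel.
__global__ void reorganize(const float* x, const int* idx, float* xr, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        xr[i] = x[idx[i]];
}

// Processing kernel: after reorganization every access is unit-stride
// (coalesced). The computation is a stand-in for the real workload.
__global__ void process(const float* xr, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = 2.0f * xr[i];
}

int main() {
    const int n = 1 << 20, chunks = 4, c = n / chunks;
    float *x, *xr, *y; int *idx;
    cudaMallocManaged(&x,   n * sizeof(float));
    cudaMallocManaged(&xr,  n * sizeof(float));
    cudaMallocManaged(&y,   n * sizeof(float));
    cudaMallocManaged(&idx, n * sizeof(int));
    for (int i = 0; i < n; ++i) {
        x[i] = (float)i;
        idx[i] = (i * 7919) % n;   // synthetic irregular index pattern
    }

    // Separate streams for reorganization and processing; events order
    // each chunk's gather before its processing while letting the gather
    // of the next chunk run concurrently with the current computation.
    cudaStream_t sReorg, sProc;
    cudaStreamCreate(&sReorg);
    cudaStreamCreate(&sProc);
    cudaEvent_t done[chunks];

    const int T = 256, B = (c + T - 1) / T;
    for (int k = 0; k < chunks; ++k) {
        cudaEventCreate(&done[k]);
        reorganize<<<B, T, 0, sReorg>>>(x, idx + k * c, xr + k * c, c);
        cudaEventRecord(done[k], sReorg);
        cudaStreamWaitEvent(sProc, done[k], 0);  // chunk k gathered -> process it
        process<<<B, T, 0, sProc>>>(xr + k * c, y + k * c, c);
    }
    cudaDeviceSynchronize();
    printf("y[0]=%f y[n-1]=%f\n", y[0], y[n - 1]);

    for (int k = 0; k < chunks; ++k) cudaEventDestroy(done[k]);
    cudaStreamDestroy(sReorg);
    cudaStreamDestroy(sProc);
    cudaFree(x); cudaFree(xr); cudaFree(y); cudaFree(idx);
    return 0;
}

Under this chunked schedule, only the first chunk's gather sits on the critical path; the remaining reorganization cost is hidden behind computation, which is the effect the abstract attributes to running the two kernels in concurrent CUDA streams.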