CLC number: TP312

On-line Access: 2015-11-04

Received: 2015-01-30

Revision Accepted: 2015-06-30

Crosschecked: 2015-10-19

 ORCID:

Da-fei Huang

http://orcid.org/0000-0001-6617-7608

Frontiers of Information Technology & Electronic Engineering  2015 Vol.16 No.11 P.899-916

http://doi.org/10.1631/FITEE.1500032


Improving performance portability for GPU-specific OpenCL kernels on multi-core/many-core CPUs by analysis-based transformations


Author(s):  Mei Wen, Da-fei Huang, Chang-qing Xun, Dong Chen

Affiliation(s):  School of Computer, National University of Defense Technology, Changsha 410073, China

Corresponding email(s):   meiwen@nudt.edu.cn, huangdafei1012@163.com, xunchangqing@nudt.edu.cn, chendong@nudt.edu.cn

Key Words:  OpenCL, Performance portability, Multi-core/many-core CPU, Analysis-based transformation


Mei Wen, Da-fei Huang, Chang-qing Xun, Dong Chen. Improving performance portability for GPU-specific OpenCL kernels on multi-core/many-core CPUs by analysis-based transformations[J]. Frontiers of Information Technology & Electronic Engineering, 2015, 16(11): 899-916.

@article{title="Improving performance portability for GPU-specific OpenCL kernels on multi-core/many-core CPUs by analysis-based transformations",
author="Mei Wen, Da-fei Huang, Chang-qing Xun, Dong Chen",
journal="Frontiers of Information Technology & Electronic Engineering",
volume="16",
number="11",
pages="899-916",
year="2015",
publisher="Zhejiang University Press & Springer",
doi="10.1631/FITEE.1500032"
}

%0 Journal Article
%T Improving performance portability for GPU-specific OpenCL kernels on multi-core/many-core CPUs by analysis-based transformations
%A Mei Wen
%A Da-fei Huang
%A Chang-qing Xun
%A Dong Chen
%J Frontiers of Information Technology & Electronic Engineering
%V 16
%N 11
%P 899-916
%@ 2095-9184
%D 2015
%I Zhejiang University Press & Springer
%DOI 10.1631/FITEE.1500032

TY - JOUR
T1 - Improving performance portability for GPU-specific OpenCL kernels on multi-core/many-core CPUs by analysis-based transformations
A1 - Mei Wen
A1 - Da-fei Huang
A1 - Chang-qing Xun
A1 - Dong Chen
JO - Frontiers of Information Technology & Electronic Engineering
VL - 16
IS - 11
SP - 899
EP - 916
SN - 2095-9184
Y1 - 2015
PB - Zhejiang University Press & Springer
DO - 10.1631/FITEE.1500032
ER -


Abstract: 
OpenCL is an open heterogeneous programming framework. Although OpenCL programs are functionally portable, they do not provide performance portability, so code transformation often plays an irreplaceable role. When adapting GPU-specific OpenCL kernels to run on multi-core/many-core CPUs, coarsening the thread granularity is necessary and has therefore been used extensively. However, the locality optimizations written into GPU-specific OpenCL code are usually inherited without analysis, which may hurt CPU performance. Typically, using OpenCL’s local memory on multi-core/many-core CPUs may degrade performance rather than improve it, because local-memory arrays no longer match the hardware well and the associated synchronizations are costly. To resolve this dilemma, we analyze the memory access patterns using array-access descriptors derived from GPU-specific kernels. The kernels can then be adapted for CPUs by (1) removing all the unwanted local-memory arrays together with the obsolete barrier statements and (2) optimizing the coalesced kernel code with vectorization and locality re-exploitation. Moreover, we have developed an automated tool chain that transforms GPU-specific OpenCL kernels into a CPU-friendly form, accompanied by a scheduler; together they form a new OpenCL runtime. Experiments show that the automated transformation improves OpenCL kernel performance on a multi-core CPU by an average factor of 3.24. Satisfactory performance improvements are also achieved on Intel’s many-integrated-core coprocessor. The resulting performance on both architectures is better than or comparable with the corresponding OpenMP performance.

In this paper, the authors present a transformation approach for GPU-specific OpenCL kernels targeting multi-/many-core CPUs. In particular, they remove local memory usage (and the related synchronization) when found unnecessary, and introduce post-optimizations taking both vectorization and data locality into account. The experimental evaluation shows that their method leads to good performance compared to Intel’s OpenCL implementation and OpenMP.
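
To make the transformation concrete, here is a minimal sketch in plain C. It is not output of the authors' tool chain. It emulates one work-group of a hypothetical GPU-style kernel that stages input in a local-memory array, synchronizes, and then reads an element staged by another work-item. The first function is what a straightforward work-item coalescing might produce, with the work-item loop split at the barrier. The second is the form left once the staging array and the barrier are removed and the read is redirected to global memory. The kernel itself, the names `WG_SIZE`, `reverse_scale_with_local`, and `reverse_scale_without_local`, and the work-group size are all assumptions made for illustration.

```c
#include <stddef.h>

#define WG_SIZE 8  /* assumed work-group size, for illustration only */

/* Coalesced form that keeps the GPU-style local array.
 * The barrier of the original kernel forces the work-item loop to be
 * split in two: one loop for the code before it, one for the code after. */
void reverse_scale_with_local(const float *restrict in, float *restrict out,
                              size_t group, float k)
{
    float tile[WG_SIZE];                 /* stands in for a __local array */
    size_t base = group * WG_SIZE;

    for (size_t lid = 0; lid < WG_SIZE; ++lid)    /* region before barrier */
        tile[lid] = in[base + lid];
    /* barrier(CLK_LOCAL_MEM_FENCE) sat here in the hypothetical kernel */
    for (size_t lid = 0; lid < WG_SIZE; ++lid)    /* region after barrier */
        out[base + lid] = k * tile[WG_SIZE - 1 - lid];
}

/* Form after the local array and barrier are removed: every read of
 * tile[i] is replaced by the global element that was copied into it.
 * Assumes in and out do not overlap. */
void reverse_scale_without_local(const float *restrict in, float *restrict out,
                                 size_t group, float k)
{
    size_t base = group * WG_SIZE;

    for (size_t lid = 0; lid < WG_SIZE; ++lid)
        out[base + lid] = k * in[base + WG_SIZE - 1 - lid];
}
```

Both functions write the same values to `out`. The second avoids the extra copy and leaves a single loop with no synchronization point, which is typically an easier target for the later vectorization step.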

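Using analysis-based code transformation to improve the performance portability of GPU-specific OpenCL kernels on multi-core/many-core CPUs

Objective: OpenCL kernels designed for GPUs show poor performance portability on CPUs. To address this problem, we design a code transformation method based on the analysis of memory access characteristics to improve performance portability.
Innovation: By analyzing the memory access patterns in OpenCL kernels, we remove unnecessary local-memory arrays and the synchronization statements they entail, and further optimize the code using vectorization and locality re-exploitation, ultimately achieving significant performance improvements.
Method: First, a precise linearized access descriptor is designed for array accesses in OpenCL kernel code (Fig. 2). Then, using this descriptor, GPU-specific OpenCL kernel code is transformed in two steps to improve its performance on CPUs (Fig. 7). The first step is analysis-based work-item coalescing: by analyzing the access descriptors, unnecessary local-memory arrays and the synchronization statements they entail are identified and removed, and work-item coalescing is then performed. The second step is architecture-aware code optimization: the coalesced code is further optimized with vectorization and locality re-exploitation according to the characteristics of the CPU architecture. Finally, the transformation process is integrated into a tool chain which, together with a scheduler, is embedded in an open-source OpenCL runtime system (Fig. 11). Experimental results show that this transformation method can significantly improve the performance of GPU-specific OpenCL kernels on an Intel Sandy Bridge CPU and an Intel Knights Corner coprocessor.
Conclusion: Accurately analyzing the memory access patterns in OpenCL kernel code not only helps determine whether local-memory arrays suit the CPU architecture but can also guide the subsequent code optimization; it is therefore an important step in improving performance portability.

Key words: OpenCL; Performance portability; Multi-core/many-core CPU; Analysis-based transformation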
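
The method summary above relies on a linearized access descriptor, which this page does not define (it refers to Fig. 2 of the paper). As a rough illustration only, the C sketch below assumes a much simpler one-dimensional descriptor, `offset + stride * lid`, for accesses indexed by the local work-item ID. It then checks one possible condition under which a local-memory array carries no data between work-items: each work-item reads exactly the element it wrote. The type `AccessDesc`, the functions `desc_index` and `is_private_to_work_item`, and the decision rule itself are hypothetical and should not be read as the paper's definitions.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical descriptor for an access of the form
 *   array[offset + stride * lid],  0 <= lid < count,
 * where lid is the local work-item ID. */
typedef struct {
    long offset;  /* index touched by work-item 0 */
    long stride;  /* index distance between consecutive work-items */
    long count;   /* number of work-items performing the access */
} AccessDesc;

/* Index touched by work-item lid under descriptor d. */
static long desc_index(AccessDesc d, long lid)
{
    return d.offset + d.stride * lid;
}

/* Returns true if every work-item reads exactly the element it wrote,
 * so the array is not used to exchange data between work-items. */
bool is_private_to_work_item(AccessDesc write, AccessDesc read)
{
    if (write.count != read.count)
        return false;
    for (long lid = 0; lid < write.count; ++lid)
        if (desc_index(write, lid) != desc_index(read, lid))
            return false;
    return true;
}
```

Under these assumptions, a write `tile[lid]` paired with a read `tile[lid]` passes the check, while a read `tile[7 - lid]` does not, because it picks up a value staged by a different work-item and so needs further analysis before the array can be dropped.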

Journal of Zhejiang University-SCIENCE, 38 Zheda Road, Hangzhou 310027, China
Tel: +86-571-87952783; E-mail: cjzhang@zju.edu.cn
Copyright © 2000 - Journal of Zhejiang University-SCIENCE