CLC number: TP312

On-line Access: 2015-11-04

Received: 2015-01-30

Revision Accepted: 2015-06-30

Crosschecked: 2015-10-19

Frontiers of Information Technology & Electronic Engineering  2015 Vol.16 No.11 P.899-916


Improving performance portability for GPU-specific OpenCL kernels on multi-core/many-core CPUs by analysis-based transformations

Author(s):  Mei Wen, Da-fei Huang, Chang-qing Xun, Dong Chen

Affiliation(s):  School of Computer, National University of Defense Technology, Changsha 410073, China

Corresponding email(s):   meiwen@nudt.edu.cn, huangdafei1012@163.com, xunchangqing@nudt.edu.cn, chendong@nudt.edu.cn

Key Words:  OpenCL, Performance portability, Multi-core/many-core CPU, Analysis-based transformation

Mei Wen, Da-fei Huang, Chang-qing Xun, Dong Chen. Improving performance portability for GPU-specific OpenCL kernels on multi-core/many-core CPUs by analysis-based transformations[J]. Frontiers of Information Technology & Electronic Engineering, 2015, 16(11): 899-916.

%0 Journal Article
%T Improving performance portability for GPU-specific OpenCL kernels on multi-core/many-core CPUs by analysis-based transformations
%A Mei Wen
%A Da-fei Huang
%A Chang-qing Xun
%A Dong Chen
%J Frontiers of Information Technology & Electronic Engineering
%V 16
%N 11
%P 899-916
%@ 2095-9184
%D 2015
%I Zhejiang University Press & Springer
%DOI 10.1631/FITEE.1500032

OpenCL is an open heterogeneous programming framework. Although OpenCL programs are functionally portable, they do not provide performance portability, so code transformation often plays an irreplaceable role. When GPU-specific OpenCL kernels are adapted to run on multi-core/many-core CPUs, coarsening the thread granularity is necessary and has therefore been used extensively. However, the locality considerations expressed in GPU-specific OpenCL code are usually inherited without analysis, which can have side effects on CPU performance. Typically, using OpenCL’s local memory on multi-core/many-core CPUs can degrade rather than improve performance, because local-memory arrays no longer match the hardware well and the associated synchronizations are costly. To resolve this dilemma, we actively analyze the memory access patterns using array-access descriptors derived from GPU-specific kernels. The kernels can then be adapted for CPUs by (1) removing all the unwanted local-memory arrays together with the obsolete barrier statements and (2) optimizing the coalesced kernel code with vectorization and locality re-exploitation. Moreover, we have developed an automated tool chain that transforms GPU-specific OpenCL kernels into a CPU-friendly form, accompanied by a scheduler; together they form a new OpenCL runtime. Experiments show that the automated transformation improves OpenCL kernel performance on a multi-core CPU by an average factor of 3.24. Satisfactory performance improvements are also achieved on Intel’s many-integrated-core coprocessor. The resulting performance on both architectures is better than or comparable with the corresponding OpenMP performance.
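To make the local-memory removal concrete, the following is a minimal sketch in plain C that emulates one work-group on a CPU. The kernel (a wrap-around neighbor sum within each work-group), the work-group size `WG`, and the names `group_with_local`, `group_without_local`, and `run_groups` are all hypothetical and chosen only for illustration; they are not taken from the paper or its tool chain. The first function shows what work-item coalescing typically yields when the local array and barrier are inherited as-is (the barrier splits the work-item loop in two); the second shows the same work-group after the staging array and the barrier have been removed.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative GPU-style kernel being emulated (hypothetical example):
 *
 *   __local float tile[WG];
 *   tile[lid] = in[gid];
 *   barrier(CLK_LOCAL_MEM_FENCE);
 *   out[gid] = tile[lid] + tile[(lid + 1) % WG];
 */
#define WG 8 /* assumed work-group size */

/* Coalesced work-group that keeps the local array: the barrier forces
 * the single work-item loop to be split into two loops. */
static void group_with_local(const float *in, float *out, size_t group) {
    float tile[WG];
    size_t base = group * WG;
    for (size_t lid = 0; lid < WG; lid++)   /* code before the barrier */
        tile[lid] = in[base + lid];
    for (size_t lid = 0; lid < WG; lid++)   /* code after the barrier */
        out[base + lid] = tile[lid] + tile[(lid + 1) % WG];
}

/* Same work-group after removing the local array and the barrier:
 * every read of tile[x] is replaced by the global read in[base + x]. */
static void group_without_local(const float *in, float *out, size_t group) {
    size_t base = group * WG;
    for (size_t lid = 0; lid < WG; lid++)
        out[base + lid] = in[base + lid] + in[base + (lid + 1) % WG];
}

typedef void (*group_fn)(const float *, float *, size_t);

/* Runs every work-group sequentially; n must be a multiple of WG. */
static void run_groups(group_fn fn, const float *in, float *out, size_t n) {
    assert(n % WG == 0);
    for (size_t g = 0; g < n / WG; g++)
        fn(in, out, g);
}
```

Calling `run_groups` with either function on the same input should give identical output. The second version has a single loop with unit-stride loads, which is the kind of loop a CPU compiler can more readily vectorize; whether it is actually faster depends on the kernel and the hardware.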

In this paper, the authors present a transformation approach for GPU-specific OpenCL kernels targeting multi-/many-core CPUs. In particular, they remove local memory usage (and the related synchronization) when found unnecessary, and introduce post-optimizations taking both vectorization and data locality into account. The experimental evaluation shows that their method leads to good performance compared to Intel’s OpenCL implementation and OpenMP.

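Improving the performance portability of GPU-specific OpenCL kernels on multi-core/many-core CPUs using analysis-based code transformations

Objective: OpenCL kernels designed for GPUs show poor performance portability on CPUs. To address this problem, we design a code transformation method based on memory access pattern analysis that improves performance portability.
Innovation: By analyzing the memory access patterns in OpenCL kernels, the method removes unnecessary local-memory arrays and the synchronization statements they entail, then further optimizes the code with vectorization and locality re-exploitation, ultimately achieving significant performance gains.
Methods: First, a precise linearized access descriptor is designed for the array accesses in OpenCL kernel code (Fig. 2). Then, using this descriptor, GPU-specific OpenCL kernel code is transformed in two steps to improve its performance on CPUs (Fig. 7). The first step is analysis-based work-item coalescing: the access descriptors are analyzed to find and remove unnecessary local-memory arrays and the synchronization statements they entail, after which work-item coalescing is performed. The second step is architecture-specific code optimization: the coalesced code is further optimized with vectorization and locality re-exploitation to suit the characteristics of the CPU architecture. Finally, the transformation process is integrated into a tool chain which, together with a scheduler, is embedded in an open-source OpenCL runtime system (Fig. 11). Experimental results show that this transformation method significantly improves the performance of GPU-specific OpenCL kernels on CPUs of the Intel Sandy Bridge architecture and coprocessors of the Intel Knights Corner architecture.
Conclusions: Accurately analyzing the memory access patterns in OpenCL kernel code not only helps determine whether local-memory arrays suit the CPU architecture, but also guides the subsequent code optimization. It is therefore an important step toward improving performance portability.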
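The summary above relies on a linearized access descriptor, whose exact form is not given on this page. As a rough illustration only, the sketch below assumes a much simpler, hypothetical descriptor: a 1-D affine index `offset + stride * lid` over the local work-item id. Under the further assumption that a local array is filled by `local[lid] = global[fill(lid)]`, a later read `local[read(lid)]` can be rewritten as `global[fill(read(lid))]`, provided the read stays inside the elements the work-group wrote. The type `AffineIndex` and the functions `affine_eval`, `affine_compose`, and `read_is_covered` are invented names, not the paper's.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical 1-D affine index over the local id: offset + stride * lid. */
typedef struct {
    long offset;
    long stride;
} AffineIndex;

static long affine_eval(AffineIndex a, long lid) {
    return a.offset + a.stride * lid;
}

/* Returns the index x -> outer(inner(x)).
 * With outer = fill and inner = read, this is the global index that
 * replaces the local-memory read once the local array is removed. */
static AffineIndex affine_compose(AffineIndex outer, AffineIndex inner) {
    AffineIndex r;
    r.offset = outer.offset + outer.stride * inner.offset;
    r.stride = outer.stride * inner.stride;
    return r;
}

/* True if read(lid) stays within [0, wg_size) for every lid in
 * [0, wg_size), i.e. every element read was written by some work-item.
 * An affine function is monotonic, so checking both endpoints suffices. */
static bool read_is_covered(AffineIndex read, long wg_size) {
    assert(wg_size > 0);
    long first = affine_eval(read, 0);
    long last = affine_eval(read, wg_size - 1);
    long lo = first < last ? first : last;
    long hi = first < last ? last : first;
    return lo >= 0 && hi < wg_size;
}
```

For example, with a work-group of size 8 whose global base is 64, `fill = {64, 1}` and a reversed read `read = {7, -1}` compose to the global index `71 - lid`. A shifted read such as `{1, 1}` is rejected because its last element falls outside the work-group's tile.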

Journal of Zhejiang University-SCIENCE, 38 Zheda Road, Hangzhou 310027, China
Tel: +86-571-87952783; E-mail: cjzhang@zju.edu.cn
Copyright © 2000 - Journal of Zhejiang University-SCIENCE