CLC number: TP315
On-line Access: 2023-05-06
Received: 2022-08-27
Revision Accepted: 2023-05-06
Crosschecked: 2022-10-19
https://orcid.org/0000-0003-3542-4869
Jianbin FANG, Peng ZHANG, Chun HUANG, Tao TANG, Kai LU, Ruibo WANG, Zheng WANG. Programming bare-metal accelerators with heterogeneous threading models: a case study of Matrix-3000[J]. Frontiers of Information Technology & Electronic Engineering, in press. https://doi.org/10.1631/FITEE.2200359
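Programming bare-metal accelerators with heterogeneous threading models: a case study of Matrix-3000
1 College of Computer Science, National University of Defense Technology, Changsha 410073, China
2 School of Computing, University of Leeds, Leeds LS2 9JT, UK
Abstract: As processor design shifts toward specialized heterogeneous multi-cores to avoid the power wall, software developers find it hard to cope with the complexity of these processor systems. Matrix-3000 is representative of this new class of processors: it has a complex memory hierarchy and processor organization, and it is a high-performance processor designed for next-generation exascale supercomputers. This paper shares our experience of developing a parallel programming model for Matrix-3000, together with its supporting compiler and libraries. To aid software development, we built a software stack for Matrix-3000 from scratch, comprising a low-level programming interface and a high-level OpenCL compiler. The low-level programming model provides native programming support for the bare-metal accelerators of Matrix-3000, while the high-level model lets programmers use the OpenCL parallel programming standard. We describe the design choices behind this software stack in detail and highlight the lessons learned from developing system software that enables efficient programming and unlocks the performance of bare-metal accelerators. Our programming model has been deployed in the production environment of an exascale prototype system.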