
CLC number: TP39
On-line Access: 2025-10-13
Received: 2024-09-21
Revision Accepted: 2025-01-24
Crosschecked: 2025-10-13
Fahad Bin MUSLIM, Kashif INAYAT, Muhammad Zain SIDDIQI, Safiullah KHAN, Tayyeb MAHMOOD, Ihtesham ul ISLAM. SAPER-AI accelerator: a systolic array-based power-efficient reconfigurable AI accelerator[J]. Frontiers of Information Technology & Electronic Engineering, in press. https://doi.org/10.1631/FITEE.2400867
SAPER-AI accelerator: a systolic array-based power-efficient reconfigurable AI accelerator

Fahad Bin MUSLIM, Kashif INAYAT, Muhammad Zain SIDDIQI, Safiullah KHAN4, Tayyeb MAHMOOD5, Ihtesham ul ISLAM6

1Department of Computer Science and Engineering, GIK Institute of Engineering Sciences and Technology, Topi 23460, Pakistan
2Barcelona Supercomputing Center, 08034 Barcelona, Spain
3Department of Electronics Engineering, Incheon National University, Incheon 22006, South Korea
4Department of Computing and Mathematics, Manchester Metropolitan University, Manchester M15 6BX, UK
5Nextwave, Daejeon 34134, South Korea
6Department of Computer Software Engineering, National University of Sciences and Technology, H-12, Islamabad, Pakistan

Abstract: Deep learning accelerators are essential for meeting the ever-growing computational demands of modern neural networks. Systolic array-based accelerators comprise a two-dimensional grid of processing elements (PEs) that work in concert to accelerate matrix multiplication. The power efficiency of such accelerators is critical, especially in edge-AI scenarios. The SAPER-AI accelerator proposed in this paper is a systolic array accelerator that uses the Unified Power Format (UPF) to define its power intent, requiring almost no modification of the microarchitecture itself. It powers off rows and columns of PEs in a coarse-grained manner, allowing the systolic array microarchitecture to adapt to the varying computational demands of modern deep learning workloads. Our analysis shows best-case power-efficiency improvements of 10% to 25% for the 32×32 and 64×64 systolic array designs, respectively. Moreover, the power-delay product of the larger systolic arrays improves by about 6%. Furthermore, a performance comparison on the MobileNet and ResNet50 models shows that systolic arrays generally perform better on the ResNet50 workload. This is because the more regular convolutions in ResNet50 map more favorably onto systolic arrays, and the performance gap widens as the array size grows.
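The coarse-grained row/column gating described above can be illustrated with a short functional model. The Python sketch below is purely illustrative and is not the authors' UPF/RTL implementation: the class name SystolicArray, the per-PE power figures, and the cycle model are all hypothetical, chosen only to show how powering just the sub-grid that a small workload tile actually uses reduces total power.

```python
# Minimal functional sketch of coarse-grained row/column power gating on a
# systolic array. NOT the authors' UPF/RTL design: names, per-PE power
# figures, and the cycle model are hypothetical, for illustration only.
import numpy as np

PE_ACTIVE_MW = 0.5   # assumed dynamic power of one active PE (mW)
PE_GATED_MW = 0.05   # assumed residual leakage of one gated-off PE (mW)

class SystolicArray:
    def __init__(self, rows: int, cols: int):
        self.rows, self.cols = rows, cols

    def matmul(self, a: np.ndarray, b: np.ndarray):
        """Multiply a (m x k) by b (k x n) on the array, m <= rows, n <= cols.

        PE rows m..rows-1 and columns n..cols-1 receive no work for this
        tile, so the power-intent logic switches them off as whole rows and
        whole columns (coarse-grained gating, no per-PE control).
        """
        m, k = a.shape
        k2, n = b.shape
        assert k == k2 and m <= self.rows and n <= self.cols

        out = a @ b  # functional result: PE (i, j) accumulates sum_k a[i,k]*b[k,j]

        active = m * n                           # powered sub-grid
        gated = self.rows * self.cols - active   # switched-off PEs
        power_mw = active * PE_ACTIVE_MW + gated * PE_GATED_MW
        cycles = m + n + k - 2                   # rough skew-in/accumulate/skew-out count
        return out, power_mw, cycles

# A 16x16 tile on a 32x32 array powers only a quarter of the PEs.
arr = SystolicArray(32, 32)
_, p, c = arr.matmul(np.ones((16, 64)), np.ones((64, 16)))
print(f"~{p:.1f} mW over {c} cycles")
```

With these made-up figures, gating the 768 idle PEs drops tile power from 512 mW to roughly 166 mW. Since the power-delay product is simply average power multiplied by execution time, power saved at a comparable cycle count translates directly into the kind of PDP improvement the abstract reports.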

