CLC number: TP39
On-line Access: 2025-10-13
Received: 2024-09-21
Revision Accepted: 2025-01-24
Crosschecked: 2025-10-13
Fahad Bin MUSLIM, Kashif INAYAT, Muhammad Zain SIDDIQI, Safiullah KHAN, Tayyeb MAHMOOD, Ihtesham ul ISLAM. SAPER-AI accelerator: a systolic array-based power-efficient reconfigurable AI accelerator[J]. Frontiers of Information Technology & Electronic Engineering, 2025, 26(9): 1624-1636.
@article{muslim2025saper,
title="SAPER-AI accelerator: a systolic array-based power-efficient reconfigurable AI accelerator",
author="Fahad Bin MUSLIM and Kashif INAYAT and Muhammad Zain SIDDIQI and Safiullah KHAN and Tayyeb MAHMOOD and Ihtesham ul ISLAM",
journal="Frontiers of Information Technology & Electronic Engineering",
volume="26",
number="9",
pages="1624-1636",
year="2025",
publisher="Zhejiang University Press & Springer",
doi="10.1631/FITEE.2400867"
}
%0 Journal Article
%T SAPER-AI accelerator: a systolic array-based power-efficient reconfigurable AI accelerator
%A Fahad Bin MUSLIM
%A Kashif INAYAT
%A Muhammad Zain SIDDIQI
%A Safiullah KHAN
%A Tayyeb MAHMOOD
%A Ihtesham ul ISLAM
%J Frontiers of Information Technology & Electronic Engineering
%V 26
%N 9
%P 1624-1636
%@ 2095-9184
%D 2025
%I Zhejiang University Press & Springer
%R 10.1631/FITEE.2400867
TY - JOUR
T1 - SAPER-AI accelerator: a systolic array-based power-efficient reconfigurable AI accelerator
A1 - Fahad Bin MUSLIM
A1 - Kashif INAYAT
A1 - Muhammad Zain SIDDIQI
A1 - Safiullah KHAN
A1 - Tayyeb MAHMOOD
A1 - Ihtesham ul ISLAM
JO - Frontiers of Information Technology & Electronic Engineering
VL - 26
IS - 9
SP - 1624
EP - 1636
SN - 2095-9184
Y1 - 2025
PB - Zhejiang University Press & Springer
DO - 10.1631/FITEE.2400867
ER -
Abstract: Deep learning (DL) accelerators are critical for handling the growing computational demands of modern neural networks. Systolic array (SA)-based accelerators consist of a 2D mesh of processing elements (PEs) working cooperatively to accelerate matrix multiplication. The power efficiency of such accelerators is of primary importance, especially in the edge AI regime. This work presents the SAPER-AI accelerator, an SA accelerator whose power intent is specified in a simplified manner via a Unified Power Format (UPF) representation, requiring negligible microarchitectural optimization effort. The proposed accelerator switches off rows and columns of PEs in a coarse-grained manner, yielding an SA microarchitecture that complies with the varying computational requirements of modern DL workloads. Our analysis demonstrates power-efficiency improvements of 10% and 25% for the best-case 32×32 and 64×64 SA designs, respectively. Additionally, the power-delay product (PDP) shows a progressive improvement of around 6% for larger SA sizes. Moreover, a performance comparison between the MobileNet and ResNet50 models indicates generally better SA performance for the ResNet50 workload, because ResNet50 exhibits more regular convolutions that SAs favor; this performance gap widens as the SA size increases.
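To illustrate the coarse-grained row/column switch-off described in the abstract, the following minimal Python sketch models a fixed-size systolic array in which only an active sub-array of PEs performs the matrix multiplication while the remaining rows and columns are treated as power gated. The array dimensions, the power constants, and the function name gated_systolic_matmul are illustrative assumptions for a behavioral model only; they are not the authors' UPF-based RTL implementation.

# Behavioral sketch (assumed names and constants) of coarse-grained
# row/column power gating in a systolic array accelerator.
import numpy as np

SA_ROWS, SA_COLS = 64, 64          # physical array size (illustrative)
PE_ACTIVE_POWER = 1.0              # relative power of an active PE (assumed)
PE_GATED_POWER = 0.05              # residual power of a gated PE (assumed)

def gated_systolic_matmul(A, B, active_rows, active_cols):
    """Multiply A (M x K) by B (K x N) using only an active_rows x active_cols
    sub-array of PEs; the remaining rows/columns are modeled as switched off."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2, "inner dimensions must match"
    C = np.zeros((M, N))
    # Tile the output over the active portion of the array only.
    for i0 in range(0, M, active_rows):
        for j0 in range(0, N, active_cols):
            i1, j1 = min(i0 + active_rows, M), min(j0 + active_cols, N)
            C[i0:i1, j0:j1] = A[i0:i1, :] @ B[:, j0:j1]
    # Coarse power estimate: gated PEs draw only residual power.
    active_pes = active_rows * active_cols
    gated_pes = SA_ROWS * SA_COLS - active_pes
    power = active_pes * PE_ACTIVE_POWER + gated_pes * PE_GATED_POWER
    return C, power

# Example: a small layer needs only a 32 x 32 tile of the 64 x 64 array.
A = np.random.rand(32, 128)
B = np.random.rand(128, 32)
C, est_power = gated_systolic_matmul(A, B, active_rows=32, active_cols=32)
assert np.allclose(C, A @ B)
print(f"estimated relative power: {est_power:.1f} "
      f"vs fully-on array: {SA_ROWS * SA_COLS * PE_ACTIVE_POWER:.1f}")

In this toy model, gating three quarters of the PEs cuts the estimated array power roughly in proportion to the gated fraction; the paper's reported 10% to 25% figures come from UPF-driven power analysis of the actual design, not from this sketch.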