CLC number: TP39
On-line Access: 2025-10-13
Received: 2024-09-21
Revision Accepted: 2025-01-24
Crosschecked: 2025-10-13
Fahad Bin MUSLIM, Kashif INAYAT, Muhammad Zain SIDDIQI, Safiullah KHAN, Tayyeb MAHMOOD, Ihtesham ul ISLAM. SAPER-AI accelerator: a systolic array-based power-efficient reconfigurable AI accelerator[J]. Frontiers of Information Technology & Electronic Engineering, 2025, 26(9): 1624-1636.
@article{muslim2025saper,
title="SAPER-AI accelerator: a systolic array-based power-efficient reconfigurable AI accelerator",
author="Fahad Bin MUSLIM and Kashif INAYAT and Muhammad Zain SIDDIQI and Safiullah KHAN and Tayyeb MAHMOOD and Ihtesham ul ISLAM",
journal="Frontiers of Information Technology & Electronic Engineering",
volume="26",
number="9",
pages="1624-1636",
year="2025",
publisher="Zhejiang University Press & Springer",
doi="10.1631/FITEE.2400867"
}
%0 Journal Article
%T SAPER-AI accelerator: a systolic array-based power-efficient reconfigurable AI accelerator
%A Fahad Bin MUSLIM
%A Kashif INAYAT
%A Muhammad Zain SIDDIQI
%A Safiullah KHAN
%A Tayyeb MAHMOOD
%A Ihtesham ul ISLAM
%J Frontiers of Information Technology & Electronic Engineering
%V 26
%N 9
%P 1624-1636
%@ 2095-9184
%D 2025
%I Zhejiang University Press & Springer
%R 10.1631/FITEE.2400867
TY - JOUR
T1 - SAPER-AI accelerator: a systolic array-based power-efficient reconfigurable AI accelerator
A1 - Fahad Bin MUSLIM
A1 - Kashif INAYAT
A1 - Muhammad Zain SIDDIQI
A1 - Safiullah KHAN
A1 - Tayyeb MAHMOOD
A1 - Ihtesham ul ISLAM
JO - Frontiers of Information Technology & Electronic Engineering
VL - 26
IS - 9
SP - 1624
EP - 1636
SN - 2095-9184
Y1 - 2025
PB - Zhejiang University Press & Springer
DO - 10.1631/FITEE.2400867
ER -
Abstract: Deep learning (DL) accelerators are critical for handling the growing computational demands of modern neural networks. Systolic array (SA)-based accelerators consist of a 2D mesh of processing elements (PEs) working cooperatively to accelerate matrix multiplication. The power efficiency of such accelerators is of primary importance, especially in the edge AI regime. This work presents the SAPER-AI accelerator, an SA accelerator whose power intent is specified in a simplified manner via a Unified Power Format (UPF) representation, requiring negligible microarchitectural optimization effort. The proposed accelerator switches off rows and columns of PEs in a coarse-grained manner, yielding an SA microarchitecture that complies with the varying computational requirements of modern DL workloads. Our analysis demonstrates power-efficiency improvements of 10% and 25% for the best-case 32×32 and 64×64 SA designs, respectively. Additionally, the power-delay product (PDP) shows a progressive improvement of around 6% for larger SA sizes. Moreover, a performance comparison between the MobileNet and ResNet50 models indicates generally better SA performance for the ResNet50 workload, because ResNet50 exhibits more regular convolutions that SAs favor; this performance gap widens as the SA size increases.
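To illustrate the coarse-grained row/column switch-off described in the abstract, the following minimal Python sketch models a fixed-size systolic array in which only an active sub-array of PEs performs the matrix multiplication while the remaining rows and columns are treated as power gated. The array dimensions, the power constants, and the function name gated_systolic_matmul are illustrative assumptions for a behavioral model only; they are not the authors' UPF-based RTL implementation.

# Behavioral sketch (assumed names and constants) of coarse-grained
# row/column power gating in a systolic array accelerator.
import numpy as np

SA_ROWS, SA_COLS = 64, 64          # physical array size (illustrative)
PE_ACTIVE_POWER = 1.0              # relative power of an active PE (assumed)
PE_GATED_POWER = 0.05              # residual power of a gated PE (assumed)

def gated_systolic_matmul(A, B, active_rows, active_cols):
    """Multiply A (M x K) by B (K x N) using only an active_rows x active_cols
    sub-array of PEs; the remaining rows/columns are modeled as switched off."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2, "inner dimensions must match"
    C = np.zeros((M, N))
    # Tile the output over the active portion of the array only.
    for i0 in range(0, M, active_rows):
        for j0 in range(0, N, active_cols):
            i1, j1 = min(i0 + active_rows, M), min(j0 + active_cols, N)
            C[i0:i1, j0:j1] = A[i0:i1, :] @ B[:, j0:j1]
    # Coarse power estimate: gated PEs draw only residual power.
    active_pes = active_rows * active_cols
    gated_pes = SA_ROWS * SA_COLS - active_pes
    power = active_pes * PE_ACTIVE_POWER + gated_pes * PE_GATED_POWER
    return C, power

# Example: a small layer needs only a 32 x 32 tile of the 64 x 64 array.
A = np.random.rand(32, 128)
B = np.random.rand(128, 32)
C, est_power = gated_systolic_matmul(A, B, active_rows=32, active_cols=32)
assert np.allclose(C, A @ B)
print(f"estimated relative power: {est_power:.1f} "
      f"vs fully-on array: {SA_ROWS * SA_COLS * PE_ACTIVE_POWER:.1f}")

In this toy model, gating three quarters of the PEs cuts the estimated array power roughly in proportion to the gated fraction; the paper's reported 10% to 25% figures come from UPF-driven power analysis of the actual design, not from this sketch.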