JZUS - Journal of Zhejiang University SCIENCE

Frontiers of Information Technology & Electronic Engineering

Accepted manuscript available online (unedited version)

Cross-layer efforts for energy-efficient computing: towards peta operations per second per watt

Author(s): Xiaobo Sharon Hu, Michael Niemier
Affiliation(s): Department of Computer Science and Engineering, University of Notre Dame, Notre Dame, IN 46556, USA
Corresponding email(s): shu@nd.edu, mniemier@nd.edu
Key Words: Moore's law, Energy-efficient computing, Neural network accelerators, Beyond-CMOS devices


Xiaobo Sharon Hu, Michael Niemier. Cross-layer efforts for energy-efficient computing: towards peta operations per second per watt[J]. Frontiers of Information Technology & Electronic Engineering, in press. https://doi.org/10.1631/FITEE.1800466

@article{title="Cross-layer efforts for energy-efficient computing: towards peta operations per second per watt",
author="Xiaobo Sharon Hu, Michael Niemier",
journal="Frontiers of Information Technology & Electronic Engineering",
year="in press",
publisher="Zhejiang University Press & Springer",
doi="https://doi.org/10.1631/FITEE.1800466"
}

%0 Journal Article
%T Cross-layer efforts for energy-efficient computing: towards peta operations per second per watt
%A Xiaobo Sharon Hu
%A Michael Niemier
%J Frontiers of Information Technology & Electronic Engineering
%P 1209-1223
%@ 2095-9184
%D in press
%I Zhejiang University Press & Springer
%R 10.1631/FITEE.1800466

TY - JOUR
T1 - Cross-layer efforts for energy-efficient computing: towards peta operations per second per watt
A1 - Xiaobo Sharon Hu
A1 - Michael Niemier
JO - Frontiers of Information Technology & Electronic Engineering
SP - 1209
EP - 1223
SN - 2095-9184
Y1 - in press
PB - Zhejiang University Press & Springer
DO - 10.1631/FITEE.1800466
ER -


Abstract: As Moore's law-based device scaling and the accompanying performance scaling trends slow down, there is increasing interest in new technologies and computational models for faster and more energy-efficient information processing. Meanwhile, there is growing evidence that, with respect to traditional Boolean circuits and von Neumann processors, it will be challenging for beyond-CMOS devices to compete with CMOS technology. Exploiting the unique characteristics of emerging devices, especially in the context of alternative circuit and architectural paradigms, has the potential to offer orders-of-magnitude improvements in power, performance, and capability. To take full advantage of beyond-CMOS devices, cross-layer efforts spanning devices, circuits, architectures, and algorithms are indispensable. This study examines energy-efficient neural network accelerators for embedded applications in this context. Several deep neural network accelerator designs based on cross-layer efforts spanning alternative device technologies, circuit styles, and architectures are highlighted. Application-level benchmarking studies are presented. The discussions demonstrate that cross-layer efforts can indeed lead to orders-of-magnitude gains towards achieving extreme-scale energy-efficient processing.

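Chinese summary (translated): Cross-layer design for energy-efficient computing: towards peta operations per second per watt

Summary: As Moore's law-based device scaling and the associated performance growth slow down, new technologies and computational models for fast, energy-efficient information processing are attracting increasing attention. At the same time, growing evidence indicates that, for traditional Boolean circuits and von Neumann processors, beyond-CMOS devices will find it hard to compete with CMOS technology. Exploiting the unique properties of emerging devices, particularly in the context of non-traditional circuits and architectures, has the potential to deliver improvements of tens, hundreds, or thousands of times in power, performance, and capability. To realize the full advantages of beyond-CMOS devices, cross-layer design efforts spanning devices, circuits, architectures, and algorithms are indispensable. Against this background, this paper studies high-performance neural network accelerators for embedded applications, highlights several deep neural network accelerator designs based on cross-layer work spanning non-traditional device technologies, circuit styles, and architectures, and presents application-level benchmarking studies. The discussion shows that cross-layer design efforts can indeed bring orders-of-magnitude improvements towards extreme-scale energy-efficient processing.

Key words: Moore's law; Energy-efficient technology; Neural network accelerators; Beyond-CMOS devices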


