CLC number: TP3
Crosschecked: 2018-01-26
Jian Cheng, Pei-song Wang, Gang Li, Qing-hao Hu, Han-qing Lu. Recent advances in efficient computation of deep convolutional neural networks[J]. Frontiers of Information Technology & Electronic Engineering, 2018, 19(1): 64-77.
@article{cheng2018recent,
title="Recent advances in efficient computation of deep convolutional neural networks",
author="Jian Cheng and Pei-song Wang and Gang Li and Qing-hao Hu and Han-qing Lu",
journal="Frontiers of Information Technology & Electronic Engineering",
volume="19",
number="1",
pages="64-77",
year="2018",
publisher="Zhejiang University Press & Springer",
doi="10.1631/FITEE.1700789"
}
%0 Journal Article
%T Recent advances in efficient computation of deep convolutional neural networks
%A Jian Cheng
%A Pei-song Wang
%A Gang Li
%A Qing-hao Hu
%A Han-qing Lu
%J Frontiers of Information Technology & Electronic Engineering
%V 19
%N 1
%P 64-77
%@ 2095-9184
%D 2018
%I Zhejiang University Press & Springer
%DOI 10.1631/FITEE.1700789
TY - JOUR
T1 - Recent advances in efficient computation of deep convolutional neural networks
A1 - Jian Cheng
A1 - Pei-song Wang
A1 - Gang Li
A1 - Qing-hao Hu
A1 - Han-qing Lu
JO - Frontiers of Information Technology & Electronic Engineering
VL - 19
IS - 1
SP - 64
EP - 77
SN - 2095-9184
Y1 - 2018
PB - Zhejiang University Press & Springer
DO - 10.1631/FITEE.1700789
ER -
Abstract: Deep neural networks have evolved remarkably over the past few years and are now the fundamental tools of many intelligent systems. At the same time, the computational complexity and resource consumption of these networks continue to increase, which poses a significant challenge to their deployment, especially in real-time applications or on resource-limited devices. Network acceleration has therefore become a hot topic within the deep learning community. For hardware implementation of deep neural networks, a number of accelerators based on field-programmable gate arrays (FPGAs) or application-specific integrated circuits (ASICs) have been proposed in recent years. In this paper, we provide a comprehensive survey of recent advances in network acceleration, compression, and accelerator design from both the algorithm and hardware points of view. Specifically, we provide a thorough analysis of each of the following topics: network pruning, low-rank approximation, network quantization, teacher–student networks, compact network design, and hardware accelerators. Finally, we introduce and discuss a few possible future directions.
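To make two of the surveyed compression techniques concrete, the following is a minimal, self-contained Python/NumPy sketch written for this page (not code from the paper) of magnitude-based weight pruning and simulated uniform fixed-point weight quantization applied to a single convolutional weight tensor. The function names, sparsity level, and bit width are illustrative assumptions only.

import numpy as np

def magnitude_prune(weights, sparsity):
    # Toy magnitude-based pruning: zero the smallest-magnitude entries so that
    # roughly `sparsity` of the tensor becomes zero. Practical pipelines prune
    # iteratively and fine-tune the surviving weights.
    threshold = np.quantile(np.abs(weights), sparsity)
    return weights * (np.abs(weights) > threshold)

def uniform_quantize(weights, num_bits=8):
    # Toy symmetric uniform quantization: map weights to `num_bits`-bit integer
    # levels and back, so the quantization error can be inspected. Real schemes
    # add per-channel scales, activation quantization, and quantization-aware
    # training, as discussed in the survey.
    qmax = 2 ** (num_bits - 1) - 1
    scale = np.max(np.abs(weights)) / qmax
    quantized = np.clip(np.round(weights / scale), -qmax, qmax)
    return quantized * scale

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(size=(64, 3, 3, 3)).astype(np.float32)  # stand-in conv-layer weights
    w_pruned = magnitude_prune(w, sparsity=0.7)
    w_quant = uniform_quantize(w, num_bits=8)
    print("non-zero fraction after pruning:", np.count_nonzero(w_pruned) / w.size)
    print("mean absolute quantization error:", np.abs(w - w_quant).mean())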