
CLC number: TP391.4
On-line Access: 2024-08-27
Received: 2023-10-17
Revision Accepted: 2024-05-08
Crosschecked: 2019-05-13
Yan-min Qian, Xu Xiang. Binary neural networks for speech recognition[J]. Frontiers of Information Technology & Electronic Engineering, in press. https://doi.org/10.1631/FITEE.1800469
Chinese title: Binary neural networks for speech recognition (用于语音识别的二值神经网络)
References
[1] Bengio Y, Léonard N, Courville A, 2013. Estimating or propagating gradients through stochastic neurons for conditional computation. https://arxiv.org/abs/1308.3432
[2] Bi MX, Qian YM, Yu K, 2015. Very deep convolutional neural networks for LVCSR. Proc 16th Annual Conf of Int Speech Communication Association, p.3259-3263.
[3] Chen ZH, Zhuang YM, Qian YM, et al., 2017. Phone synchronous speech recognition with CTC lattices. IEEE/ACM Trans Audio Speech Lang Process, 25(1):90-101.
[4] Chen ZH, Luitjens J, Xu HN, et al., 2018a. A GPU-based WFST decoder with exact lattice generation. https://arxiv.org/abs/1804.03243
[5] Chen ZH, Liu Q, Li H, et al., 2018b. On modular training of neural acoustics-to-word model for LVCSR. Proc IEEE Int Conf on Acoustics, Speech, and Signal Processing, p.4754-4758.
[6] Chen ZH, Droppo J, Li JY, et al., 2018c. Progressive joint modeling in unsupervised single-channel overlapped speech recognition. IEEE/ACM Trans Audio Speech Lang Process, 26(1):184-196.
[7] Collobert R, Kavukcuoglu K, Farabet C, 2011. Torch7: a Matlab-like environment for machine learning. BigLearn NIPS Workshop.
[8] Courbariaux M, Hubara I, Soudry D, et al., 2016. Binarized neural networks: training deep neural networks with weights and activations constrained to +1 or -1. https://arxiv.org/abs/1602.02830
[9] Dahl GE, Yu D, Deng L, et al., 2012. Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. IEEE Trans Audio Speech Lang Process, 20(1):30-42.
[10] Denil M, Shakibi B, Dinh L, et al., 2013. Predicting parameters in deep learning. Proc 26th Int Conf on Neural Information Processing Systems, p.2148-2156.
[11] Duchi J, Hazan E, Singer Y, 2011. Adaptive subgradient methods for online learning and stochastic optimization. J Mach Learn Res, 12:2121-2159.
[12] Goto K, van de Geijn RA, 2008. Anatomy of high-performance matrix multiplication. ACM Trans Math Softw, 34(3), Article 12.
[13] Gupta S, Agrawal A, Gopalakrishnan K, et al., 2015. Deep learning with limited numerical precision. Proc 32nd Int Conf on Machine Learning, p.1737-1746.
[14] Hammarlund P, Martinez AJ, Bajwa AA, et al., 2014. Haswell: the fourth-generation Intel core processor. IEEE Micro, 34(2):6-20.
[15] Han S, Pool J, Tran J, et al., 2015. Learning both weights and connections for efficient neural network. Proc 28th Int Conf on Neural Information Processing Systems, p.1135-1143.
[16] Han S, Kang JL, Mao HZ, et al., 2017. ESE: efficient speech recognition engine with sparse LSTM on FPGA. Proc ACM/SIGDA Int Symp on Field-Programmable Gate Arrays, p.75-84.
[17] He TX, Fan YC, Qian YM, et al., 2014. Reshaping deep neural network for fast decoding by node-pruning. Proc IEEE Int Conf on Acoustics, Speech, and Signal Processing, p.245-249.
[18] Hinton G, Deng L, Yu D, et al., 2012. Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups. IEEE Signal Process Mag, 29(6):82-97.
[19] Hinton G, Vinyals O, Dean J, 2015. Distilling the knowledge in a neural network. https://arxiv.org/abs/1503.02531
[20] Hubara I, Courbariaux M, Soudry D, et al., 2016. Quantized neural networks: training neural networks with low precision weights and activations. https://arxiv.org/abs/1609.07061
[21] Ioffe S, Szegedy C, 2015. Batch normalization: accelerating deep network training by reducing internal covariate shift. Proc 32nd Int Conf on Machine Learning, p.448-456.
[22] Jaitly N, Nguyen P, Senior A, et al., 2012. Application of pretrained deep neural networks to large vocabulary speech recognition. Proc 13th Annual Conf of Int Speech Communication Association.
[23] Kingma D, Ba J, 2014. Adam: a method for stochastic optimization. https://arxiv.org/abs/1412.6980
[24] Li JY, Seltzer ML, Wang X, et al., 2017. Large-scale domain adaptation via teacher-student learning. Proc 18th Annual Conf of Int Speech Communication Association, p.2386-2390.
[25] Low TM, Igual FD, Smith TM, et al., 2016. Analytical modeling is enough for high-performance BLIS. ACM Trans Math Softw, 43(2), Article 12.
[26] Lu L, Renals S, 2017. Small-footprint highway deep neural networks for speech recognition. IEEE/ACM Trans Audio Speech Lang Process, 25(7):1502-1511.
[27] Lu L, Guo M, Renals S, 2017. Knowledge distillation for small-footprint highway networks. Proc IEEE Int Conf on Acoustics, Speech and Signal Processing, p.4820-4824.
[28] Mohamed AR, Dahl GE, Hinton GE, 2012. Acoustic modeling using deep belief networks. IEEE Trans Audio Speech Lang Process, 20(1):14-22.
[29] Novikov A, Podoprikhin D, Osokin A, et al., 2015. Tensorizing neural networks. Advances in Neural Information Processing Systems, p.442-450.
[30] Povey D, Ghoshal A, Boulianne G, et al., 2011. The Kaldi speech recognition toolkit. Proc IEEE Workshop on Automatic Speech Recognition and Understanding.
[31] Qian YM, Woodland PC, 2016. Very deep convolutional neural networks for robust speech recognition. Proc IEEE Spoken Language Technology Workshop, p.481-488.
[32] Qian YM, He TX, Deng W, et al., 2015. Automatic model redundancy reduction for fast back-propagation for deep neural networks in speech recognition. Proc Int Joint Conf on Neural Networks, p.1-6.
[33] Qian YM, Bi MX, Tan T, et al., 2016. Very deep convolutional neural networks for noise robust speech recognition. IEEE/ACM Trans Audio Speech Lang Process, 24(12):2263-2276.
[34] Rastegari M, Ordonez V, Redmon J, et al., 2016. XNOR-Net: ImageNet classification using binary convolutional neural networks. Proc 14th European Conf on Computer Vision, p.525-542.
[35] Sainath TN, Mohamed AR, Kingsbury B, et al., 2013. Deep convolutional neural networks for LVCSR. Proc IEEE Int Conf on Acoustics, Speech and Signal Processing, p.8614-8618.
[36] Sak H, Senior A, Beaufays F, 2014. Long short-term memory recurrent neural network architectures for large scale acoustic modeling. Proc 15th Annual Conf of Int Speech Communication Association, p.338-342.
[37] Saon G, Kurata G, Sercu T, et al., 2017. English conversational telephone speech recognition by humans and machines. https://arxiv.org/abs/1703.02136
[38] Sercu T, Puhrsch C, Kingsbury B, et al., 2016. Very deep multilingual convolutional neural networks for LVCSR. Proc IEEE Int Conf on Acoustics, Speech, and Signal Processing, p.4955-4959.
[39] Wang YQ, Li JY, Gong YF, 2015. Small-footprint high-performance deep neural network-based speech recognition using split-VQ. Proc IEEE Int Conf on Acoustics, Speech and Signal Processing, p.4984-4988.
[40] Xiong W, Droppo J, Huang X, et al., 2016. Achieving human parity in conversational speech recognition. https://arxiv.org/abs/1610.05256
[41] Xiong W, Droppo J, Huang X, et al., 2017. The Microsoft 2016 conversational speech recognition system. Proc IEEE Int Conf on Acoustics, Speech, and Signal Processing, p.5255-5259.
[42] Xue J, Li JY, Gong YF, 2013. Restructuring of deep neural network acoustic models with singular value decomposition. Proc 14th Annual Conf of Int Speech Communication Association, p.2365-2369.
[43] Young S, Evermann G, Gales M, et al., 2006. The HTK Book. Cambridge University Engineering Department, Cambridge, UK.
[44] Yu D, Seide F, Li G, et al., 2012. Exploiting sparseness in deep neural networks for large vocabulary speech recognition. Proc IEEE Int Conf on Acoustics, Speech, and Signal Processing, p.4409-4412.
[45] Yu D, Xiong W, Droppo J, et al., 2016. Deep convolutional neural networks with layer-wise context expansion and attention. Proc 17th Annual Conf of Int Speech Communication Association, p.17-21.
[46] Zhou SC, Wu YX, Ni ZK, et al., 2016. DoReFa-Net: training low bitwidth convolutional neural networks with low bitwidth gradients. https://arxiv.org/abs/1606.06160

Journal of Zhejiang University-SCIENCE, 38 Zheda Road, Hangzhou 310027, China
Tel: +86-571-87952783; E-mail: cjzhang@zju.edu.cn
Copyright © 2000-2026 Journal of Zhejiang University-SCIENCE

