CLC number: TP391.4
On-line Access: 2021-05-17
Received: 2019-12-10
Revision Accepted: 2020-07-12
Crosschecked: 2020-11-18
Duolin Huang, Qirong Mao, Zhongchen Ma, Zhishen Zheng, Sidheswar Routryar, Elias-Nii-Noi Ocquaye. Latent discriminative representation learning for speaker recognition[J]. Frontiers of Information Technology & Electronic Engineering, in press. https://doi.org/10.1631/FITEE.1900690
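Latent discriminative representation learning for speaker recognition
1 School of Computer Science and Communication Engineering, Jiangsu University, Zhenjiang 212013, China
2 Jiangsu Key Laboratory of Security Technology for Industrial Cyberspace, Zhenjiang 212013, China
Abstract: Extracting speaker-specific discriminative representations from speech signals and transforming them into fixed-length vectors is a key step in speaker recognition and verification systems. We propose a latent discriminative representation learning method for speaker recognition. We argue that the learned representations should be not only discriminative but also relevant. Specifically, an additional speaker embedding lookup table is introduced to explore the relevance between different utterances from the same speaker. In addition, a reconstruction constraint is introduced to learn a linear mapping matrix, which makes the representations more discriminative. Experimental results show that the proposed method outperforms state-of-the-art methods on the Apollo dataset used in the Fearless Steps Challenge at INTERSPEECH 2019 and on the TIMIT dataset.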
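The two ingredients named in the abstract, a speaker embedding lookup table and a reconstruction constraint on a linear mapping matrix, can be illustrated with a minimal NumPy sketch. This is not the paper's formulation: the abstract gives no equations, so the squared-error form of both terms, the function name `latent_losses`, and the symbols `X`, `E`, and `W` are all assumptions made for illustration.

```python
import numpy as np

def latent_losses(X, labels, E, W):
    """Illustrative loss terms (assumed forms, not taken from the paper).

    X      : (n, d) utterance-level features
    labels : (n,)   integer speaker ids
    E      : (num_speakers, k) hypothetical speaker embedding lookup table
    W      : (d, k) hypothetical linear mapping matrix
    """
    Z = X @ W  # project features into the latent space

    # Relevance term: pull every utterance toward the table entry of its
    # speaker, so utterances from the same speaker share one target.
    relevance = np.mean(np.sum((Z - E[labels]) ** 2, axis=1))

    # Reconstruction term: map the latent vectors back with W^T and
    # penalise the information lost by the linear mapping.
    reconstruction = np.mean(np.sum((X - Z @ W.T) ** 2, axis=1))

    return relevance, reconstruction


# Toy usage: two speakers, two utterances each, identity mapping.
X = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 2.0], [0.0, 2.0]])
labels = np.array([0, 0, 1, 1])
E = np.array([[1.0, 0.0], [0.0, 2.0]])
W = np.eye(2)
relevance, reconstruction = latent_losses(X, labels, E, W)
print(relevance, reconstruction)
```

In a real system both terms would presumably be weighted and minimised together with a classification loss; the weights and the optimiser are left out because the abstract does not specify them.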