CLC number: TN912.3
On-line Access: 2020-11-13
Received: 2020-01-13
Revision Accepted: 2020-06-21
Crosschecked: 2020-09-08
Jing-jing Chen, Qi-rong Mao, You-cai Qin, Shuang-qing Qian, Zhi-shen Zheng. Latent source-specific generative factor learning for monaural speech separation using weighted-factor autoencoder[J]. Frontiers of Information Technology & Electronic Engineering, in press. https://doi.org/10.1631/FITEE.2000019
Latent source-specific generative factor learning for monaural speech separation using weighted-factor autoencoder

Jing-jing Chen1, Qi-rong Mao1,2, You-cai Qin1, Shuang-qing Qian1, Zhi-shen Zheng1
1 School of Computer Science and Communication Engineering, Jiangsu University, Zhenjiang 212013, China
2 Jiangsu Provincial Key Laboratory of Industrial Cyberspace Security Technology, Zhenjiang 212013, China

Abstract: Monaural speech separation has recently made much progress through a series of autoencoder-based deep learning architectures, in which an encoder compresses the input signal into intermediate features and a decoder then reconstructs the specific audio source of interest from those features. However, these methods can neither learn generative factors of the original input for monaural speech separation nor construct all the audio sources in the mixed speech. We propose a novel weighted-factor autoencoder (WFAE) model, which introduces a regularization loss into the objective function to isolate the target source while excluding the other sources. By incorporating a latent attention mechanism and a supervised source constructor in the separation layer, WFAE learns source-specific generative factors and a set of discriminative features for each source, leading to improved monaural speech separation performance. Experiments on benchmark datasets show that the proposed method outperforms existing approaches. In terms of three important metrics, WFAE achieves considerable success on the relatively more challenging task of speaker-independent monaural speech separation.

Key words:
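The paper's exact WFAE architecture is not reproduced on this page; purely as an illustration of the idea described in the abstract — latent factors gated by a per-source attention mechanism, with a regularization term that penalizes factor overlap between sources — the following NumPy sketch runs a forward pass with hypothetical dimensions and randomly initialized weights. All names (`W_enc`, `W_att`, `W_dec`) and shapes are assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical dimensions: F frequency bins, D latent factors, S sources.
F, D, S = 129, 40, 2

# Randomly initialized parameters stand in for trained weights.
W_enc = rng.standard_normal((F, D)) * 0.1   # encoder projection
W_att = rng.standard_normal((S, D)) * 0.1   # per-source attention parameters
W_dec = rng.standard_normal((D, F)) * 0.1   # shared decoder projection

def wfae_forward(mix):
    """mix: (T, F) magnitude spectrogram of the mixture."""
    z = np.tanh(mix @ W_enc)                   # (T, D) latent generative factors
    scores = np.einsum('td,sd->tsd', z, W_att)
    A = softmax(scores, axis=-1)               # (T, S, D) per-source factor attention
    z_src = A * z[:, None, :]                  # source-specific weighted factors
    est = np.maximum(z_src @ W_dec, 0.0)       # (T, S, F) estimated source spectrograms
    return est, A

def losses(mix, est, A):
    # Reconstruction: the estimated sources should sum back to the mixture.
    recon = np.mean((est.sum(axis=1) - mix) ** 2)
    # Regularization sketch: penalize attention overlap, so each source
    # claims its own subset of latent factors.
    overlap = np.mean(A[:, 0, :] * A[:, 1, :])
    return recon, overlap

T = 5
mix = np.abs(rng.standard_normal((T, F)))
est, A = wfae_forward(mix)
recon, overlap = losses(mix, est, A)
print(est.shape)  # (5, 2, 129)
```

The separation-layer idea reduces to the attention tensor `A`: minimizing the overlap term pushes the two sources toward disjoint sets of generative factors, which is one plausible reading of the "source-specific" constraint in the abstract.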