CLC number: TN912.3
On-line Access: 2020-11-13
Received: 2020-01-13
Revision Accepted: 2020-06-21
Crosschecked: 2020-09-08
Jing-jing Chen, Qi-rong Mao, You-cai Qin, Shuang-qing Qian, Zhi-shen Zheng. Latent source-specific generative factor learning for monaural speech separation using weighted-factor autoencoder[J]. Frontiers of Information Technology & Electronic Engineering, in press. https://doi.org/10.1631/FITEE.2000019
Latent source-specific generative factor learning for monaural speech separation using weighted-factor autoencoder

Jing-jing Chen1, Qi-rong Mao1,2, You-cai Qin1, Shuang-qing Qian1, Zhi-shen Zheng1
1 School of Computer Science and Communication Engineering, Jiangsu University, Zhenjiang 212013, China
2 Jiangsu Provincial Key Laboratory of Industrial Cyberspace Security Technology, Zhenjiang 212013, China

Abstract: Monaural speech separation has recently made much progress through a series of autoencoder-based deep learning architectures, in which an encoder compresses the input signal into intermediate features and a decoder then reconstructs the specific audio source of interest from those features. However, these methods can neither learn generative factors of the original input for monaural speech separation nor construct all the audio sources in the mixed speech. We propose a novel weighted-factor autoencoder (WFAE) model, which introduces a regularization loss into the objective function to isolate the target source while excluding the other sources. By incorporating a latent attention mechanism and a supervised source constructor in the separation layer, WFAE learns source-specific generative factors and a set of discriminative features for each source, leading to improved monaural speech separation performance. Experiments on benchmark datasets show that the proposed method outperforms existing approaches. In terms of three important metrics, WFAE achieves considerable success on the relatively more challenging task of speaker-independent monaural speech separation.

Key words:
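The paper's exact WFAE architecture is not reproduced on this page; purely as an illustration of the idea described in the abstract — latent factors gated by a per-source attention mechanism, with a regularization term that penalizes factor overlap between sources — the following NumPy sketch runs a forward pass with hypothetical dimensions and randomly initialized weights. All names (`W_enc`, `W_att`, `W_dec`) and shapes are assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical dimensions: F frequency bins, D latent factors, S sources.
F, D, S = 129, 40, 2

# Randomly initialized parameters stand in for trained weights.
W_enc = rng.standard_normal((F, D)) * 0.1   # encoder projection
W_att = rng.standard_normal((S, D)) * 0.1   # per-source attention parameters
W_dec = rng.standard_normal((D, F)) * 0.1   # shared decoder projection

def wfae_forward(mix):
    """mix: (T, F) magnitude spectrogram of the mixture."""
    z = np.tanh(mix @ W_enc)                   # (T, D) latent generative factors
    scores = np.einsum('td,sd->tsd', z, W_att)
    A = softmax(scores, axis=-1)               # (T, S, D) per-source factor attention
    z_src = A * z[:, None, :]                  # source-specific weighted factors
    est = np.maximum(z_src @ W_dec, 0.0)       # (T, S, F) estimated source spectrograms
    return est, A

def losses(mix, est, A):
    # Reconstruction: the estimated sources should sum back to the mixture.
    recon = np.mean((est.sum(axis=1) - mix) ** 2)
    # Regularization sketch: penalize attention overlap, so each source
    # claims its own subset of latent factors.
    overlap = np.mean(A[:, 0, :] * A[:, 1, :])
    return recon, overlap

T = 5
mix = np.abs(rng.standard_normal((T, F)))
est, A = wfae_forward(mix)
recon, overlap = losses(mix, est, A)
print(est.shape)  # (5, 2, 129)
```

The separation-layer idea reduces to the attention tensor `A`: minimizing the overlap term pushes the two sources toward disjoint sets of generative factors, which is one plausible reading of the "source-specific" constraint in the abstract.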