CLC number: TN912.3

Received: 2009-02-12

Revision Accepted: 2009-06-25

Crosschecked: 2009-12-08

Journal of Zhejiang University SCIENCE C 2010 Vol.11 No.3 P.160-174

http://doi.org/10.1631/jzus.C0910087


Evaluating single-channel speech separation performance in transform-domain


Author(s):  Pejman MOWLAEE, Abolghasem SAYADIYAN, Hamid SHEIKHZADEH

Affiliation(s):  Department of Electronic Engineering, Amirkabir University of Technology, Tehran 15875-4413, Iran

Corresponding email(s):   pmowlaee@ieee.org, {eeas335, hsheikh}@aut.ac.ir

Key Words:  Single-channel separation (SCS), Magnitude spectrum, Vector quantization (VQ), Subband perceptually weighted transformation (SPWT), Spectral distortion (SD)


Pejman MOWLAEE, Abolghasem SAYADIYAN, Hamid SHEIKHZADEH. Evaluating single-channel speech separation performance in transform-domain[J]. Journal of Zhejiang University Science C, 2010, 11(3): 160-174.

@article{title="Evaluating single-channel speech separation performance in transform-domain",
author="Pejman MOWLAEE, Abolghasem SAYADIYAN, Hamid SHEIKHZADEH",
journal="Journal of Zhejiang University Science C",
volume="11",
number="3",
pages="160-174",
year="2010",
publisher="Zhejiang University Press & Springer",
doi="10.1631/jzus.C0910087"
}

%0 Journal Article
%T Evaluating single-channel speech separation performance in transform-domain
%A Pejman MOWLAEE
%A Abolghasem SAYADIYAN
%A Hamid SHEIKHZADEH
%J Journal of Zhejiang University SCIENCE C
%V 11
%N 3
%P 160-174
%@ 1869-1951
%D 2010
%I Zhejiang University Press & Springer
%DOI 10.1631/jzus.C0910087

TY - JOUR
T1 - Evaluating single-channel speech separation performance in transform-domain
A1 - Pejman MOWLAEE
A1 - Abolghasem SAYADIYAN
A1 - Hamid SHEIKHZADEH
JO - Journal of Zhejiang University Science C
VL - 11
IS - 3
SP - 160
EP - 174
SN - 1869-1951
Y1 - 2010
PB - Zhejiang University Press & Springer
DO - 10.1631/jzus.C0910087
ER -


Abstract:
Single-channel separation (SCS) is a challenging scenario where the objective is to segregate speaker signals from their mixture with high accuracy. In this research, a novel framework called subband perceptually weighted transformation (SPWT) is developed to offer a perceptually relevant feature to replace the commonly used magnitude of the short-time Fourier transform (STFT). The main objectives of the proposed SPWT are to lower the spectral distortion (SD) and to improve the ideal separation quality. The performance of the SPWT is compared to that obtained using the mixmax and Wiener filter methods. A comprehensive statistical analysis is conducted to compare the quantization performance and the ideal separation quality of the SPWT with those of two other features, the log-spectrum and the magnitude spectrum. Our evaluations show that the SPWT provides lower SD values and a more compact distribution of SD, leading to more acceptable subjective separation quality as evaluated using the mean opinion score.
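
As a rough illustration of the quantity the abstract refers to, the sketch below computes a per-frame spectral distortion in dB between a reference magnitude spectrum and its vector-quantized version. It assumes the common root-mean-square log-spectral difference as the SD definition, and a plain nearest-neighbour VQ lookup with squared Euclidean distance; this page does not give the paper's exact SD formula or the SPWT itself, so the function names, the `floor` parameter, and the toy numbers are all hypothetical.

```python
import math


def spectral_distortion_db(ref_mag, est_mag, floor=1e-12):
    """RMS difference between two log-magnitude spectra, in dB (assumed SD definition)."""
    if len(ref_mag) != len(est_mag) or len(ref_mag) == 0:
        raise ValueError("frames must be non-empty and of equal length")
    total = 0.0
    for r, e in zip(ref_mag, est_mag):
        # Floor the magnitudes so that log10 never sees zero.
        diff = 20.0 * math.log10(max(r, floor)) - 20.0 * math.log10(max(e, floor))
        total += diff * diff
    return math.sqrt(total / len(ref_mag))


def nearest_codeword(frame, codebook):
    """Plain VQ lookup: the codeword with the smallest squared Euclidean distance."""
    return min(codebook, key=lambda c: sum((x - y) ** 2 for x, y in zip(frame, c)))


if __name__ == "__main__":
    # Toy two-bin "spectra"; a real codebook would be trained on speech frames.
    codebook = [[0.5, 0.5], [1.0, 2.0], [4.0, 1.0]]
    frame = [1.1, 1.8]
    quantized = nearest_codeword(frame, codebook)
    print(quantized, spectral_distortion_db(frame, quantized))
```

Averaging this per-frame value over many frames, and inspecting its spread, would give the kind of SD statistics the abstract compares across features.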

Open peer comments: Debate/Discuss/Question/Opinion

A. Barari@ Department of Civil Engineering Aalborg University, Denmark Editor-in-chief of Int. J. of Int. Commu. in Civil Eng. (ICCE) Editor-in-chief of Int. J. of Math. and Com. Editor of Int. J. of Res. and Rev. in Appl. Sciences http://www.arpapress.com/ijrr<ab@civil.aau.dk>

2010-02-24 18:12:14

This paper presents a subband transformation as an alternative to the STFT features previously used in the single-channel speech separation problem. Through experiments and by analyzing the upper-bound performance of different features, the paper shows that the proposed subband transformation improves both the separation performance and the quantization behaviour. The proposed transform-based method is compared to the well-known mask-based and STFT-based methods, and it is demonstrated to achieve a higher perceived speech quality in the separated signals than the other benchmark methods for single-channel speech separation.

Reihaneh Lavafi@PhD student of North Dakota State University & Editor for International Journal of Digital Multimedia Broadcasting (Hindawi Publishing Corporation)<reihaneh.lavafi@ndsu.edu>

2010-02-24 06:48:29

In this paper, a novel approach is proposed to solve single-channel speech separation. To this end, the predominantly used STFT features are replaced by transformation-based features. Through experiments, it is shown that using these features leads to better speech separation performance and better perceived speech quality. Listening experiments also show that the proposed approach improves on previous mask-based methods (both binary mask and Wiener filtering) and STFT-based methods. I recommend reading this paper, since the introduction and literature review are very comprehensive and the simulation results are very insightful. Upper-bound performance for separation is also studied.

Alireza @PhD student at Aalborg university<alr@iet.aau.dk>

2010-02-22 08:15:54

I highly recommend this paper for researchers in the field of single-channel speech separation. Since this paper presents a comprehensive literature review in its introduction, it is strongly recommended for those new to the field and is very helpful in this respect. It also presents new results confirming that the proposed method outperforms previous methods.

Mehdi Hosseini@M.Sc student at electrical engineering, Amirkabir University of Technology<sm.hosseini62@gmail.com>

2010-02-21 22:03:06

This paper proposes a novel approach to the open problem of single-channel speech separation. To this end, the authors introduce features defined in perceptual subbands to replace the predominantly used STFT features of previous papers. The paper derives the subband solution theoretically. The authors confirm that the proposed method outperforms the alternatives through PESQ scores and informal listening tests. The paper also presents the spectral distortion results of the STFT and the proposed SPWT approach, showing the upper bounds of the speech separation problem.

Emin Devrim fidan@Master Student on Youth Policy in Dumlupinar University<eminfidan@gmail.com>

2010-02-20 22:42:12

This paper presents a novel approach to an open problem, single-channel speech separation. The paper gives comprehensive and insightful information in its introduction. The proposed subband transformation is derived theoretically, and the improvement offered by the proposed method is confirmed by evaluating its performance in terms of listening tests and PESQ scores. Spectral distortion is also used as an objective measure to show the effectiveness of the proposed approach in terms of quantization behavior compared to the predominantly used STFT feature vectors in the separation problem.

A.Kimiaeifar@Editor of international Journal of Research and Reviews in Applied Sciences Editor of International Journal of Mathematics & Computation Department of Mechanical Engineering, Aalborg University, Pontoppidanstraede 101, DK-9220 Aalborg East, Denmark<a.kimiaeifar@gmail.com>

2010-02-20 22:13:39

The paper presents a novel approach to single-channel speech separation, which has been an open problem in signal and speech processing for decades.
By introducing a subband transformation, the paper shows improved results compared to the predominantly used STFT-based and mask-based methods for single-channel speech separation. The paper is based on firm mathematical derivations and on comprehensive simulation and experimental results that confirm the proposed method outperforms the alternatives. More specifically, the authors provide a webpage with the wave files for assessing the perceived speech quality of the separated signals and for comparing the output files of the proposed method with those obtained by other separation methods. The listening results along with the PESQ scores show that the proposed method outperforms the other benchmarks. The paper also presents the upper-bound performance obtained by the quantizer used. More specifically, the quantization performance for the STFT and the proposed SPWT is evaluated for different codebook sizes and other parameter settings.
All in all, I really enjoyed the introduction, which was a very insightful literature review for me on single-channel speech separation. Besides, the discussions and the simulation results carefully show how the proposed method achieves better performance than other methods.

Journal of Zhejiang University-SCIENCE, 38 Zheda Road, Hangzhou 310027, China
Tel: +86-571-87952783; E-mail: cjzhang@zju.edu.cn
Copyright © 2000 - Journal of Zhejiang University-SCIENCE