Paper information - Harmonic-aligned Frame Mask Based on Non-stationary Gabor Transform with Application to Content-dependent Speaker Comparison

We propose a harmonic-aligned frame mask for speech signals using the non-stationary Gabor transform (NSGT). A frame mask operates on the transform coefficients of a signal and thereby converts the signal into a counterpart signal, so the mask characterizes the difference between the two signals. In preceding studies, frame masks based on the regular Gabor transform were applied to the analysis of single-note instrumental sounds. This study extends the frame mask approach to speech signals. In voiced speech, the fundamental frequency usually changes continuously over time. We employ NSGT with a pitch-dependent, and therefore time-varying, frequency resolution to attain harmonic alignment in the transform domain, which yields harmonic-aligned frame masks for speech signals. We propose to apply the harmonic-aligned frame mask to content-dependent speaker comparison. Frame masks computed from voiced signals of the same vowel uttered by different speakers were used as similarity measures to compare and distinguish speaker identities (SID). Results obtained with deep neural networks demonstrate that the proposed frame mask is valid for representing speaker characteristics and shows potential for SID applications in limited-data scenarios.
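The idea of a mask that acts on transform coefficients and maps one signal onto another can be illustrated with a small sketch. It is not the paper's method: it uses a fixed-resolution short-time Fourier transform in place of NSGT, two synthetic harmonic signals in place of recorded vowels, and a regularized element-wise ratio as a crude mask estimate. The function names (`stft`, `estimate_mask`, `apply_mask`), the window settings, and the regularization constant `eps` are all assumptions made for illustration.

```python
import numpy as np

def stft(x, win_len=256, hop=64):
    """Fixed-resolution Gabor/STFT analysis with a Hann window (not NSGT)."""
    window = np.hanning(win_len)
    n_frames = 1 + (len(x) - win_len) // hop
    frames = np.stack([x[i * hop:i * hop + win_len] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)  # shape: (n_frames, win_len // 2 + 1)

def estimate_mask(c_source, c_target, eps=1e-8):
    """Regularized element-wise ratio so that mask * c_source ~= c_target.

    Illustrative assumption only; the paper may estimate masks differently.
    """
    return c_target * np.conj(c_source) / (np.abs(c_source) ** 2 + eps)

def apply_mask(mask, c_source):
    """Apply the mask by point-wise multiplication in the transform domain."""
    return mask * c_source

# Two toy "voiced" signals: same pitch, different harmonic amplitudes.
fs = 16000
t = np.arange(fs // 4) / fs          # 0.25 s of signal
f0 = 120.0                           # hypothetical fundamental frequency in Hz
source = sum(a * np.sin(2 * np.pi * k * f0 * t)
             for k, a in enumerate([1.0, 0.5, 0.25], start=1))
target = sum(a * np.sin(2 * np.pi * k * f0 * t)
             for k, a in enumerate([0.3, 0.8, 0.6], start=1))

c_src = stft(source)
c_tgt = stft(target)
mask = estimate_mask(c_src, c_tgt)
c_hat = apply_mask(mask, c_src)

# Relative error between masked source coefficients and target coefficients.
rel_err = np.linalg.norm(c_hat - c_tgt) / np.linalg.norm(c_tgt)
print(mask.shape, rel_err)
```

In this toy setting the mask magnitudes near the harmonics reflect the amplitude ratios between the two signals, which gives an intuition for why such masks might serve as a similarity measure. With a time-varying pitch, a fixed-resolution transform like this one would not keep the harmonics aligned to the same bins across frames, which is the problem the pitch-dependent NSGT is meant to address.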

Feng Huang | Péter Balázs
