Model-Based Multiple Pitch Tracking Using Factorial HMMs: Model Adaptation and Inference

Robustness against noise and interfering audio signals is one of the main challenges in speech recognition and audio analysis. One avenue to approach this challenge is single-channel multiple-source modeling. Factorial hidden Markov models (FHMMs) can model acoustic scenes with multiple sources interacting over time. While these models achieve good performance on specific tasks, serious limitations still restrict their applicability in many domains. In this paper, we generalize these models and enhance their applicability. In particular, we develop an EM-like iterative adaptation framework that adapts the model parameters to the specific situation (e.g., the actual speakers, gain, and acoustic channel) using only speech mixture data, whereas current approaches require source-specific training data to learn the model. Since inference in FHMMs is an essential ingredient for adaptation, we further develop efficient inference methods based on observation-likelihood pruning. Both the adaptation framework and the efficient inference are empirically evaluated on the task of multipitch tracking using the GRID corpus.
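To illustrate the inference idea, the following is a minimal toy sketch of max-product (Viterbi) inference over the joint state space of a two-chain FHMM, where joint states whose observation log-likelihood falls more than a margin below the frame maximum are pruned. All parameters here (state count, additive emission model, pruning margin `beta`) are hypothetical placeholders for illustration, not the interaction model or settings used in the paper.

```python
import numpy as np

# Toy two-source FHMM with hypothetical parameters (illustration only).
rng = np.random.default_rng(0)
K = 5                                    # states per source chain
T = 20                                   # number of frames
A = rng.dirichlet(np.ones(K), size=K)    # transition matrix, shared by both chains
pi = np.ones(K) / K                      # uniform initial state distribution
means = rng.normal(size=K)               # per-state emission means (1-D toy model)

def joint_loglik(x):
    # Observation log-likelihood for every joint state (k1, k2); the two
    # sources are combined additively here as a stand-in interaction model.
    m = means[:, None] + means[None, :]
    return -0.5 * (x - m) ** 2

def viterbi_pruned(obs, beta=8.0):
    """Viterbi over the K*K joint state space with observation-likelihood
    pruning: joint states scoring more than `beta` below the per-frame
    maximum log-likelihood are discarded before the next recursion step."""
    logA = np.log(A)
    delta = np.log(pi)[:, None] + np.log(pi)[None, :] + joint_loglik(obs[0])
    back = []
    for t in range(1, len(obs)):
        ll = joint_loglik(obs[t])
        keep = ll >= ll.max() - beta       # observation-likelihood pruning mask
        # Transition scores factorize over the two chains.
        scores = (delta[:, :, None, None]
                  + logA[:, None, :, None]   # chain 1: i1 -> j1
                  + logA[None, :, None, :])  # chain 2: i2 -> j2
        flat = scores.reshape(K * K, K, K)   # flatten predecessor joint states
        back.append(flat.argmax(axis=0))     # best predecessor per (j1, j2)
        delta = flat.max(axis=0) + ll
        delta[~keep] = -np.inf               # pruned joint states drop out
    # Backtrack the best surviving joint path.
    j = np.unravel_index(int(delta.argmax()), (K, K))
    path = [j]
    for bp in reversed(back):
        j = np.unravel_index(int(bp[j]), (K, K))
        path.append(j)
    return path[::-1]

obs = rng.normal(size=T)                  # synthetic observations
path = viterbi_pruned(obs)                # list of (state_1, state_2) per frame
```

Pruning reduces the set of active joint states per frame from K^2 to only those consistent with the observation, which is what makes inference in large factorial state spaces tractable; the margin `beta` trades accuracy against speed.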
