Multipitch Analysis of Polyphonic Music and Speech Signals Using an Auditory Model

A method is described for estimating the fundamental frequencies of several concurrent sounds in polyphonic music and multiple-speaker speech signals. The method consists of a computational model of the human auditory periphery, followed by a periodicity-analysis mechanism in which fundamental frequencies are iteratively detected and canceled from the mixture signal. The auditory model needs to be computed only once, and a computationally efficient strategy is proposed for implementing it. Simulation experiments were carried out on mixtures of musical sounds and mixed speech utterances. The proposed method outperformed two reference methods in the evaluations and showed a high degree of robustness when processing signals in which substantial parts of the audible spectrum were removed to simulate bandlimited interference. Different system configurations were studied to identify the conditions under which pitch analysis using an auditory model is advantageous over conventional time- or frequency-domain approaches.
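The abstract outlines an iterative estimate-and-cancel loop: the most salient fundamental frequency is detected, its contribution is removed from the mixture, and the procedure repeats for the remaining sounds. The Python sketch below illustrates only that loop under simplifying assumptions: it replaces the paper's auditory-periphery front end with a plain FFT magnitude spectrum and uses a generic harmonic-summation salience measure. The function name iterative_multipitch and all parameter values are illustrative and not taken from the paper.

# Minimal sketch of iterative F0 detection and cancellation (not the paper's
# exact algorithm). Each round, the F0 candidate with the largest harmonic
# salience is picked from a magnitude spectrum and narrow bands around its
# partials are zeroed before the next round. The auditory-model front end
# described in the abstract is omitted here.

import numpy as np

def iterative_multipitch(x, fs, n_sounds, f0_range=(60.0, 600.0), n_harm=10):
    """Return a list of estimated fundamental frequencies (Hz)."""
    n_fft = 4096
    frame = x[:n_fft]
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame)), n_fft))
    freqs = np.fft.rfftfreq(n_fft, 1.0 / fs)

    candidates = np.arange(f0_range[0], f0_range[1], 1.0)  # 1-Hz candidate grid
    estimates = []
    for _ in range(n_sounds):
        # Harmonic summation: salience of each F0 candidate is the sum of the
        # spectrum at its first n_harm harmonics, weighted to favor low partials.
        salience = np.zeros(len(candidates))
        for i, f0 in enumerate(candidates):
            for h in range(1, n_harm + 1):
                bin_idx = int(round(h * f0 / (fs / n_fft)))
                if bin_idx < len(spec):
                    salience[i] += spec[bin_idx] / h
        f0_hat = candidates[np.argmax(salience)]
        estimates.append(f0_hat)

        # Cancel the detected sound: zero out narrow bands around its harmonics
        # so later iterations are not dominated by the same source.
        for h in range(1, n_harm + 1):
            lo, hi = 0.97 * h * f0_hat, 1.03 * h * f0_hat
            spec[(freqs >= lo) & (freqs <= hi)] = 0.0
    return estimates

if __name__ == "__main__":
    fs = 16000
    t = np.arange(0, 0.5, 1.0 / fs)
    # Two concurrent harmonic tones at 220 Hz and 330 Hz, three partials each.
    mix = sum(np.sin(2 * np.pi * f0 * h * t) / h
              for f0 in (220.0, 330.0) for h in (1, 2, 3))
    print(iterative_multipitch(mix, fs, n_sounds=2))

Running the demo at the bottom should print fundamental-frequency estimates near 220 Hz and 330 Hz (within the resolution of the candidate grid and FFT). In a faithful implementation of the method described above, an auditory filterbank, compression, and within-channel periodicity analysis would replace the FFT magnitude spectrum used here.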
