Detection and modeling of transient audio signals with prior information

Many musical audio signals are well represented as a sum of sinusoids with slowly varying parameters. This representation has uses in audio coding, time- and pitch-scale modification, and automated music analysis, among other areas. Transients (events where the spectral content changes abruptly, or regions for which spectral content is best modeled as undergoing persistent change) pose particular challenges for these applications. We aim to detect abrupt-change transients, identify transient region boundaries, and develop new representations that use these detection capabilities to reduce perceived artifacts in time- and pitch-scale modifications. In particular, we introduce a hybrid sinusoidal/source-filter model which faithfully reproduces attack transient characteristics under time and pitch modifications.

The detection tasks prove difficult for sufficiently complex and heterogeneous musical signals. Fortunately, musical signals are highly structured, both at the signal level, in terms of the spectrotemporal structure of note events, and at higher levels, in terms of melody and rhythm. These structures generate context useful in predicting attributes such as pitch content, the presence and location of abrupt-change transients associated with musical onsets, and the boundaries of transient regions. To this end, a dynamic Bayesian framework is proposed in which contextual predictions are integrated with signal information in order to make optimal decisions concerning these attributes. The result is a joint segmentation and melody retrieval for nominally monophonic signals. The system detects note event boundaries and pitches, and also yields a frame-level sub-segmentation of these events into transient/steady-state regions. The approach is successfully applied to notoriously difficult examples such as bowed string recordings captured in highly reverberant environments.
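As a rough illustration of how contextual predictions can be combined with frame-level signal evidence, the sketch below runs Viterbi decoding on a two-state hidden Markov model that labels frames as transient or steady-state. This is a deliberately simplified stand-in, not the dynamic Bayesian model described in this work: the two-state layout, the transition probabilities, and the per-frame likelihood values are all assumptions chosen for illustration.

```python
import math

# Hypothetical state labels for illustration: 0 = transient, 1 = steady-state.
STATES = ("transient", "steady")


def viterbi(log_prior, log_trans, log_obs):
    """Return the most probable state path for a small HMM.

    log_prior[s]    : log P(first state = s)
    log_trans[r][s] : log P(next state = s | current state = r)
    log_obs[t][s]   : log P(frame t evidence | state = s)
    """
    n_states = len(log_prior)
    score = [log_prior[s] + log_obs[0][s] for s in range(n_states)]
    backpointers = []
    for t in range(1, len(log_obs)):
        new_score, ptr = [], []
        for s in range(n_states):
            # Contextual prediction: best predecessor under the transition model.
            best_prev = max(range(n_states),
                            key=lambda r: score[r] + log_trans[r][s])
            ptr.append(best_prev)
            # Combine the prediction with the signal evidence for frame t.
            new_score.append(score[best_prev] + log_trans[best_prev][s]
                             + log_obs[t][s])
        score = new_score
        backpointers.append(ptr)
    # Trace the best path backwards from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for ptr in reversed(backpointers):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path


def logs(rows):
    """Convert nested probability lists to log-probabilities."""
    return [[math.log(p) for p in row] for row in rows]


# Assumed toy values: "sticky" transitions and six frames of evidence,
# where frame 2 weakly (and misleadingly) favours the steady state.
log_prior = [math.log(0.5), math.log(0.5)]
log_trans = logs([[0.9, 0.1], [0.1, 0.9]])
frame_likelihoods = [[0.9, 0.1], [0.8, 0.2], [0.4, 0.6],
                     [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
log_obs = logs(frame_likelihoods)

framewise = [max(range(2), key=lambda s: row[s]) for row in frame_likelihoods]
decoded = viterbi(log_prior, log_trans, log_obs)
print("frame-by-frame:", [STATES[s] for s in framewise])
print("with context:  ", [STATES[s] for s in decoded])
```

In this toy run the frame-by-frame decision flips to the steady state at frame 2, whereas the context-aware path keeps that frame in the transient region because a one-frame excursion is improbable under the assumed transition model.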
The proposed transcription engine is driven by a probabilistic model of short-time Fourier transform peaks given pitch content hypotheses. The model proves robust to missing and spurious peaks as well as to uncertainties about timbre and inharmonicity. Evaluating the peak likelihood requires marginalizing over a number of observation-template linkages that grows exponentially with the number of observed peaks; to remedy this, a Markov-chain Monte Carlo (MCMC) traversal is developed that yields virtually identical results with greatly reduced computation.
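To make the cost of the exact evaluation concrete, the following toy sketch marginalizes a peak likelihood over every linkage by brute force. It is not the model developed in this work: the harmonic template, the Gaussian frequency deviation, the detection probability `p_detect`, and the uniform `spurious_density` are all assumptions introduced for illustration, and no MCMC approximation is attempted here.

```python
import itertools
import math


def harmonic_template(f0, n_partials):
    """Ideal partial frequencies (Hz) for a pitch hypothesis f0 (assumed harmonic)."""
    return [f0 * k for k in range(1, n_partials + 1)]


def gaussian_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def enumerate_linkages(n_obs, n_tmpl):
    """Yield every linkage: entry i is the template index linked to observed
    peak i, or None if that peak is spurious. Each template partial is linked
    to at most one observed peak; unlinked partials count as missing."""
    options = [None] + list(range(n_tmpl))
    for link in itertools.product(options, repeat=n_obs):
        used = [t for t in link if t is not None]
        if len(used) == len(set(used)):
            yield link


def peak_likelihood(peaks, f0, n_partials=4, std=5.0,
                    p_detect=0.9, spurious_density=1e-3):
    """Toy likelihood of observed peak frequencies given pitch f0,
    summed exhaustively over all linkages (all parameters are assumptions)."""
    template = harmonic_template(f0, n_partials)
    total = 0.0
    for link in enumerate_linkages(len(peaks), n_partials):
        n_linked = sum(1 for t in link if t is not None)
        # Linkage weight: linked partials were detected, the rest are missing.
        weight = (p_detect ** n_linked) * ((1.0 - p_detect) ** (n_partials - n_linked))
        for peak, t in zip(peaks, link):
            if t is None:
                weight *= spurious_density          # spurious peak
            else:
                weight *= gaussian_pdf(peak, template[t], std)
        total += weight
    return total


# Assumed example: partials of a 220 Hz tone with the third partial missing
# and one spurious peak at 1500 Hz.
peaks = [221.0, 439.0, 880.5, 1500.0]
n_linkages = sum(1 for _ in enumerate_linkages(len(peaks), 4))
print("linkages summed per hypothesis:", n_linkages)
for f0 in (220.0, 300.0):
    print(f0, peak_likelihood(peaks, f0))
```

Even with four observed peaks and four template partials, the sum already runs over a few hundred linkages per pitch hypothesis, and the count grows rapidly as peaks are added, which is the motivation for replacing exhaustive enumeration with a sampling-based traversal.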
