COMPUTATIONALLY MEASURABLE TEMPORAL DIFFERENCES BETWEEN SPEECH AND SONG

Automatic audio signal classification is a general research area in which algorithms are developed to allow computer systems to understand and interact with their audio environment. Human utterance classification is a specific subset of audio signal classification in which the domain of audio signals is restricted to those likely to be encountered when interacting with humans. Speech recognition software performs classification in a domain restricted to human speech, but human utterances also include singing, shouting, poetry, and prosodic speech, for which current recognition engines are not designed. Another recent and relevant audio signal classification task is the discrimination between speech and music. Many radio stations have periods of speech (news, information reports, commercials) interspersed with periods of music, and systems have been designed to search for one type of sound in preference to the other. Many current speech/music discrimination systems rely on characteristics of the human voice, so they cannot distinguish speech from music when the music is an individual unaccompanied singer. This thesis presents research into the problem of human utterance classification, specifically differentiation between talking and singing. It addresses the question: "Are there measurable differences between the auditory waveforms produced by talking and singing?" Preliminary background is presented to acquaint the reader with the science used in the algorithm development. A corpus of sounds was collected to study the physical and perceptual differences between singing and talking, and the procedures and results of this collection are presented. A set of 17 features is developed to differentiate between talking and singing, and to investigate the intermediate vocalizations between the two. The results of these features are examined and evaluated.
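To illustrate the kind of temporal measurement the abstract describes, the sketch below compares the frame-to-frame stability of a simple waveform feature (zero-crossing rate) on two synthetic signals: a steady tone standing in for a sustained sung note, and a frequency glide standing in for the wandering pitch of speech. This is not the thesis's actual feature set; the signals, frame sizes, and the stability measure are illustrative assumptions only.

```python
import numpy as np

def frame_zero_crossing_rates(signal, frame_len=1024, hop=512):
    """Zero-crossing rate per frame: fraction of adjacent-sample sign changes."""
    rates = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        crossings = np.sum(np.abs(np.diff(np.signbit(frame).astype(int))))
        rates.append(crossings / (frame_len - 1))
    return np.array(rates)

def temporal_stability(rates):
    """Standard deviation of the frame-level feature; under this toy
    hypothesis, a lower value suggests a steadier, more song-like sound."""
    return float(np.std(rates))

# Synthetic stand-ins (assumptions, not corpus data):
sr = 8000
t = np.arange(sr) / sr                       # one second of samples
sung = np.sin(2 * np.pi * 220 * t)           # steady 220 Hz "sung" tone
# "Spoken" stand-in: a chirp gliding from 150 Hz to 350 Hz
spoken = np.sin(2 * np.pi * (150 * t + 100 * t ** 2))

sung_stability = temporal_stability(frame_zero_crossing_rates(sung))
spoken_stability = temporal_stability(frame_zero_crossing_rates(spoken))
print(f"sung:   {sung_stability:.5f}")
print(f"spoken: {spoken_stability:.5f}")
```

On these signals the glide's zero-crossing rate drifts across frames while the steady tone's stays nearly constant, so the "spoken" stand-in scores a higher (less stable) value; real talking-vs-singing discrimination would combine many such features, as the thesis does with its set of 17.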
