Detection of speech and music based on spectral tracking

Handling sounds that contain spectrally and temporally complex signals, such as speech and music, remains an open problem in real-world audio information processing. We have devised (1) a classification method for speech and music based on sinusoidal trajectories and (2) a detection method, built on (1), for speech with background music. Sinusoidal trajectories represent the temporal characteristics of sound categories such as speech, singing voice, and musical instruments. From the trajectories, 20 temporal features are extracted and used by statistical classifiers to assign sound segments to these categories. The average F1 measure for the classification of nonmixed sounds was 0.939, which appears high enough to support the subsequent detection of sound categories in a mixed sound. To handle temporally overlapping sounds, we also developed an optimal spectral tracking algorithm with low computational complexity; it is based on dynamic programming (DP) with iterative improvement for the sinusoidal decomposition of signals. A temporal mixture of speech and music is classified and detected by statistically integrating the temporal features of the trajectories and optimizing the combination of their categories. The detection method was evaluated experimentally on 400 samples of mixed sounds. The average narrow-band correlation coefficient was 0.55 and the average improvement in segmental signal-to-noise ratio (SNR) was +5.67 dB, which indicates the effectiveness of the proposed detection method.
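To make the segmental SNR improvement concrete, the following is a minimal sketch of one common way to compute it, not the authors' evaluation code. The abstract does not give the exact definition, so the frame length, the clamping range of -10 to 35 dB, and the function names `segmental_snr` and `segmental_snr_improvement` are all assumptions made for illustration. The improvement is taken as the segmental SNR of the detected signal minus that of the unprocessed mixture, both measured against a clean reference.

```python
import math


def segmental_snr(reference, estimate, frame_len=256,
                  floor_db=-10.0, ceil_db=35.0):
    """Mean per-frame SNR in dB of `estimate` against `reference`.

    frame_len, floor_db and ceil_db are assumed values, not taken
    from the paper.
    """
    n = min(len(reference), len(estimate))
    frame_snrs = []
    for start in range(0, n - frame_len + 1, frame_len):
        ref = reference[start:start + frame_len]
        est = estimate[start:start + frame_len]
        signal_energy = sum(r * r for r in ref)
        noise_energy = sum((r - e) ** 2 for r, e in zip(ref, est))
        if signal_energy == 0.0:
            continue  # skip frames where the reference is silent
        if noise_energy == 0.0:
            snr = ceil_db  # perfect frame: use the upper clamp
        else:
            snr = 10.0 * math.log10(signal_energy / noise_energy)
        # Clamp so that a few extreme frames do not dominate the mean.
        frame_snrs.append(min(max(snr, floor_db), ceil_db))
    if not frame_snrs:
        raise ValueError("no usable frames")
    return sum(frame_snrs) / len(frame_snrs)


def segmental_snr_improvement(reference, mixture, detected, **kwargs):
    """Segmental SNR gain of `detected` over the unprocessed `mixture`."""
    return (segmental_snr(reference, detected, **kwargs)
            - segmental_snr(reference, mixture, **kwargs))
```

Under this definition, a method that halves the amplitude of the interfering sound in every frame yields an improvement of about 6 dB, which gives a rough sense of scale for the reported +5.67 dB.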
