Modulation-scale analysis for content identification

For nonstationary signal classification, e.g., of speech or music, features are traditionally extracted from a short, time-shifted data window. For many applications, these short-term features do not efficiently capture or represent longer term signal variation. Partially motivated by human audition, we overcome the deficiencies of short-term features by employing modulation-scale analysis for long-term feature analysis. Our analysis, which integrates time-frequency theory with psychoacoustic results on modulation frequency perception, not only retains short-term information about the signals but also provides long-term information representing patterns of time variation. This paper describes these features and their normalization. We demonstrate the effectiveness of our long-term features over conventional short-term features in content-based audio identification. A simulated study using a large data set, including nearly 10 000 songs and requiring over a billion pairwise audio comparisons, shows that modulation-scale features improve content identification accuracy substantially, especially when time and frequency distortions are imposed.
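
To make the idea of long-term modulation features concrete, the following is a minimal sketch and not the implementation evaluated in this paper. It assumes a Hann-windowed spectrogram as the short-term stage, a Fourier transform along time in each acoustic-frequency channel as the modulation stage, octave-wide modulation bands as a stand-in for the scale axis, and division by each channel's zero-modulation (DC) term as the normalization. The function names, frame length, hop size, and number of bands are illustrative assumptions.

```python
import numpy as np

def spectrogram(x, frame_len=256, hop=128):
    """Short-term stage: power spectrogram, shape (n_frames, n_bins)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)) ** 2

def modulation_scale_features(x, fs, frame_len=256, hop=128, n_scales=5):
    """Long-term stage (illustrative): octave-band modulation energy per channel.

    Returns (features, band_edges_hz); features has shape (n_scales, n_bins).
    """
    spec = spectrogram(x, frame_len, hop)
    # Second transform along the time axis of every acoustic-frequency channel.
    mod = np.abs(np.fft.rfft(spec, axis=0))
    mod_freqs = np.fft.rfftfreq(spec.shape[0], d=hop / fs)
    # Assumed normalization: divide by the zero-modulation term of each channel,
    # so a fixed gain applied to a channel cancels out.
    dc = mod[0] + 1e-12
    # Octave-spaced band edges, ascending up to the highest modulation frequency.
    edges = mod_freqs[-1] / 2.0 ** np.arange(n_scales, -1, -1)
    feats = np.empty((n_scales, spec.shape[1]))
    for k in range(n_scales):
        band = (mod_freqs > edges[k]) & (mod_freqs <= edges[k + 1])
        feats[k] = mod[band].sum(axis=0) / dc
    return feats, edges

# Usage: a 1 kHz tone whose amplitude is modulated at 4 Hz.
fs = 8000
t = np.arange(2 * fs) / fs
x = (1 + 0.8 * np.cos(2 * np.pi * 4 * t)) * np.sin(2 * np.pi * 1000 * t)
feats, edges = modulation_scale_features(x, fs)
```

In this sketch the 1 kHz channel shows most of its normalized energy in the octave band that contains 4 Hz, which is information that no single short-term frame carries. Because each channel is divided by its own DC term, scaling the input by a constant leaves that channel's features essentially unchanged.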
