The bag-of-frames approach to audio pattern recognition: a sufficient model for urban soundscapes but not for polyphonic music.

The "bag-of-frames" (BOF) approach to audio pattern recognition represents a signal by the long-term statistical distribution of its local spectral features. This approach has proved nearly optimal for simulating the auditory perception of natural and human environments (or soundscapes), and it is also the predominant paradigm for extracting high-level descriptions from music signals. However, recent studies show that, contrary to its application to soundscape signals, BOF provides only limited performance when applied to polyphonic music. This paper explicitly examines the difference between urban soundscapes and polyphonic music with respect to their modeling with the BOF approach. First, applying the same measure of acoustic similarity to both a soundscape and a music data set confirms that the BOF approach can model soundscapes with near-perfect precision, and that it exhibits none of the limitations observed on the music data set. Second, modifying this measure with two custom homogeneity transforms reveals critical differences in the temporal and statistical structure of the typical frame distribution of each type of signal. These differences may explain the uneven performance of BOF algorithms on soundscape and music signals, and suggest that human perception of the two relies on cognitive processes of a different nature.
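A common instantiation of the BOF idea summarizes a signal's short-term spectral frames by a single Gaussian and compares two signals with a symmetrised Kullback-Leibler divergence between their Gaussians. The sketch below is a minimal NumPy illustration of this generic scheme, not the paper's exact similarity measure: the log-spectrum features (a stand-in for MFCCs), frame sizes, and the single-Gaussian frame model are all illustrative assumptions.

```python
import numpy as np

def bof_features(signal, frame_len=512, hop=256, n_bands=20):
    """Frame the signal and compute log-magnitude spectra, a simple
    stand-in for the MFCC-type local features used in BOF systems."""
    window = np.hanning(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        spectrum = np.abs(np.fft.rfft(signal[start:start + frame_len] * window))
        frames.append(np.log(spectrum[:n_bands] + 1e-10))
    return np.array(frames)

def gaussian_kl(mu0, cov0, mu1, cov1):
    """KL divergence KL(N0 || N1) between two multivariate Gaussians."""
    d = len(mu0)
    inv1 = np.linalg.inv(cov1)
    diff = mu1 - mu0
    _, logdet0 = np.linalg.slogdet(cov0)
    _, logdet1 = np.linalg.slogdet(cov1)
    return 0.5 * (np.trace(inv1 @ cov0) + diff @ inv1 @ diff
                  - d + logdet1 - logdet0)

def bof_distance(x, y):
    """Symmetrised KL between single-Gaussian summaries of the frame
    distributions (the 'bags of frames') of two signals."""
    fx, fy = bof_features(x), bof_features(y)
    reg = 1e-6 * np.eye(fx.shape[1])  # keep covariances invertible
    mx, cx = fx.mean(0), np.cov(fx, rowvar=False) + reg
    my, cy = fy.mean(0), np.cov(fy, rowvar=False) + reg
    return gaussian_kl(mx, cx, my, cy) + gaussian_kl(my, cy, mx, cx)
```

Under this scheme, two signals whose frames are drawn from similar spectral distributions receive a small distance regardless of the temporal order of the frames, which is precisely the property the paper probes with its homogeneity transforms.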
