Sound event recognition in unstructured environments using spectrogram image processing

The objective of this research is to develop feature extraction and classification techniques for sound event recognition (SER) in unstructured environments. Although this field is traditionally overshadowed by the more popular field of automatic speech recognition (ASR), an SER system that can achieve human-like sound recognition performance opens up a range of novel application areas. These include acoustic surveillance, bio-acoustical monitoring, environmental context detection, healthcare applications and, more generally, the rich transcription of acoustic environments. The challenge in such environments lies in adverse effects such as noise, distortion and multiple sources, which are more likely to occur with distant microphones than with the close-talking microphones that are more common in ASR. In addition, the characteristics of acoustic events are less well defined than those of speech, and there is no sub-word dictionary available comparable to the phonemes of speech. Therefore, the performance of ASR systems typically degrades dramatically in these unstructured environments, and it is important to develop new methods that perform well on this challenging task.

In this thesis, the approach taken is to interpret the sound event as a two-dimensional spectrogram image, with time and frequency as the two axes. This enables novel SER methods to be developed based on spectrogram image processing, inspired by techniques from the field of image processing. The motivation is to find an automatic approach to “spectrogram reading”, in which humans visually recognise the different sound event signatures in the spectrogram. The advantages of such an approach are twofold. Firstly, the sound event image representation makes it possible to naturally capture the sound information in a two-dimensional feature. This has advantages over conventional one-dimensional frame-based features, which capture only a slice of spectral information
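
The idea of treating a sound event as a two-dimensional image can be illustrated with a minimal sketch. The code below is not the method developed in the thesis; it only shows one common way to turn a waveform into a time-frequency image. The function name `spectrogram_image`, the frame length, the hop size, the Hann window and the log-power scaling to the range [0, 1] are all assumptions chosen for illustration.

```python
import numpy as np


def spectrogram_image(signal, frame_len=256, hop=128):
    """Return a (frequency x time) log-power spectrogram scaled to [0, 1].

    Illustrative only: frame_len, hop and the Hann window are assumed values,
    not parameters taken from the thesis.
    """
    signal = np.asarray(signal, dtype=float)
    if signal.size < frame_len:
        raise ValueError("signal is shorter than one frame")

    window = np.hanning(frame_len)
    n_frames = 1 + (signal.size - frame_len) // hop

    # Each row is one windowed frame: the one-dimensional, frame-based view.
    frames = np.stack(
        [signal[i * hop: i * hop + frame_len] * window for i in range(n_frames)]
    )

    # Power spectrum of every frame, then transpose so that
    # rows are frequency bins and columns are time frames.
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    log_power = np.log(power + 1e-10).T

    # Rescale to [0, 1] so the array can be handled like a greyscale image.
    lo, hi = log_power.min(), log_power.max()
    if hi == lo:
        return np.zeros_like(log_power)
    return (log_power - lo) / (hi - lo)


# Usage: a one-second 1 kHz tone sampled at 8 kHz (synthetic test signal).
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
image = spectrogram_image(np.sin(2 * np.pi * 1000.0 * t))
print(image.shape)  # (frequency bins, time frames)
```

A single column of `image` corresponds to the frame-based slice of spectral information, whereas the full array keeps the joint time-frequency structure that image processing techniques can operate on.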


[226]  Tomi Kinnunen,et al.  Audio context recognition in variable mobile environments from short segments using speaker and language recognizers , 2012, Odyssey.

[227]  Andrey Temko,et al.  CLEAR Evaluation of Acoustic Event Detection and Classification Systems , 2006, CLEAR.

[228]  M. Basseville Distance measures for signal processing and pattern recognition , 1989 .

[229]  Gy Kovács,et al.  Localized spectro-temporal features for noise-robust speech recognition , 2010, 2010 International Joint Conference on Computational Cybernetics and Technical Informatics.

[230]  Andrew Blake,et al.  Multiscale Categorical Object Recognition Using Contour Fragments , 2008, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[231]  Daniel P. W. Ellis,et al.  Decoding speech in the presence of other sources , 2005, Speech Commun..

[232]  David Gerhard,et al.  Audio Signal Classification: History and Current Techniques , 2003 .

[233]  James R. Glass,et al.  Speech recognition with localized time-frequency pattern detectors , 2007, 2007 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU).

[234]  Haizhou Li,et al.  Selective gammatone filterbank feature for robust sound event recognition , 2010, INTERSPEECH.

[235]  Klaus Obermayer,et al.  Classification Schemes for Step Sounds Based on Gammatone-Filters , 2007, NIPS 2007.

[236]  Subhransu Maji,et al.  Classification using intersection kernel support vector machines is efficient , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[237]  Annamaria Mesaros,et al.  Sound Event Detection in Multisource Environments Using Source Separation , 2011 .

[238]  Michael A. Cowling,et al.  Non-Speech Environmental Sound Classification System for Autonomous Surveillance , 2004 .

[239]  David V. Anderson,et al.  Audio classification and scene recognition and for hearing aids , 2005, 2005 IEEE International Symposium on Circuits and Systems.

[240]  Douglas OʼShaughnessy Formant Estimation and Tracking , 2008 .

[241]  Bernt Schiele,et al.  Robust Object Detection with Interleaved Categorization and Segmentation , 2008, International Journal of Computer Vision.

[242]  Steve J. Young,et al.  HMM-based architecture for face identification , 1994, Image Vis. Comput..

[243]  Vesa T. Peltonen,et al.  Audio-based context recognition , 2006, IEEE Transactions on Audio, Speech, and Language Processing.

[244]  Shao-Hu Peng,et al.  A visual shape descriptor using sectors and shape context of contour lines , 2010, Inf. Sci..

[245]  Luc Van Gool,et al.  Fast PRISM: Branch and Bound Hough Transform for Object Class Detection , 2011, International Journal of Computer Vision.

[246]  R. Meddis Simulation of mechanical to neural transduction in the auditory receptor. , 1986, The Journal of the Acoustical Society of America.

[247]  Jean-Sebastien Legare,et al.  Face Recognition : Robustness of the ‘ Eigenface ’ Approach , 2005 .

[248]  Paul A. Viola,et al.  Robust Real-Time Face Detection , 2001, International Journal of Computer Vision.

[249]  Alfred Mertins,et al.  Automatic speech recognition and speech variability: A review , 2007, Speech Commun..

[250]  Taras Butko,et al.  Feature selection for multimodal: acoustic event detection , 2011 .

[251]  François Pachet,et al.  Exploring Billions of Audio Features , 2007, 2007 International Workshop on Content-Based Multimedia Indexing.

[252]  Eli Shechtman,et al.  In defense of Nearest-Neighbor based image classification , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[253]  Brian R Glasberg,et al.  Derivation of auditory filter shapes from notched-noise data , 1990, Hearing Research.

[254]  Wilhelm Burger,et al.  Digital Image Processing - An Algorithmic Introduction using Java , 2008, Texts in Computer Science.

[255]  Longbiao Wang,et al.  Robust distant speaker recognition based on position-dependent CMN by combining speaker-specific GMM with speaker-adapted HMM , 2007, Speech Commun..

[256]  Ben Pinkowski A Template-Based Approach for Recognition of Intermittent Sounds , 1989, Great Lakes Computer Science Conference.

[257]  Björn W. Schuller,et al.  Audio recognition in the wild: Static and dynamic classification on a real-world database of animal vocalizations , 2011, 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).