An overview of applications and advancements in automatic sound recognition

Automatic sound recognition (ASR) has attracted growing and wide-ranging interest in recent years. In this paper, we review important contributions to ASR techniques, mainly from the last decade and a half. As with speech recognition systems, the robustness of an ASR system depends largely on the choice of feature(s) and classifier(s). We take a broad perspective, providing an overview of the features and classifiers used in ASR systems, from early work in content-based audio classification to more recent developments in applications such as sound event recognition, audio surveillance, and environmental sound recognition. We also review techniques used in noise-robust sound recognition systems, as well as feature optimization methods. Finally, we discuss some of the less commonly known applications of ASR.
