Environmental sound processing and its applications

As part of the effort to develop techniques for understanding environments using sound, many studies in the field of computational auditory scene analysis have focused on using computers to perform functions carried out naturally by the human auditory system. Thanks to recent progress in machine‐learning techniques, environmental sound‐processing techniques have improved significantly, and a widening variety of applications has generated considerable interest in the field. In this review, we introduce the fundamental techniques of environmental sound processing, recent advances in front‐end and back‐end processing, and potential applications of these techniques. We also discuss prospects for further progress in environmental sound processing and the challenges still to be overcome. © 2019 Institute of Electrical Engineers of Japan. Published by John Wiley & Sons, Inc.
