A Human-in-the-Loop System for Sound Event Detection and Annotation

Labeling audio events is essential for many tasks, but finding and labeling sound events within a long audio file is tedious and time-consuming. When very little labeled data is available (e.g., a single labeled example), training an automatic labeler is often infeasible because many techniques (e.g., deep learning) require a large number of human-labeled training examples. Fully automated labeling may also not agree with human labeling closely enough for many uses. To address this, we present a human-in-the-loop sound labeling system that helps a user quickly label target sound events in a long audio recording. It reduces the time required to label a long audio file (e.g., 20 hours) in which the target sounds are sparsely distributed (10% or less of the audio contains the target) and there are too few labeled examples (e.g., one) to train a state-of-the-art machine audio labeling system. To evaluate the effectiveness of our tool, we performed a human-subject study. The results show that it helped participants label target sound events twice as fast as manual labeling. In addition to measuring the overall performance of the proposed system, we also measured interaction overhead and machine accuracy, the two key factors that determine overall performance. The analysis shows that an ideal interface with no interaction overhead at all could speed labeling by as much as a factor of four.
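As a concrete illustration of the interactive workflow the abstract describes, the sketch below shows one plausible human-in-the-loop labeling cycle: starting from a single labeled example, the system ranks unlabeled audio windows by similarity to the examples the user has confirmed so far, queries the user about the top-ranked window, and folds the answer back into the labeled set. All specifics here (fixed-length feature windows, dot-product similarity, the names human_in_the_loop_label and ask_human) are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def human_in_the_loop_label(windows, seed_index, ask_human, rounds=10):
    """Iteratively label audio windows with human verification.

    windows    : (n, d) array, one feature vector per fixed-length window
    seed_index : index of the single human-labeled positive example
    ask_human  : callable(index) -> bool, the user's yes/no verdict
    """
    labels = {seed_index: True}  # start from one labeled example
    for _ in range(min(rounds, len(windows) - 1)):
        pos = [i for i, is_target in labels.items() if is_target]
        # Score every window by its best similarity to any confirmed
        # positive (nearest-neighbor-style relevance feedback).
        scores = np.max(windows[pos] @ windows.T, axis=0)
        # Query the user about the highest-scoring unlabeled window.
        ranked = np.argsort(-scores)
        candidate = int(next(i for i in ranked if i not in labels))
        labels[candidate] = ask_human(candidate)
    return labels

# Toy usage: 100 random 16-dim feature windows, with an oracle that
# calls a window "target" when its first feature is large.
rng = np.random.default_rng(0)
feats = rng.normal(size=(100, 16))
result = human_in_the_loop_label(feats, seed_index=0,
                                 ask_human=lambda i: feats[i, 0] > 1.0)
```

Each pass through such a loop incurs both machine time (scoring) and interaction overhead (the human query); the abstract's analysis attributes the gap between the observed twofold speedup and the ideal fourfold speedup to that interaction overhead.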
