Investigating audio data visualization for interactive sound recognition

Interactive machine learning techniques have great potential to personalize media recognition models for individual users by letting them browse and annotate large amounts of training data. However, graphical user interfaces (GUIs) for interactive machine learning have mainly been investigated in image and text recognition scenarios, not in other data modalities such as sound. When users browse a large collection of audio files to search for and annotate samples corresponding to their own sound recognition classes, it is difficult for them to navigate the overall structure of the samples because audio data are not inherently visual. In this work, we investigate design issues for interactive sound recognition by comparing visualization techniques ranging from audio spectrograms to deep learning-based audio-to-image retrieval. Based on an analysis of a user study, we clarify the advantages and disadvantages of these audio visualization techniques and provide design implications for interactive sound recognition GUIs that handle massive collections of audio samples.
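As a point of reference for the simplest visualization compared here, the sketch below renders a log-frequency spectrogram thumbnail for a single audio clip. It is a minimal illustration assuming the librosa and matplotlib packages, with "sample.wav" as a placeholder path; it is not code from the paper.

```python
# Minimal sketch: a spectrogram thumbnail for one audio clip, the
# baseline visualization technique compared in the study.
# Assumes librosa and matplotlib; "sample.wav" is a placeholder path.
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

# Load the clip (librosa resamples to 22.05 kHz mono by default).
y, sr = librosa.load("sample.wav")

# Short-time Fourier transform -> magnitude in decibels.
stft = librosa.stft(y)
s_db = librosa.amplitude_to_db(np.abs(stft), ref=np.max)

# Plot with a log-scaled frequency axis so harmonic structure is visible.
fig, ax = plt.subplots(figsize=(8, 3))
img = librosa.display.specshow(s_db, sr=sr, x_axis="time", y_axis="log", ax=ax)
fig.colorbar(img, ax=ax, format="%+2.0f dB")
ax.set_title("sample.wav")
plt.tight_layout()
plt.show()
```

A browsing GUI would tile such thumbnails across the whole collection; the study's other conditions replace the spectrogram with images retrieved or generated from the audio by deep cross-modal models.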
