Seeing Sound

Audio annotation is key to developing machine-listening systems, yet effective ways to accurately and rapidly obtain crowdsourced audio annotations remain understudied. In this work, we seek to quantify the reliability/redundancy trade-off in crowdsourced soundscape annotation, investigate how visualizations affect accuracy and efficiency, and characterize how performance varies as a function of audio characteristics. In a controlled experiment, we varied the sound visualizations and the complexity of the soundscapes presented to human annotators. Results show that more complex audio scenes yield lower annotator agreement, and that spectrogram visualizations produce higher-quality annotations at a lower cost in time and human labor. We also find that soundscape complexity affects recall more than precision, and that mistakes can often be attributed to particular sound event characteristics. These findings have implications not only for how we design annotation tasks and interfaces for audio data, but also for how we train and evaluate machine-listening systems.
