Semantic Labeling of Nonspeech Audio Clips

Human communication about entities and events is primarily linguistic in nature. While visual representations of information have also been shown to be highly effective, relatively little is known about the communicative power of nonlinguistic auditory representations. We created a collection of short nonlinguistic audio clips encoding familiar human activities, objects, animals, natural phenomena, machinery, and social scenes. We presented these sounds to a broad spectrum of anonymous human workers on Amazon Mechanical Turk and collected verbal sound labels. We analyzed the human labels in terms of their lexical and semantic properties to ascertain whether the audio clips evoke the information suggested by their pre-defined captions. We then measured the agreement among the semantically compatible labels for each sound clip. Finally, we examined which kinds of entities and events, when captured by nonlinguistic acoustic clips, appear to be well suited to eliciting information for communication, and which are less discriminable. Our work is set against the broader goal of creating resources that facilitate communication for people with some types of language loss. Furthermore, our data should prove useful for future research in machine analysis and synthesis of audio, such as computational auditory scene analysis, and in annotating and querying large collections of sound effects.
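
The abstract does not specify how per-clip label agreement was computed. As one illustrative possibility only, the sketch below computes a simple modal-agreement score over a clip's collected labels, assuming the free-form worker labels have already been normalized (e.g., lemmatized and grouped into semantically compatible synonym sets). The clip names, labels, and the label_agreement function are hypothetical and not taken from the paper.

    from collections import Counter

    def label_agreement(labels):
        """Fraction of labels that match the most frequent label (modal agreement)."""
        if not labels:
            return 0.0
        counts = Counter(labels)
        _, top_count = counts.most_common(1)[0]
        return top_count / len(labels)

    # Hypothetical worker labels for two clips, already normalized so that
    # semantically compatible responses map to the same label.
    clips = {
        "dog_bark.wav": ["dog", "dog", "dog", "dog", "wolf"],
        "thunder.wav": ["thunder", "storm", "thunder", "explosion", "thunder"],
    }

    for clip, labels in clips.items():
        print(clip, round(label_agreement(labels), 2))

Under this toy measure, a clip whose sound reliably evokes a single concept (e.g., a barking dog) scores close to 1.0, while clips that elicit scattered, less discriminable labels score lower; a WordNet-style resource could be used beforehand to merge semantically compatible responses.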
