Vector-based Representation and Clustering of Audio Using Onomatopoeia Words

We present results on organizing audio data based on descriptions that use onomatopoeia words. Onomatopoeia words imitate sounds, and they directly describe and represent different types of sound sources through their perceived properties. For instance, the word "pop" aptly describes the sound of opening a champagne bottle. We first establish this type of audio-to-word relationship by manually tagging a variety of audio clips from a sound effects library with onomatopoeia words. Using principal component analysis (PCA) and a newly proposed distance metric for word-level clustering, we then cluster the audio data representing the clips. Because of the distance metric and the audio-to-word relationship, the clips in each resulting cluster have similar acoustic properties. We found that, as language-level units, the onomatopoeic descriptions are able to represent perceived properties of audio signals. We believe that this form of description can be useful in relating higher-level descriptions of events in a scene by providing an intermediate perceptual understanding of the acoustic event.
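As a rough illustration of the pipeline summarized above (tag clips with words, project word vectors with PCA, then cluster clips through a word-level distance), the following minimal Python sketch may help. Everything in it is an assumption made for illustration: the word feature vectors and clip names are invented toy data, the mean pairwise Euclidean distance between tag vectors is only a stand-in for the proposed distance metric (which is not defined in this summary), and the threshold-based linkage is a stand-in for the actual clustering procedure.

```python
import numpy as np

# Hypothetical toy data: made-up feature vectors for a few onomatopoeia words.
word_features = {
    "pop":  [1.0, 0.1, 0.0, 0.9],
    "bang": [0.9, 0.2, 0.1, 1.0],
    "buzz": [0.1, 1.0, 0.8, 0.2],
    "hum":  [0.0, 0.9, 1.0, 0.1],
}

# Hypothetical manual tags: each audio clip is described by one or more words.
clip_tags = {
    "champagne_cork": ["pop"],
    "door_slam": ["bang", "pop"],
    "bee_swarm": ["buzz", "hum"],
    "fridge": ["hum"],
}


def pca_project(X, n_components):
    """Project the rows of X onto the top principal components (via SVD)."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T


def clip_distance(tags_a, tags_b, vectors):
    """Stand-in metric (assumption): mean pairwise Euclidean distance
    between the word vectors of two clips' tags."""
    dists = [np.linalg.norm(vectors[a] - vectors[b])
             for a in tags_a for b in tags_b]
    return float(np.mean(dists))


def cluster_clips(tags, vectors, threshold):
    """Link any two clips closer than `threshold`; return connected groups."""
    names = sorted(tags)
    parent = {n: n for n in names}

    def find(n):
        while parent[n] != n:
            n = parent[n]
        return n

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if clip_distance(tags[a], tags[b], vectors) < threshold:
                parent[find(a)] = find(b)

    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return sorted(groups.values())


words = sorted(word_features)
X = np.array([word_features[w] for w in words])
Z = pca_project(X, n_components=2)       # word vectors in PCA space
vectors = dict(zip(words, Z))
clusters = cluster_clips(clip_tags, vectors, threshold=1.0)
print(clusters)
```

With this toy data, the impulsive clips (tagged "pop" and "bang") and the sustained clips (tagged "buzz" and "hum") fall into separate groups, which mirrors the idea that clips sharing similar onomatopoeic descriptions end up clustered together.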