Crossmodal Sound Retrieval Based on Specific Target Co-Occurrence Denoted with Weak Labels

Recent advances in representation learning enable crossmodal retrieval by modeling audio-visual co-occurrence in a single aspect, such as the physical or linguistic aspect. Unfortunately, in real-world media data, co-occurrences of various aspects are complexly mixed, making it difficult to distinguish a specific target co-occurrence from the many non-target ones and causing crossmodal retrieval to fail. To overcome this problem, we propose a triplet-loss-based representation learning method that incorporates an awareness mechanism. We adopt weakly supervised event detection, which constrains representation learning so that the model can "be aware" of the specific target audio-visual co-occurrence and discriminate it from non-target co-occurrences. We evaluated our method on a sound effect retrieval task using recorded TV broadcast data, in which a sound effect appropriate for a given video input must be retrieved. Objective and subjective evaluations indicate that the proposed method associates sound effects with video significantly better than baselines without the awareness mechanism.
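The abstract describes the objective only at a high level: a triplet loss over audio-visual embeddings combined with a weakly supervised event-detection term that supplies the "awareness" constraint. The sketch below is a minimal illustration of that combination, not the paper's exact formulation; the function name, tensor shapes, cosine-similarity choice, and the loss weight `aux_weight` are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def awareness_triplet_loss(v_emb, a_pos_emb, a_neg_emb,
                           event_logits, weak_labels,
                           margin=0.2, aux_weight=1.0):
    """Hypothetical sketch: triplet loss on audio-visual embeddings plus a
    weakly supervised event-detection term that constrains the shared space.

    v_emb        : (B, D) visual embeddings (anchors)
    a_pos_emb    : (B, D) embeddings of co-occurring (positive) audio
    a_neg_emb    : (B, D) embeddings of non-co-occurring (negative) audio
    event_logits : (B, C) clip-level logits from an event-detection head
    weak_labels  : (B, C) multi-hot weak labels for the target event classes
    """
    # Triplet term: pull target co-occurring pairs together and push
    # non-target pairs apart by at least `margin` in cosine similarity.
    sim_pos = F.cosine_similarity(v_emb, a_pos_emb, dim=-1)
    sim_neg = F.cosine_similarity(v_emb, a_neg_emb, dim=-1)
    triplet = F.relu(margin - sim_pos + sim_neg).mean()

    # Weak-label term: clip-level multi-label classification that makes the
    # representation "aware" of the specific target event.
    aux = F.binary_cross_entropy_with_logits(event_logits, weak_labels)

    return triplet + aux_weight * aux
```

In this reading, the weak-label term acts as an auxiliary constraint on the same encoder outputs, so embeddings that ignore the target event are penalized even when they satisfy the triplet margin; how the two terms are actually balanced is left to the full paper.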
