Cross-modal supervised learning for better acoustic representations

Obtaining large-scale human-labeled datasets to train acoustic representation models is very challenging. In contrast, data with machine-generated labels can be collected easily. In this work, we propose to exploit machine-generated labels to learn better acoustic representations, based on the synchronization between vision and audio. First, we collect a large-scale video dataset of 15 million samples, totaling 16,320 hours. Each video is 3 to 5 seconds long and is annotated automatically by publicly available visual and audio classification models. Second, we train several classical convolutional neural networks (CNNs), including VGGish, ResNet-50, and MobileNetV2. We also make several improvements to VGGish that yield better results. Finally, we transfer our models to three external standard audio classification benchmarks and achieve significant performance gains over the state-of-the-art results. Models and code are available at: this https URL.
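
The following is a minimal sketch of the two-stage recipe the abstract describes: pretrain an audio CNN on machine-generated labels, then fine-tune it on a small human-labeled benchmark. It is not the authors' released code; the data loaders (`machine_labeled_loader`, `esc50_loader`), the 527-class pretraining label space, and the specific hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torchaudio
import torchvision

# 1) Waveform -> log-mel spectrogram, the standard input for VGGish-style audio CNNs.
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_fft=400, hop_length=160, n_mels=64)

def log_mel(waveform):
    # (batch, samples) -> (batch, 1, n_mels, frames), log-compressed
    return torch.log(mel(waveform) + 1e-6).unsqueeze(1)

# 2) Pretraining: an ImageNet-style CNN adapted to 1-channel spectrograms,
#    trained against machine-generated targets. 527 classes is an assumption
#    (AudioSet-sized label space), not the paper's exact setup.
model = torchvision.models.mobilenet_v2(num_classes=527)
model.features[0][0] = nn.Conv2d(1, 32, kernel_size=3, stride=2,
                                 padding=1, bias=False)
pretrain_loss = nn.BCEWithLogitsLoss()  # machine labels are typically multi-label

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
for waveform, machine_labels in machine_labeled_loader:  # hypothetical loader
    optimizer.zero_grad()
    loss = pretrain_loss(model(log_mel(waveform)), machine_labels)
    loss.backward()
    optimizer.step()

# 3) Transfer: swap the classifier head and fine-tune on a downstream
#    benchmark, e.g. the 50 classes of ESC-50.
model.classifier[1] = nn.Linear(model.last_channel, 50)
finetune_loss = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
for waveform, target in esc50_loader:  # hypothetical loader
    optimizer.zero_grad()
    loss = finetune_loss(model(log_mel(waveform)), target)
    loss.backward()
    optimizer.step()
```

The same skeleton applies to the other backbones the abstract mentions (VGGish, ResNet-50): only the backbone construction and the first-layer channel adaptation change.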
