Automatic Curation of Large-Scale Datasets for Audio-Visual Representation Learning

Large-scale datasets are the cornerstone of selfsupervised representation learning. Existing algorithms extract learning signals by making certain assumptions about the data, e.g., spatio-temporal continuity and multimodal correspondence. Unfortunately, finding a large amount of data that satisfies such assumptions is sometimes not straightforward. This restricts the community to rely on datasets that require laborious annotation and/or manual filtering processes. In this paper, we describe a subset optimization approach for automatic dataset curation. Focusing on the scenario of audio-visual representation learning, we pose the problem as finding a subset that maximizes the mutual information between audio and visual channels in videos. We demonstrate that our approach finds videos with high audio-visual correspondence and show that selfsupervised models trained on our data, despite being automatically constructed, achieve similar downstream performances to existing video datasets with similar scales. The most significant benefit of our approach is scalability. We release the largest video dataset for audio-visual research collected automatically using our approach.

[1]  Li Fei-Fei,et al.  ImageNet: A large-scale hierarchical image database , 2009, CVPR.

[2]  Kaiming He,et al.  Momentum Contrast for Unsupervised Visual Representation Learning , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[3]  Bernard Ghanem,et al.  Self-Supervised Learning by Cross-Modal Audio-Video Clustering , 2019, NeurIPS.

[4]  Bolei Zhou,et al.  Moments in Time Dataset: One Million Videos for Event Understanding , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[5]  Andrew Zisserman,et al.  Vggsound: A Large-Scale Audio-Visual Dataset , 2020, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[6]  Lorenzo Torresani,et al.  Cooperative Learning of Audio and Video Models from Self-Supervised Synchronization , 2018, NeurIPS.

[7]  Fabio Viola,et al.  The Kinetics Human Action Video Dataset , 2017, ArXiv.

[8]  M. Meilă Comparing clusterings---an information based distance , 2007 .

[9]  Yoshua Bengio,et al.  How transferable are features in deep neural networks? , 2014, NIPS.

[10]  Matthijs Douze,et al.  FastText.zip: Compressing text classification models , 2016, ArXiv.

[11]  Ben Taskar,et al.  Determinantal Point Processes for Machine Learning , 2012, Found. Trends Mach. Learn..

[12]  Allan Jabri,et al.  Space-Time Correspondence as a Contrastive Random Walk , 2020, NeurIPS.

[13]  A. Atkinson Subset Selection in Regression , 1992 .

[14]  Bernard Ghanem,et al.  ActivityNet: A large-scale video benchmark for human activity understanding , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[15]  Mei Wang,et al.  Deep Visual Domain Adaptation: A Survey , 2018, Neurocomputing.

[16]  Aren Jansen,et al.  Audio Set: An ontology and human-labeled dataset for audio events , 2017, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[17]  Alex Krizhevsky,et al.  Learning Multiple Layers of Features from Tiny Images , 2009 .

[18]  Mubarak Shah,et al.  UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild , 2012, ArXiv.

[19]  David A. Ross,et al.  Learning Video Representations from Textual Web Supervision , 2020, ArXiv.

[20]  Antonio Torralba,et al.  SoundNet: Learning Sound Representations from Unlabeled Video , 2016, NIPS.

[21]  Philip S. Yu,et al.  Top 10 algorithms in data mining , 2007, Knowledge and Information Systems.

[22]  Andreas Krause,et al.  Near-optimal Batch Mode Active Learning and Adaptive Submodular Optimization , 2013, ICML.

[23]  Nitish Srivastava,et al.  Improving neural networks by preventing co-adaptation of feature detectors , 2012, ArXiv.

[24]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[25]  Karol J. Piczak ESC: Dataset for Environmental Sound Classification , 2015, ACM Multimedia.

[26]  Serge J. Belongie,et al.  Spatiotemporal Contrastive Video Representation Learning , 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[27]  Yan Li,et al.  Estimation of Mutual Information: A Survey , 2009, RSKT.

[28]  Jitendra Malik,et al.  SlowFast Networks for Video Recognition , 2018, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[29]  Jeff A. Bilmes,et al.  Submodular subset selection for large-scale speech training data , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[30]  Burr Settles,et al.  Active Learning Literature Survey , 2009 .

[31]  Ivor W. Tsang,et al.  Core Vector Machines: Fast SVM Training on Very Large Data Sets , 2005, J. Mach. Learn. Res..

[32]  Andrew Owens,et al.  Visually Indicated Sounds , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[33]  Kristen Grauman,et al.  2.5D Visual Sound , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[34]  M. L. Fisher,et al.  An analysis of approximations for maximizing submodular set functions—I , 1978, Math. Program..

[35]  Karl Pearson F.R.S. LIII. On lines and planes of closest fit to systems of points in space , 1901 .

[36]  Yoshua Bengio,et al.  Gradient-based learning applied to document recognition , 1998, Proc. IEEE.

[37]  S. P. Lloyd,et al.  Least squares quantization in PCM , 1982, IEEE Trans. Inf. Theory.

[38]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[39]  Yoshua Bengio,et al.  Convergence Properties of the K-Means Algorithms , 1994, NIPS.

[40]  Jeff A. Bilmes,et al.  Unsupervised submodular subset selection for speech data , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[41]  Yoshua Bengio,et al.  Learning deep representations by mutual information estimation and maximization , 2018, ICLR.

[42]  Jeff A. Bilmes,et al.  Using Document Summarization Techniques for Speech Data Subset Selection , 2013, NAACL.

[43]  Arkadiusz Stopczynski,et al.  Ava Active Speaker: An Audio-Visual Dataset for Active Speaker Detection , 2019, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[44]  Alexei A. Efros,et al.  Unsupervised Visual Representation Learning by Context Prediction , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[45]  Tomas Mikolov,et al.  Bag of Tricks for Efficient Text Classification , 2016, EACL.

[46]  J. R. Landis,et al.  The measurement of observer agreement for categorical data. , 1977, Biometrics.

[47]  Oriol Vinyals,et al.  Representation Learning with Contrastive Predictive Coding , 2018, ArXiv.

[48]  Laurens van der Maaten,et al.  Self-Supervised Learning of Pretext-Invariant Representations , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[49]  Geoffrey E. Hinton,et al.  A Simple Framework for Contrastive Learning of Visual Representations , 2020, ICML.

[50]  J. Fleiss Measuring nominal scale agreement among many raters. , 1971 .

[51]  Rishabh K. Iyer,et al.  Submodularity in Data Subset Selection and Active Learning , 2015, ICML.

[52]  Andrew Zisserman,et al.  Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[53]  Frank Hutter,et al.  Decoupled Weight Decay Regularization , 2017, ICLR.

[54]  Alexander A. Alemi,et al.  On Variational Bounds of Mutual Information , 2019, ICML.

[55]  Michel Minoux,et al.  Accelerated greedy algorithms for maximizing submodular set functions , 1978 .

[56]  Chenliang Xu,et al.  Audio-Visual Event Localization in Unconstrained Videos , 2018, ECCV.

[57]  Kasturi R. Varadarajan,et al.  Geometric Approximation via Coresets , 2007 .

[58]  Richard P. Wildes,et al.  Spatiotemporal Residual Networks for Video Action Recognition , 2016, NIPS.

[59]  Sariel Har-Peled,et al.  On coresets for k-means and k-median clustering , 2004, STOC '04.

[60]  Geoffrey E. Hinton,et al.  Rectified Linear Units Improve Restricted Boltzmann Machines , 2010, ICML.

[61]  Allan Jabri,et al.  Learning Correspondence From the Cycle-Consistency of Time , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[62]  James Bailey,et al.  Information Theoretic Measures for Clusterings Comparison: Variants, Properties, Normalization and Correction for Chance , 2010, J. Mach. Learn. Res..

[63]  Apostol Natsev,et al.  YouTube-8M: A Large-Scale Video Classification Benchmark , 2016, ArXiv.

[64]  Xin Li,et al.  Adaptive Active Learning for Image Classification , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[65]  Geoffrey Zweig,et al.  Multi-modal Self-Supervision from Generalized Data Transformations , 2020, ArXiv.

[66]  Justin Salamon,et al.  Telling Left From Right: Learning Spatial Correspondence of Sight and Sound , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[67]  Bolei Zhou,et al.  Interpreting Deep Visual Representations via Network Dissection , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[68]  Yuhong Guo,et al.  Active Instance Sampling via Matrix Partition , 2010, NIPS.

[69]  Aren Jansen,et al.  CNN architectures for large-scale audio classification , 2016, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[70]  Andrew Zisserman,et al.  Look, Listen and Learn , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[71]  Ivan Laptev,et al.  HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[72]  Joon Son Chung,et al.  Voxceleb: Large-scale speaker verification in the wild , 2020, Comput. Speech Lang..

[73]  D. Sculley,et al.  Web-scale k-means clustering , 2010, WWW '10.

[74]  Yusuke Shinohara A submodular optimization approach to sentence set selection , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[75]  Andrew Zisserman,et al.  A Short Note on the Kinetics-700 Human Action Dataset , 2019, ArXiv.

[76]  Aapo Hyvärinen,et al.  Noise-contrastive estimation: A new estimation principle for unnormalized statistical models , 2010, AISTATS.

[77]  Mihalis Yannakakis,et al.  How easy is local search? , 1985, 26th Annual Symposium on Foundations of Computer Science (sfcs 1985).

[78]  Daniel P. W. Ellis,et al.  AVA-Speech: A Densely Labeled Dataset of Speech Activity in Movies , 2018, INTERSPEECH.

[79]  Murat Akçakaya,et al.  Classification Active Learning Based on Mutual Information , 2016, Entropy.

[80]  William A. Gale,et al.  A sequential algorithm for training text classifiers , 1994, SIGIR '94.