ACAV100M: Automatic Curation of Large-Scale Datasets for Audio-Visual Video Representation Learning
Thomas Breuel | Gunhee Kim | Gal Chechik | Youngjae Yu | Jiwan Chung | Yale Song | Sangho Lee
[1] Ivan Laptev,et al. HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[2] Li Fei-Fei,et al. ImageNet: A large-scale hierarchical image database , 2009, CVPR.
[3] Jitendra Malik,et al. SlowFast Networks for Video Recognition , 2018, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[4] Trevor Darrell,et al. Probabalistic Models and Informative Subspaces for Audiovisual Correspondence , 2002, ECCV.
[5] Joon Son Chung,et al. VoxCeleb: Large-scale speaker verification in the wild , 2020, Comput. Speech Lang.
[6] D. Sculley,et al. Web-scale k-means clustering , 2010, WWW '10.
[7] Jian Sun,et al. Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[8] Karl Pearson. LIII. On lines and planes of closest fit to systems of points in space , 1901 .
[9] Xin Li,et al. Adaptive Active Learning for Image Classification , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.
[10] Tomas Mikolov,et al. Bag of Tricks for Efficient Text Classification , 2016, EACL.
[11] Jeff A. Bilmes,et al. Unsupervised submodular subset selection for speech data , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[12] Yusuke Shinohara. A submodular optimization approach to sentence set selection , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[13] Mei Wang,et al. Deep Visual Domain Adaptation: A Survey , 2018, Neurocomputing.
[14] Andreas Krause,et al. Near-optimal Batch Mode Active Learning and Adaptive Submodular Optimization , 2013, ICML.
[15] Aapo Hyvärinen,et al. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models , 2010, AISTATS.
[16] Frank Hutter,et al. Decoupled Weight Decay Regularization , 2017, ICLR.
[17] Mihalis Yannakakis,et al. How easy is local search? , 1985, 26th Annual Symposium on Foundations of Computer Science (sfcs 1985).
[18] A. Kraskov,et al. Estimating mutual information , 2003, Physical review. E, Statistical, nonlinear, and soft matter physics.
[19] S. P. Lloyd,et al. Least squares quantization in PCM , 1982, IEEE Trans. Inf. Theory.
[20] Kasturi R. Varadarajan,et al. Geometric Approximation via Coresets , 2007 .
[21] Philip S. Yu,et al. Top 10 algorithms in data mining , 2007, Knowledge and Information Systems.
[22] Andrew Zisserman,et al. A Short Note on the Kinetics-700 Human Action Dataset , 2019, ArXiv.
[23] Richard P. Wildes,et al. Spatiotemporal Residual Networks for Video Action Recognition , 2016, NIPS.
[24] Yoshua Bengio,et al. Learning deep representations by mutual information estimation and maximization , 2018, ICLR.
[25] Kristen Grauman,et al. 2.5D Visual Sound , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[26] M. L. Fisher,et al. An analysis of approximations for maximizing submodular set functions—I , 1978, Math. Program.
[27] Jimmy Ba,et al. Adam: A Method for Stochastic Optimization , 2014, ICLR.
[28] Yoshua Bengio,et al. Convergence Properties of the K-Means Algorithms , 1994, NIPS.
[29] Michel Minoux,et al. Accelerated greedy algorithms for maximizing submodular set functions , 1978 .
[30] James Bailey,et al. Information Theoretic Measures for Clusterings Comparison: Variants, Properties, Normalization and Correction for Chance , 2010, J. Mach. Learn. Res.
[31] Chenliang Xu,et al. Audio-Visual Event Localization in Unconstrained Videos , 2018, ECCV.
[32] Apostol Natsev,et al. YouTube-8M: A Large-Scale Video Classification Benchmark , 2016, ArXiv.
[33] Gabriela Csurka,et al. Deep Visual Domain Adaptation , 2020, 2020 22nd International Symposium on Symbolic and Numeric Algorithms for Scientific Computing (SYNASC).
[34] Nitish Srivastava,et al. Improving neural networks by preventing co-adaptation of feature detectors , 2012, ArXiv.
[35] Mubarak Shah,et al. UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild , 2012, ArXiv.
[36] David A. Ross,et al. Learning Video Representations from Textual Web Supervision , 2020, ArXiv.
[37] Burr Settles,et al. Active Learning Literature Survey , 2009 .
[38] Sariel Har-Peled,et al. On coresets for k-means and k-median clustering , 2004, STOC '04.
[39] Geoffrey E. Hinton,et al. Rectified Linear Units Improve Restricted Boltzmann Machines , 2010, ICML.
[40] Yoshua Bengio,et al. Gradient-based learning applied to document recognition , 1998, Proc. IEEE.
[41] Karol J. Piczak. ESC: Dataset for Environmental Sound Classification , 2015, ACM Multimedia.
[42] Jeff A. Bilmes,et al. Using Document Summarization Techniques for Speech Data Subset Selection , 2013, NAACL.
[43] Arkadiusz Stopczynski,et al. AVA Active Speaker: An Audio-Visual Dataset for Active Speaker Detection , 2019, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[44] Kaiming He,et al. Momentum Contrast for Unsupervised Visual Representation Learning , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[45] Matthijs Douze,et al. FastText.zip: Compressing text classification models , 2016, ArXiv.
[46] Jinjun Xiong,et al. Automatic Curation of Sports Highlights Using Multimodal Excitement Features , 2019, IEEE Transactions on Multimedia.
[47] J. R. Landis,et al. The measurement of observer agreement for categorical data , 1977, Biometrics.
[48] Oriol Vinyals,et al. Representation Learning with Contrastive Predictive Coding , 2018, ArXiv.
[49] Geoffrey E. Hinton,et al. A Simple Framework for Contrastive Learning of Visual Representations , 2020, ICML.
[50] Aren Jansen,et al. Audio Set: An ontology and human-labeled dataset for audio events , 2017, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[51] Andrew Owens,et al. Visually Indicated Sounds , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[52] Bernard Ghanem,et al. ActivityNet: A large-scale video benchmark for human activity understanding , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[53] Jeff A. Bilmes,et al. Submodular subset selection for large-scale speech training data , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[54] Ivor W. Tsang,et al. Core Vector Machines: Fast SVM Training on Very Large Data Sets , 2005, J. Mach. Learn. Res.
[55] Andrew Zisserman,et al. VGGSound: A Large-Scale Audio-Visual Dataset , 2020, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[56] Lorenzo Torresani,et al. Cooperative Learning of Audio and Video Models from Self-Supervised Synchronization , 2018, NeurIPS.
[57] Fabio Viola,et al. The Kinetics Human Action Video Dataset , 2017, ArXiv.
[58] Liam Paninski,et al. Estimation of Entropy and Mutual Information , 2003, Neural Computation.
[59] Yoshua Bengio,et al. How transferable are features in deep neural networks? , 2014, NIPS.
[60] Alex Krizhevsky,et al. Learning Multiple Layers of Features from Tiny Images , 2009 .
[61] Antonio Torralba,et al. SoundNet: Learning Sound Representations from Unlabeled Video , 2016, NIPS.
[62] J. Fleiss. Measuring nominal scale agreement among many raters , 1971 .
[63] Rishabh K. Iyer,et al. Submodularity in Data Subset Selection and Active Learning , 2015, ICML.
[64] Andrew Zisserman,et al. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[65] Daniel P. W. Ellis,et al. AVA-Speech: A Densely Labeled Dataset of Speech Activity in Movies , 2018, INTERSPEECH.
[66] Murat Akçakaya,et al. Classification Active Learning Based on Mutual Information , 2016, Entropy.
[67] William A. Gale,et al. A sequential algorithm for training text classifiers , 1994, SIGIR '94.
[68] Geoffrey Zweig,et al. Multi-modal Self-Supervision from Generalized Data Transformations , 2020, ArXiv.
[69] Justin Salamon,et al. Telling Left From Right: Learning Spatial Correspondence of Sight and Sound , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[70] Bolei Zhou,et al. Interpreting Deep Visual Representations via Network Dissection , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.
[71] Yuhong Guo,et al. Active Instance Sampling via Matrix Partition , 2010, NIPS.
[72] Aren Jansen,et al. CNN architectures for large-scale audio classification , 2016, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[73] Andrew Zisserman,et al. Look, Listen and Learn , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).
[74] Bolei Zhou,et al. Moments in Time Dataset: One Million Videos for Event Understanding , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.
[75] Yan Li,et al. Estimation of Mutual Information: A Survey , 2009, RSKT.
[76] Bernard Ghanem,et al. Self-Supervised Learning by Cross-Modal Audio-Video Clustering , 2019, NeurIPS.
[77] M. Meilă. Comparing clusterings---an information based distance , 2007 .
[78] E. Gibson. Principles of Perceptual Learning and Development , 1969 .