暂无分享,去创建一个
[1] Li Fei-Fei,et al. ImageNet: A large-scale hierarchical image database , 2009, CVPR.
[2] Kaiming He,et al. Momentum Contrast for Unsupervised Visual Representation Learning , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[3] Bernard Ghanem,et al. Self-Supervised Learning by Cross-Modal Audio-Video Clustering , 2019, NeurIPS.
[4] Bolei Zhou,et al. Moments in Time Dataset: One Million Videos for Event Understanding , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.
[5] Andrew Zisserman,et al. Vggsound: A Large-Scale Audio-Visual Dataset , 2020, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[6] Lorenzo Torresani,et al. Cooperative Learning of Audio and Video Models from Self-Supervised Synchronization , 2018, NeurIPS.
[7] Fabio Viola,et al. The Kinetics Human Action Video Dataset , 2017, ArXiv.
[8] M. Meilă. Comparing clusterings---an information based distance , 2007 .
[9] Yoshua Bengio,et al. How transferable are features in deep neural networks? , 2014, NIPS.
[10] Matthijs Douze,et al. FastText.zip: Compressing text classification models , 2016, ArXiv.
[11] Ben Taskar,et al. Determinantal Point Processes for Machine Learning , 2012, Found. Trends Mach. Learn..
[12] Allan Jabri,et al. Space-Time Correspondence as a Contrastive Random Walk , 2020, NeurIPS.
[13] A. Atkinson. Subset Selection in Regression , 1992 .
[14] Bernard Ghanem,et al. ActivityNet: A large-scale video benchmark for human activity understanding , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[15] Mei Wang,et al. Deep Visual Domain Adaptation: A Survey , 2018, Neurocomputing.
[16] Aren Jansen,et al. Audio Set: An ontology and human-labeled dataset for audio events , 2017, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[17] Alex Krizhevsky,et al. Learning Multiple Layers of Features from Tiny Images , 2009 .
[18] Mubarak Shah,et al. UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild , 2012, ArXiv.
[19] David A. Ross,et al. Learning Video Representations from Textual Web Supervision , 2020, ArXiv.
[20] Antonio Torralba,et al. SoundNet: Learning Sound Representations from Unlabeled Video , 2016, NIPS.
[21] Philip S. Yu,et al. Top 10 algorithms in data mining , 2007, Knowledge and Information Systems.
[22] Andreas Krause,et al. Near-optimal Batch Mode Active Learning and Adaptive Submodular Optimization , 2013, ICML.
[23] Nitish Srivastava,et al. Improving neural networks by preventing co-adaptation of feature detectors , 2012, ArXiv.
[24] Jian Sun,et al. Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[25] Karol J. Piczak. ESC: Dataset for Environmental Sound Classification , 2015, ACM Multimedia.
[26] Serge J. Belongie,et al. Spatiotemporal Contrastive Video Representation Learning , 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[27] Yan Li,et al. Estimation of Mutual Information: A Survey , 2009, RSKT.
[28] Jitendra Malik,et al. SlowFast Networks for Video Recognition , 2018, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[29] Jeff A. Bilmes,et al. Submodular subset selection for large-scale speech training data , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[30] Burr Settles,et al. Active Learning Literature Survey , 2009 .
[31] Ivor W. Tsang,et al. Core Vector Machines: Fast SVM Training on Very Large Data Sets , 2005, J. Mach. Learn. Res..
[32] Andrew Owens,et al. Visually Indicated Sounds , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[33] Kristen Grauman,et al. 2.5D Visual Sound , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[34] M. L. Fisher,et al. An analysis of approximations for maximizing submodular set functions—I , 1978, Math. Program..
[35] Karl Pearson F.R.S.. LIII. On lines and planes of closest fit to systems of points in space , 1901 .
[36] Yoshua Bengio,et al. Gradient-based learning applied to document recognition , 1998, Proc. IEEE.
[37] S. P. Lloyd,et al. Least squares quantization in PCM , 1982, IEEE Trans. Inf. Theory.
[38] Jimmy Ba,et al. Adam: A Method for Stochastic Optimization , 2014, ICLR.
[39] Yoshua Bengio,et al. Convergence Properties of the K-Means Algorithms , 1994, NIPS.
[40] Jeff A. Bilmes,et al. Unsupervised submodular subset selection for speech data , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[41] Yoshua Bengio,et al. Learning deep representations by mutual information estimation and maximization , 2018, ICLR.
[42] Jeff A. Bilmes,et al. Using Document Summarization Techniques for Speech Data Subset Selection , 2013, NAACL.
[43] Arkadiusz Stopczynski,et al. Ava Active Speaker: An Audio-Visual Dataset for Active Speaker Detection , 2019, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[44] Alexei A. Efros,et al. Unsupervised Visual Representation Learning by Context Prediction , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).
[45] Tomas Mikolov,et al. Bag of Tricks for Efficient Text Classification , 2016, EACL.
[46] J. R. Landis,et al. The measurement of observer agreement for categorical data. , 1977, Biometrics.
[47] Oriol Vinyals,et al. Representation Learning with Contrastive Predictive Coding , 2018, ArXiv.
[48] Laurens van der Maaten,et al. Self-Supervised Learning of Pretext-Invariant Representations , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[49] Geoffrey E. Hinton,et al. A Simple Framework for Contrastive Learning of Visual Representations , 2020, ICML.
[50] J. Fleiss. Measuring nominal scale agreement among many raters. , 1971 .
[51] Rishabh K. Iyer,et al. Submodularity in Data Subset Selection and Active Learning , 2015, ICML.
[52] Andrew Zisserman,et al. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[53] Frank Hutter,et al. Decoupled Weight Decay Regularization , 2017, ICLR.
[54] Alexander A. Alemi,et al. On Variational Bounds of Mutual Information , 2019, ICML.
[55] Michel Minoux,et al. Accelerated greedy algorithms for maximizing submodular set functions , 1978 .
[56] Chenliang Xu,et al. Audio-Visual Event Localization in Unconstrained Videos , 2018, ECCV.
[57] Kasturi R. Varadarajan,et al. Geometric Approximation via Coresets , 2007 .
[58] Richard P. Wildes,et al. Spatiotemporal Residual Networks for Video Action Recognition , 2016, NIPS.
[59] Sariel Har-Peled,et al. On coresets for k-means and k-median clustering , 2004, STOC '04.
[60] Geoffrey E. Hinton,et al. Rectified Linear Units Improve Restricted Boltzmann Machines , 2010, ICML.
[61] Allan Jabri,et al. Learning Correspondence From the Cycle-Consistency of Time , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[62] James Bailey,et al. Information Theoretic Measures for Clusterings Comparison: Variants, Properties, Normalization and Correction for Chance , 2010, J. Mach. Learn. Res..
[63] Apostol Natsev,et al. YouTube-8M: A Large-Scale Video Classification Benchmark , 2016, ArXiv.
[64] Xin Li,et al. Adaptive Active Learning for Image Classification , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.
[65] Geoffrey Zweig,et al. Multi-modal Self-Supervision from Generalized Data Transformations , 2020, ArXiv.
[66] Justin Salamon,et al. Telling Left From Right: Learning Spatial Correspondence of Sight and Sound , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[67] Bolei Zhou,et al. Interpreting Deep Visual Representations via Network Dissection , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.
[68] Yuhong Guo,et al. Active Instance Sampling via Matrix Partition , 2010, NIPS.
[69] Aren Jansen,et al. CNN architectures for large-scale audio classification , 2016, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[70] Andrew Zisserman,et al. Look, Listen and Learn , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).
[71] Ivan Laptev,et al. HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[72] Joon Son Chung,et al. Voxceleb: Large-scale speaker verification in the wild , 2020, Comput. Speech Lang..
[73] D. Sculley,et al. Web-scale k-means clustering , 2010, WWW '10.
[74] Yusuke Shinohara. A submodular optimization approach to sentence set selection , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[75] Andrew Zisserman,et al. A Short Note on the Kinetics-700 Human Action Dataset , 2019, ArXiv.
[76] Aapo Hyvärinen,et al. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models , 2010, AISTATS.
[77] Mihalis Yannakakis,et al. How easy is local search? , 1985, 26th Annual Symposium on Foundations of Computer Science (sfcs 1985).
[78] Daniel P. W. Ellis,et al. AVA-Speech: A Densely Labeled Dataset of Speech Activity in Movies , 2018, INTERSPEECH.
[79] Murat Akçakaya,et al. Classification Active Learning Based on Mutual Information , 2016, Entropy.
[80] William A. Gale,et al. A sequential algorithm for training text classifiers , 1994, SIGIR '94.