Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language
Adrian S. Wong | Peter R. Florence | Vincent Vanhoucke | M. Ryoo | Vikas Sindhwani | Andy Zeng | K. Choromanski | Stefan Welker | Johnny Lee | Aveek Purohit | F. Tombari
[1] Dani Yogatama,et al. Language Models Can See: Plugging Visual Controls in Text Generation , 2022, ArXiv.
[2] Andrew M. Dai,et al. PaLM: Scaling Language Modeling with Pathways , 2022, J. Mach. Learn. Res..
[3] S. Levine,et al. Do As I Can, Not As I Say: Grounding Language in Robotic Affordances , 2022, CoRL.
[4] Xiansheng Hua,et al. Disentangled Representation Learning for Text-Video Retrieval , 2022, ArXiv.
[5] Li Dong,et al. CLIP Models are Few-Shot Learners: Empirical Studies on VQA and Visual Entailment , 2022, ACL.
[6] Ryan J. Lowe,et al. Training language models to follow instructions with human feedback , 2022, NeurIPS.
[7] Pratul P. Srinivasan,et al. Block-NeRF: Scalable Large Scene Neural View Synthesis , 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[8] Ankur Bapna,et al. mSLAM: Massively multilingual joint pre-training for speech and text , 2022, ArXiv.
[9] S. Hoi,et al. BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation , 2022, ICML.
[10] Dale Schuurmans,et al. Chain of Thought Prompting Elicits Reasoning in Large Language Models , 2022, ArXiv.
[11] Renelito Delos Santos,et al. LaMDA: Language Models for Dialog Applications , 2022, ArXiv.
[12] P. Abbeel,et al. Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents , 2022, ICML.
[13] Yejin Choi,et al. MERLOT RESERVE: Neural Script Knowledge through Vision and Language and Sound , 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[14] Yejin Choi,et al. Connecting the Dots between Audio and Text without Parallel Data through Visual Knowledge Transfer , 2021, NAACL.
[15] Lior Wolf,et al. ZeroCap: Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic , 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[16] Daniel Keysers,et al. LiT: Zero-Shot Transfer with Locked-image text Tuning , 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[17] J. Bello,et al. Wav2CLIP: Learning Robust Audio Representations from Clip , 2021, ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[18] James M. Rehg,et al. Ego4D: Around the World in 3,000 Hours of Egocentric Video , 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[19] M. Ryoo,et al. Hybrid Random Features , 2021, ICLR.
[20] Zhe Gan,et al. An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA , 2021, AAAI.
[21] Jong Wook Kim,et al. Robust fine-tuning of zero-shot models , 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[22] Ron Mokady,et al. ClipCap: CLIP Prefix for Image Captioning , 2021, ArXiv.
[23] Zijian Gao,et al. CLIP2TV: An Empirical Study on Transformer-based Methods for Video-Text Retrieval , 2021, ArXiv.
[24] Dmytro Okhonko,et al. VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding , 2021, EMNLP.
[25] Dieter Fox,et al. CLIPort: What and Where Pathways for Robotic Manipulation , 2021, CoRL.
[26] Jan Leike,et al. Recursively Summarizing Books with Human Feedback , 2021, ArXiv.
[27] Jason Baldridge,et al. MURAL: Multimodal, Multitask Retrieval Across Languages , 2021, ArXiv.
[28] Fan Yang,et al. Improving Video-Text Retrieval by Multi-Stream Corpus Alignment and Dual Softmax Loss , 2021, ArXiv.
[29] Adams Wei Yu,et al. SimVLM: Simple Visual Language Model Pretraining with Weak Supervision , 2021, ICLR.
[30] Michael S. Bernstein,et al. On the Opportunities and Risks of Foundation Models , 2021, ArXiv.
[31] Junnan Li,et al. Align before Fuse: Vision and Language Representation Learning with Momentum Distillation , 2021, NeurIPS.
[32] Wojciech Zaremba,et al. Evaluating Large Language Models Trained on Code , 2021, ArXiv.
[33] Oriol Vinyals,et al. Multimodal Few-Shot Learning with Frozen Language Models , 2021, NeurIPS.
[34] Pengfei Xiong,et al. CLIP2Video: Mastering Video-Text Retrieval via Image CLIP , 2021, ArXiv.
[35] Jonathan Tompson,et al. XIRL: Cross-embodiment Inverse Reinforcement Learning , 2021, CoRL.
[36] Zeynep Akata,et al. Audio Retrieval with Natural Language Queries , 2021, Interspeech.
[37] Yin Cui,et al. Open-vocabulary Object Detection via Vision and Language Knowledge Distillation , 2021, ICLR.
[38] Yann LeCun,et al. MDETR - Modulated Detection for End-to-End Multi-Modal Understanding , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).
[39] Nan Duan,et al. CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval , 2021, Neurocomputing.
[40] K. Grauman,et al. Ego-Exo: Transferring Visual Representations from Third-person to First-person Videos , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[41] Andrew Zisserman,et al. Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).
[42] Ilya Sutskever,et al. Learning Transferable Visual Models From Natural Language Supervision , 2021, ICML.
[43] Jes'us Andr'es Portillo-Quintero,et al. A Straightforward Framework For Video Retrieval Using CLIP , 2021, MCPR.
[44] Quoc V. Le,et al. Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision , 2021, ICML.
[45] Ioannis Patras,et al. Video Summarization Using Deep Neural Networks: A Survey , 2021, Proceedings of the IEEE.
[46] S. Gelly,et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale , 2020, ICLR.
[47] Florian Metze,et al. Support-set bottlenecks for video-text representation learning , 2020, ICLR.
[48] K. Choromanski,et al. Rethinking Attention with Performers , 2020, ICLR.
[49] David P. Kreil,et al. Hopfield Networks is All You Need , 2020, ICLR.
[50] Christopher Potts,et al. Concadia: Tackling image accessibility with context , 2021, ArXiv.
[51] Ronghang Hu,et al. Transformer is All You Need: Multimodal Multitask Learning with a Unified Transformer , 2021, ArXiv.
[52] Devshree Patel,et al. Recent Advances in Video Question Answering: A Review of Datasets and Methods , 2021, ICPR Workshops.
[53] Peter R. Florence,et al. Transporter Networks: Rearranging the Visual World for Robotic Manipulation , 2020, CoRL.
[54] Andrew Zisserman,et al. A Short Note on the Kinetics-700-2020 Human Action Dataset , 2020, ArXiv.
[55] Mohit Bansal,et al. What Is More Likely to Happen Next? Video-and-Language Future Event Prediction , 2020, EMNLP.
[56] Jeremy Kepner,et al. Survey of Machine Learning Accelerators , 2020, 2020 IEEE High Performance Extreme Computing Conference (HPEC).
[57] Hazel Doughty,et al. Rescaling Egocentric Vision , 2020, ArXiv.
[58] Mark Chen,et al. Language Models are Few-Shot Learners , 2020, NeurIPS.
[59] Andrew Zisserman,et al. Vggsound: A Large-Scale Audio-Visual Dataset , 2020, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[60] Alejandro Betancourt,et al. Egoshots, an ego-vision life-logging dataset and semantic fidelity metric to evaluate diversity in image captioning models , 2020, ICLR 2020.
[61] Andrew Zisserman,et al. End-to-End Learning of Visual Representations From Uncurated Instructional Videos , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[62] Russ Tedrake,et al. Self-Supervised Correspondence in Visuomotor Policy Learning , 2019, IEEE Robotics and Automation Letters.
[63] J. Uijlings,et al. The Open Images Dataset V4 , 2018, International Journal of Computer Vision.
[64] Ross B. Girshick,et al. Mask R-CNN , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).
[65] Sebastian Riedel,et al. Language Models as Knowledge Bases? , 2019, EMNLP.
[66] Dima Damen,et al. EPIC-Fusion: Audio-Visual Temporal Binding for Egocentric Action Recognition , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[67] Stefan Lee,et al. ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks , 2019, NeurIPS.
[68] Yang Liu,et al. Use What You Have: Video retrieval using representations from collaborative experts , 2019, BMVC.
[69] Omer Levy,et al. RoBERTa: A Robustly Optimized BERT Pretraining Approach , 2019, ArXiv.
[70] A. Rawat,et al. Sampled Softmax with Random Fourier Features , 2019, NeurIPS.
[71] Ankush Gupta,et al. Unsupervised Learning of Object Keypoints for Perception and Control , 2019, NeurIPS.
[72] Giovanni Maria Farinella,et al. What Would You Expect? Anticipating Egocentric Actions With Rolling-Unrolling LSTMs and Modality Attention , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[73] Baoyuan Wu,et al. Tencent ML-Images: A Large-Scale Multi-Label Image Database for Visual Representation Learning , 2019, IEEE Access.
[74] Xirong Li,et al. Dual Encoding for Zero-Example Video Retrieval , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[75] Peter Bailis,et al. To Index or Not to Index: Optimizing Exact Maximum Inner Product Search , 2017, 2019 IEEE 35th International Conference on Data Engineering (ICDE).
[76] Andy Zeng,et al. Learning Visual Affordances for Robotic Manipulation , 2019.
[77] Ming-Wei Chang,et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding , 2019, NAACL.
[78] James M. Rehg,et al. In the Eye of Beholder: Joint Learning of Gaze and Actions in First Person Video , 2018, ECCV.
[79] Gunhee Kim,et al. A Joint Sequence Fusion Model for Video Question Answering and Retrieval , 2018, ECCV.
[80] David J. Crandall,et al. Deepdiary: Lifelogging image captioning and summarization , 2018, J. Vis. Commun. Image Represent..
[81] Radu Soricut,et al. Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning , 2018, ACL.
[82] Percy Liang,et al. Know What You Don’t Know: Unanswerable Questions for SQuAD , 2018, ACL.
[83] Amit K. Roy-Chowdhury,et al. Learning Joint Embedding with Multimodal Cues for Cross-Modal Video-Text Retrieval , 2018, ICMR.
[84] Cordelia Schmid,et al. Charades-Ego: A Large-Scale Dataset of Paired Third and First Person Videos , 2018, ArXiv.
[85] Yi Luo,et al. All-optical machine learning using diffractive deep neural networks , 2018, Science.
[86] Dima Damen,et al. Scaling Egocentric Vision: The EPIC-KITCHENS Dataset , 2018, ArXiv.
[87] Luke S. Zettlemoyer,et al. Deep Contextualized Word Representations , 2018, NAACL.
[88] Xirong Li,et al. Predicting Visual Features From Text for Image and Video Caption Retrieval , 2017, IEEE Transactions on Multimedia.
[89] Shanxin Yuan,et al. First-Person Hand Action Benchmark with RGB-D Videos and 3D Hand Pose Annotations , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[90] Joo-Hwee Lim,et al. Summarization of Egocentric Videos: A Comprehensive Survey , 2017, IEEE Transactions on Human-Machine Systems.
[91] Nicholas Rhinehart,et al. First-Person Activity Forecasting with Online Inverse Reinforcement Learning , 2016, 2017 IEEE International Conference on Computer Vision (ICCV).
[92] Quoc V. Le,et al. Unsupervised Pretraining for Sequence to Sequence Learning , 2016, EMNLP.
[93] Fei-Fei Li,et al. Deep visual-semantic alignments for generating image descriptions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[94] Bolei Zhou,et al. Places: An Image Database for Deep Scene Understanding , 2016, ArXiv.
[95] Kris M. Kitani,et al. Going Deeper into First-Person Activity Recognition , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[96] E Chandra,et al. A Review on Automatic Speech Recognition Architecture and Approaches , 2016.
[97] Tao Mei,et al. MSR-VTT: A Large Video Description Dataset for Bridging Video and Language , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[98] Jian Sun,et al. Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[99] Antonio Torralba,et al. Anticipating Visual Representations from Unlabeled Video , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[100] Stefan Lee,et al. Lending A Hand: Detecting Hands and Recognizing Activities in Complex Egocentric Interactions , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).
[101] Quoc V. Le,et al. Semi-supervised Sequence Learning , 2015, NIPS.
[102] Xinlei Chen,et al. Microsoft COCO Captions: Data Collection and Evaluation Server , 2015, ArXiv.
[103] Larry H. Matthies,et al. Pooled motion features for first-person videos , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[104] C. Lawrence Zitnick,et al. CIDEr: Consensus-based image description evaluation , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[105] Yong Jae Lee,et al. Predicting Important Objects for Egocentric Video Summarization , 2015, International Journal of Computer Vision.
[106] Jeffrey Pennington,et al. GloVe: Global Vectors for Word Representation , 2014, EMNLP.
[107] Quoc V. Le,et al. Sequence to Sequence Learning with Neural Networks , 2014, NIPS.
[108] Ping Li,et al. Asymmetric LSH (ALSH) for Sublinear Time Maximum Inner Product Search (MIPS) , 2014, NIPS.
[109] Pietro Perona,et al. Microsoft COCO: Common Objects in Context , 2014, ECCV.
[110] Xiang Zhang,et al. OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks , 2013, ICLR.
[111] Rob Fergus,et al. Visualizing and Understanding Convolutional Networks , 2013, ECCV.
[112] Trevor Darrell,et al. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation , 2013, 2014 IEEE Conference on Computer Vision and Pattern Recognition.
[113] Trevor Darrell,et al. DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition , 2013, ICML.
[114] Jeffrey Dean,et al. Distributed Representations of Words and Phrases and their Compositionality , 2013, NIPS.
[115] Larry H. Matthies,et al. First-Person Activity Recognition: What Are They Doing to Me? , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.
[116] Cheng Li,et al. Pixel-Level Hand Detection in Ego-centric Videos , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.
[117] Reza Ebrahimpour,et al. Mixture of experts: a literature survey , 2014, Artificial Intelligence Review.
[118] Martial Hebert,et al. Activity Forecasting , 2012, ECCV.
[119] Deva Ramanan,et al. Detecting activities of daily living in first-person camera views , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.
[120] Fernando De la Torre,et al. Max-Margin Early Event Detectors , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.
[121] Yong Jae Lee,et al. Discovering important people and objects for egocentric video summarization , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.
[122] Yoshua Bengio,et al. Unsupervised and Transfer Learning Challenge: a Deep Learning Approach , 2011, ICML Unsupervised and Transfer Learning.
[123] Michael S. Ryoo,et al. Human activity prediction: Early recognition of ongoing activities from streaming videos , 2011, 2011 International Conference on Computer Vision.
[124] Ali Farhadi,et al. Understanding egocentric activities , 2011, 2011 International Conference on Computer Vision.
[125] Andrew W. Fitzgibbon,et al. KinectFusion: real-time 3D reconstruction and interaction using a moving depth camera , 2011, UIST.
[126] Juhan Nam,et al. Multimodal Deep Learning , 2011, ICML.
[127] Takahiro Okabe,et al. Fast unsupervised ego-action learning for first-person sports videos , 2011, CVPR 2011.
[128] Alexander Gruenstein,et al. Unsupervised Testing Strategies for ASR , 2011, INTERSPEECH.
[129] Fei-Fei Li,et al. ImageNet: A large-scale hierarchical image database , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.
[130] Martial Hebert,et al. Temporal segmentation and activity classification from first-person sensing , 2009, 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops.
[131] Yoshua Bengio,et al. Extracting and composing robust features with denoising autoencoders , 2008, ICML '08.
[132] Rajat Raina,et al. Self-taught learning: transfer learning from unlabeled data , 2007, ICML '07.
[133] Yoshua Bengio,et al. Greedy Layer-Wise Training of Deep Networks , 2006, NIPS.
[134] Yee Whye Teh,et al. A Fast Learning Algorithm for Deep Belief Nets , 2006, Neural Computation.
[135] Rich Caruana,et al. Multitask Learning , 1997, Machine Learning.
[136] Mauro Barbieri,et al. Video summarization: methods and landscape , 2003, SPIE ITCom.
[137] Sebastian Thrun,et al. Lifelong Learning Algorithms , 1998, Learning to Learn.
[138] Robert A. Jacobs,et al. Hierarchical Mixtures of Experts and the EM Algorithm , 1993, Neural Computation.