Multimodal Machine Learning: A Survey and Taxonomy
Tadas Baltrušaitis | Chaitanya Ahuja | Louis-Philippe Morency
[1] Mario Fritz,et al. Ask Your Neurons: A Neural-Based Approach to Answering Questions about Images , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).
[2] Geoffrey E. Hinton,et al. Deep Boltzmann Machines , 2009, AISTATS.
[3] Andrew Zisserman,et al. Convolutional Two-Stream Network Fusion for Video Action Recognition , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[4] Kåre Sjölander,et al. An HMM-based system for automatic segmentation and alignment of speech , 2003 .
[5] Heng Tao Shen,et al. Hashing for Similarity Search: A Survey , 2014, ArXiv.
[6] Andrew McCallum,et al. An Introduction to Conditional Random Fields for Relational Learning , 2007 .
[7] John R. Kender,et al. Alignment of Speech to Highly Imperfect Text Transcriptions , 2007, 2007 IEEE International Conference on Multimedia and Expo.
[8] Ruifan Li,et al. Deep correspondence restricted Boltzmann machine for cross-modal retrieval , 2015, Neurocomputing.
[9] Matthew R. Walter,et al. Listen, Attend, and Walk: Neural Mapping of Navigational Instructions to Action Sequences , 2015, AAAI.
[10] Trevor Darrell,et al. Natural Language Object Retrieval , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[11] Zhihong Zeng,et al. A Survey of Affect Recognition Methods: Audio, Visual, and Spontaneous Expressions , 2009, IEEE Trans. Pattern Anal. Mach. Intell..
[12] Keiichi Tokuda,et al. Text-to-visual speech synthesis based on parameter generation from HMM , 1998, Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP '98 (Cat. No.98CH36181).
[13] Kate Saenko,et al. Improving LSTM-based Video Description with Linguistic Knowledge Mined from Text , 2016, EMNLP.
[14] Tao Qin,et al. Global Ranking Using Continuous Conditional Random Fields , 2008, NIPS.
[15] Mohan S. Kankanhalli,et al. Multimodal fusion for multimedia analysis: a survey , 2010, Multimedia Systems.
[16] Pascal Vincent,et al. Representation Learning: A Review and New Perspectives , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.
[17] Hermann Ney,et al. HMM-Based Word Alignment in Statistical Translation , 1996, COLING.
[18] Stephen Clark,et al. Multi- and Cross-Modal Semantics Beyond Vision: Grounding in Auditory Perception , 2015, EMNLP.
[19] Léon Bottou,et al. Learning Image Embeddings using Convolutional Neural Networks for Improved Multi-Modal Semantics , 2014, EMNLP.
[20] Lior Wolf,et al. Fisher Vectors Derived from Hybrid Gaussian-Laplacian Mixture Models for Image Annotation , 2014, ArXiv.
[21] Richard Sproat,et al. WordsEye: an automatic text-to-scene conversion system , 2001, SIGGRAPH.
[22] Jiebo Luo,et al. Unsupervised Alignment of Natural Language Instructions with Video Segments , 2014, AAAI.
[23] Thierry Pun,et al. Multimodal Emotion Recognition in Response to Videos , 2012, IEEE Transactions on Affective Computing.
[24] Peter Young,et al. Framing Image Description as a Ranking Task: Data, Models and Evaluation Metrics , 2013, J. Artif. Intell. Res..
[25] Piero Cosi,et al. Bimodal recognition experiments with recurrent neural networks , 1994, Proceedings of ICASSP '94. IEEE International Conference on Acoustics, Speech and Signal Processing.
[26] Christopher Joseph Pal,et al. Movie Description , 2016, International Journal of Computer Vision.
[27] Bernt Schiele,et al. The Long-Short Story of Movie Description , 2015, GCPR.
[28] Yejin Choi,et al. Collective Generation of Natural Image Descriptions , 2012, ACL.
[29] Gabriel Synnaeve,et al. Wav2Letter: an End-to-End ConvNet-based Speech Recognition System , 2016, ArXiv.
[30] Yale Song,et al. Multi-view latent variable discriminative models for action recognition , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.
[31] Jiasen Lu,et al. Hierarchical Question-Image Co-Attention for Visual Question Answering , 2016, NIPS.
[32] Cordelia Schmid,et al. Weakly Supervised Action Labeling in Videos under Ordering Constraints , 2014, ECCV.
[33] Kevin Murphy,et al. What’s Cookin’? Interpreting Cooking Videos using Text, Speech and Vision , 2015, NAACL.
[34] Geoffrey E. Hinton,et al. Zero-shot Learning with Semantic Output Codes , 2009, NIPS.
[35] Erik Wilde,et al. What are you talking about? , 2007, IEEE International Conference on Services Computing (SCC 2007).
[36] Moshe Mahler,et al. Dynamic units of visual speech , 2012, SCA '12.
[37] A. Murat Tekalp,et al. Audiovisual Synchronization and Fusion Using Canonical Correlation Analysis , 2007, IEEE Transactions on Multimedia.
[38] Wei Chen,et al. Jointly Modeling Deep Video and Compositional Text to Bridge Vision and Language in a Unified Framework , 2015, AAAI.
[39] Jiebo Luo,et al. Discriminative Unsupervised Alignment of Natural Language Instructions with Corresponding Video Segments , 2015, HLT-NAACL.
[40] David G. Lowe. Distinctive Image Features from Scale-Invariant Keypoints , 2004, International Journal of Computer Vision.
[41] Subhashini Venugopalan,et al. Translating Videos to Natural Language Using Deep Recurrent Neural Networks , 2014, NAACL.
[42] Rong Jin,et al. Multiple Kernel Learning for Visual Object Recognition: A Review , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.
[43] Honglak Lee,et al. Deep learning for robust feature generation in audiovisual emotion recognition , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.
[44] Jeffrey Dean,et al. Distributed Representations of Words and Phrases and their Compositionality , 2013, NIPS.
[45] Björn Stenger,et al. Expressive Visual Text-to-Speech Using Active Appearance Models , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.
[46] Alon Lavie,et al. Meteor Universal: Language Specific Translation Evaluation for Any Target Language , 2014, WMT@ACL.
[47] Belur V. Dasarathy,et al. Medical Image Fusion: A survey of the state of the art , 2013, Inf. Fusion.
[48] Trevor Darrell,et al. Hidden Conditional Random Fields , 2007, IEEE Transactions on Pattern Analysis and Machine Intelligence.
[49] Abdullah Al Mamun,et al. Unsupervised Alignment of Actions in Video with Text Descriptions , 2016, IJCAI.
[50] Björn W. Schuller,et al. LSTM-Modeling of continuous emotions in an audiovisual affect recognition framework , 2013, Image Vis. Comput..
[51] Jing Huang,et al. Audio-visual deep learning for noise robust speech recognition , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.
[52] Qin Jin,et al. Video Description Generation using Audio and Visual Cues , 2016, ICMR.
[53] Ruslan Salakhutdinov,et al. Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models , 2014, ArXiv.
[54] Christopher Joseph Pal,et al. EmoNets: Multimodal deep learning approaches for emotion recognition in video , 2015, Journal on Multimodal User Interfaces.
[55] John Shawe-Taylor,et al. Canonical Correlation Analysis: An Overview with Application to Learning Methods , 2004, Neural Computation.
[56] Jean Carletta,et al. The AMI Meeting Corpus: A Pre-announcement , 2005, MLMI.
[57] Samy Bengio,et al. Show and tell: A neural image caption generator , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[58] Christopher Joseph Pal,et al. Using Descriptive Video Services to Create a Large Data Source for Video Annotation Research , 2015, ArXiv.
[59] Sanja Fidler,et al. What Are You Talking About? Text-to-Image Coreference , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.
[60] Balaraman Ravindran,et al. Bridge Correlational Neural Networks for Multilingual Multimodal Representation Learning , 2015, NAACL.
[61] Marco Baroni,et al. Grounding Distributional Semantics in the Visual World , 2016, Lang. Linguistics Compass.
[62] Phil Blunsom,et al. Recurrent Continuous Translation Models , 2013, EMNLP.
[63] Alan Bundy,et al. Dynamic Time Warping , 1984 .
[64] Aykut Erdem,et al. A Distributed Representation Based Query Expansion Approach for Image Captioning , 2015, ACL.
[65] Fernando De la Torre,et al. Generalized time warping for multi-modal alignment of human motion , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.
[66] C. Lawrence Zitnick,et al. Bringing Semantics into Focus Using Visual Abstraction , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.
[67] Akane Sano,et al. Multi-task, Multi-Kernel Learning for Estimating Individual Wellbeing , 2015 .
[68] Meinard Müller,et al. Dynamic Time Warping , 2008 .
[69] Seong-Whan Lee,et al. Hierarchical feature representation and multimodal fusion with deep learning for AD/MCI diagnosis , 2014, NeuroImage.
[70] Frank Keller,et al. Comparing Automatic Evaluation Measures for Image Description , 2014, ACL.
[71] Gemma Boleda,et al. Distributional Semantics in Technicolor , 2012, ACL.
[72] Sven J. Dickinson,et al. Video In Sentences Out , 2012, UAI.
[73] Sergey Ioffe,et al. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift , 2015, ICML.
[74] Yejin Choi,et al. Baby talk: Understanding and generating simple image descriptions , 2011, CVPR 2011.
[75] Chen Yu,et al. On the Integration of Grounding Language and Learning Objects , 2004, AAAI.
[76] Malcolm Slaney,et al. FaceSync: A Linear Operator for Measuring Synchronization of Video Facial Images and Audio Tracks , 2000, NIPS.
[77] Alan L. Yuille,et al. Generation and Comprehension of Unambiguous Object Descriptions , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[78] Bernt Schiele,et al. Generative Adversarial Text to Image Synthesis , 2016, ICML.
[79] Avrim Blum,et al. Combining Labeled and Unlabeled Data with Co-Training , 1998, COLT.
[80] Bernt Schiele,et al. Grounding Action Descriptions in Videos , 2013, TACL.
[81] Jeff A. Bilmes,et al. On Deep Multi-View Representation Learning , 2015, ICML.
[82] Yoshua Bengio,et al. Attention-Based Models for Speech Recognition , 2015, NIPS.
[83] Christoph Bregler,et al. Video Rewrite: Driving Visual Speech with Audio , 1997, SIGGRAPH.
[84] Svetlana Lazebnik,et al. Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models , 2015, International Journal of Computer Vision.
[85] Björn W. Schuller,et al. Hidden Conditional Random Fields for Meeting Segmentation , 2007, 2007 IEEE International Conference on Multimedia and Expo.
[86] Philip S. Yu,et al. Deep Visual-Semantic Hashing for Cross-Modal Retrieval , 2016, KDD.
[87] Eugene Charniak,et al. Nonparametric Method for Data-driven Image Captioning , 2014, ACL.
[88] B.P. Yuhas,et al. Integration of acoustic and visual speech signals using neural networks , 1989, IEEE Communications Magazine.
[89] Björn W. Schuller,et al. Context-sensitive multimodal emotion recognition from speech and facial expression using bidirectional LSTM modeling , 2010, INTERSPEECH.
[90] Markus Kächele,et al. Multiple Classifier Systems for the Classification of Audio-Visual Emotional States , 2011, ACII.
[91] Paul A. Viola,et al. Unsupervised improvement of visual detectors using cotraining , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.
[92] Andrew Owens,et al. Visually Indicated Sounds , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[93] Biing-Hwang Juang,et al. Hidden Markov Models for Speech Recognition , 1991 .
[94] Trevor Darrell,et al. Co-Adaptation of audio-visual speech and gesture classifiers , 2006, ICMI '06.
[95] Yoshua Bengio,et al. Generative Adversarial Nets , 2014, NIPS.
[96] Trevor Darrell,et al. YouTube2Text: Recognizing and Describing Arbitrary Activities Using Semantic Hierarchies and Zero-Shot Recognition , 2013, 2013 IEEE International Conference on Computer Vision.
[97] Jean Maillard,et al. Black Holes and White Rabbits: Metaphor Identification with Visual Features , 2016, NAACL.
[98] Juhan Nam,et al. Multimodal Deep Learning , 2011, ICML.
[99] George Trigeorgis,et al. Deep Canonical Time Warping , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[100] Salim Roukos,et al. Bleu: a Method for Automatic Evaluation of Machine Translation , 2002, ACL.
[101] Raghavendra Udupa,et al. Learning Hash Functions for Cross-View Similarity Search , 2011, IJCAI.
[102] Yee Whye Teh,et al. A Fast Learning Algorithm for Deep Belief Nets , 2006, Neural Computation.
[103] Kate Saenko,et al. Integrating Language and Vision to Generate Natural Language Descriptions of Videos in the Wild , 2014, COLING.
[104] Yoshua Bengio,et al. Neural Machine Translation by Jointly Learning to Align and Translate , 2014, ICLR.
[105] Alexander J. Smola,et al. Stacked Attention Networks for Image Question Answering , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[106] Erik Cambria,et al. Deep Convolutional Neural Network Textual Features and Multiple Kernel Learning for Utterance-level Multimodal Sentiment Analysis , 2015, EMNLP.
[107] Loïc Kessous,et al. Emotion Recognition through Multiple Modalities: Face, Body Gesture, Speech , 2008, Affect and Emotion in Human-Computer Interaction.
[108] Quoc V. Le,et al. Grounded Compositional Semantics for Finding and Describing Images with Sentences , 2014, TACL.
[109] Geoffrey Zweig,et al. Language Models for Image Captioning: The Quirks and What Works , 2015, ACL.
[110] Geoffrey E. Hinton,et al. ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.
[111] Wenwu Zhu,et al. Deep Multimodal Hashing with Orthogonal Regularization , 2015, IJCAI.
[112] Dongqing Zhang,et al. Large-Scale Supervised Multimodal Hashing with Semantic Correlation Maximization , 2014, AAAI.
[113] Yejin Choi,et al. Baby talk: Understanding and generating simple image descriptions , 2011, CVPR 2011.
[114] Jeffrey Mark Siskind,et al. Grounded Language Learning from Video Described with Sentences , 2013, ACL.
[115] Carina Silberer,et al. Grounded Models of Semantic Representation , 2012, EMNLP.
[116] Alex Pentland,et al. Coupled hidden Markov models for complex action recognition , 1997, Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition.
[117] Qin Jin,et al. Multi-modal Dimensional Emotion Recognition using Recurrent Neural Networks , 2015, AVEC@ACM Multimedia.
[118] Yoshua Bengio,et al. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention , 2015, ICML.
[119] Ling Shao,et al. Multimodal Dynamic Networks for Gesture Recognition , 2014, ACM Multimedia.
[120] George Trigeorgis,et al. Adieu features? End-to-end speech emotion recognition using a deep convolutional recurrent network , 2016, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[121] Kevin P. Murphy,et al. A coupled HMM for audio-visual speech recognition , 2002, 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing.
[122] Christian Wolf,et al. ModDrop: Adaptive Multi-Modal Gesture Recognition , 2014, IEEE Trans. Pattern Anal. Mach. Intell..
[123] Peter Robinson,et al. Dimensional affect recognition using Continuous Conditional Random Fields , 2013, 2013 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG).
[124] Geoffrey E. Hinton,et al. Autoencoders, Minimum Description Length and Helmholtz Free Energy , 1993, NIPS.
[125] Björn W. Schuller,et al. AVEC 2011-The First International Audio/Visual Emotion Challenge , 2011, ACII.
[126] Hervé Bourlard,et al. A new ASR approach based on independent processing and recombination of partial frequency bands , 1996, Proceedings of Fourth International Conference on Spoken Language Processing. ICSLP '96.
[127] Yejin Choi,et al. Composing Simple Image Descriptions using Web-scale N-grams , 2011, CoNLL.
[128] Rainer Lienhart,et al. Comparison of automatic shot boundary detection algorithms , 1998, Electronic Imaging.
[129] Vicente Ordonez,et al. Im2Text: Describing Images Using 1 Million Captioned Photographs , 2011, NIPS.
[130] Salil Deena,et al. Speech-Driven Facial Animation Using a Shared Gaussian Process Latent Variable Model , 2009, ISVC.
[131] Ruslan Salakhutdinov,et al. Generating Images from Captions with Attention , 2015, ICLR.
[132] Francis Ferraro,et al. Visual Storytelling , 2016, NAACL.
[133] Lianhong Cai,et al. Multi-level Fusion of Audio and Visual Features for Speaker Identification , 2006, ICB.
[134] Trevor Darrell,et al. Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding , 2016, EMNLP.
[135] Sebastian Nowozin,et al. On feature combination for multiclass object classification , 2009, 2009 IEEE 12th International Conference on Computer Vision.
[136] Rainer Stiefelhagen,et al. Aligning plot synopses to videos for story-based retrieval , 2015, International Journal of Multimedia Information Retrieval.
[137] Hatice Gunes,et al. Continuous Prediction of Spontaneous Affect from Multiple Cues and Modalities in Valence-Arousal Space , 2011, IEEE Transactions on Affective Computing.
[138] Wu-Jun Li,et al. Deep Cross-Modal Hashing , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[139] Roland Göcke,et al. Extending Long Short-Term Memory for Multi-View Structured Learning , 2016, ECCV.
[140] Fernando De la Torre,et al. Facial Expression Analysis , 2011, Visual Analysis of Humans.
[141] Tara N. Sainath,et al. Deep Neural Networks for Acoustic Modeling in Speech Recognition , 2012, IEEE Signal Processing Magazine.
[142] Jean-Philippe Thiran,et al. Dynamic modality weighting for multi-stream HMMs in audio-visual speech recognition , 2008, ICMI '08.
[143] Jimmy Ba,et al. Adam: A Method for Stochastic Optimization , 2014, ICLR.
[144] Frank Keller,et al. Image Description using Visual Dependency Representations , 2013, EMNLP.
[145] Nitish Srivastava,et al. Dropout: a simple way to prevent neural networks from overfitting , 2014, J. Mach. Learn. Res..
[146] H. Hotelling. Relations Between Two Sets of Variates , 1936 .
[147] Jordi Luque,et al. Audio-to-text alignment for speech recognition with very limited resources , 2014, INTERSPEECH.
[148] Kate Saenko,et al. Ask, Attend and Answer: Exploring Question-Guided Spatial Attention for Visual Question Answering , 2015, ECCV.
[149] Geoffrey E. Hinton,et al. Speech recognition with deep recurrent neural networks , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.
[150] Dan Klein,et al. Neural Module Networks , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[151] Xu Jia,et al. Guiding the Long-Short Term Memory Model for Image Caption Generation , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).
[152] Richard Socher,et al. Dynamic Memory Networks for Visual and Textual Question Answering , 2016, ICML.
[153] Xinlei Chen,et al. Microsoft COCO Captions: Data Collection and Evaluation Server , 2015, ArXiv.
[154] Hwee Tou Ng,et al. Improving Statistical Machine Translation for a Resource-Poor Language Using Related Resource-Rich Languages , 2012, J. Artif. Intell. Res..
[155] Vladimir Pavlovic,et al. Isotonic CCA for sequence alignment and activity recognition , 2011, 2011 International Conference on Computer Vision.
[156] Sanja Fidler,et al. Order-Embeddings of Images and Language , 2015, ICLR.
[157] Christopher Joseph Pal,et al. Describing Videos by Exploiting Temporal Structure , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).
[158] Andrew Zisserman,et al. Two-Stream Convolutional Networks for Action Recognition in Videos , 2014, NIPS.
[159] Heiga Zen,et al. Statistical Parametric Speech Synthesis Based on Speaker and Language Factorization , 2012, IEEE Transactions on Audio, Speech, and Language Processing.
[160] Licheng Yu,et al. Modeling Context in Referring Expressions , 2016, ECCV.
[161] Karl Stratos,et al. Midge: Generating Image Descriptions From Computer Vision Detections , 2012, EACL.
[162] Petros Maragos,et al. Multimodal Saliency and Fusion for Movie Summarization Based on Aural, Visual, and Textual Attention , 2013, IEEE Transactions on Multimedia.
[163] Angeliki Lazaridou,et al. Is this a wampimuk? Cross-modal mapping between distributional semantics and the visual world , 2014, ACL.
[164] Heiga Zen,et al. WaveNet: A Generative Model for Raw Audio , 2016, SSW.
[165] Vaibhava Goel,et al. Deep multimodal learning for Audio-Visual Speech Recognition , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[166] Wei Xu,et al. Deep Captioning with Multimodal Recurrent Neural Networks (m-RNN) , 2014, ICLR.
[167] Antonio Torralba,et al. SoundNet: Learning Sound Representations from Unlabeled Video , 2016, NIPS.
[168] Dhruv Batra,et al. Analyzing the Behavior of Visual Question Answering Models , 2016, EMNLP.
[169] Wei Liu,et al. Multimedia classification and event detection using double fusion , 2013, Multimedia Tools and Applications.
[170] Kunio Fukunaga,et al. Natural Language Description of Human Activities from Video Images Based on Concept Hierarchy of Actions , 2002, International Journal of Computer Vision.
[171] Jürgen Schmidhuber,et al. Long Short-Term Memory , 1997, Neural Computation.
[172] L. Barsalou. Grounded cognition. , 2008, Annual review of psychology.
[173] Maja Pantic,et al. The SEMAINE corpus of emotionally coloured character interactions , 2010, 2010 IEEE International Conference on Multimedia and Expo.
[174] Jason Weston,et al. WSABIE: Scaling Up to Large Vocabulary Image Annotation , 2011, IJCAI.
[175] Yansong Feng,et al. Visual Information in Semantic Representation , 2010, NAACL.
[176] Juan Carlos Niebles,et al. Leveraging Video Descriptions to Learn Video Question Answering , 2016, AAAI.
[177] H. McGurk,et al. Hearing lips and seeing voices , 1976, Nature.
[178] Chalapathy Neti,et al. Recent advances in the automatic recognition of audiovisual speech , 2003, Proc. IEEE.
[179] Raman Arora,et al. Multi-view CCA-based acoustic features for phonetic recognition across speakers and domains , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.
[180] Roger Levy,et al. A new approach to cross-modal multimedia retrieval , 2010, ACM Multimedia.
[181] Theodoros Iliou,et al. Features and classifiers for emotion recognition from speech: a survey from 2000 to 2011 , 2012, Artificial Intelligence Review.
[182] Rainer Stiefelhagen,et al. Book2Movie: Aligning video scenes with book chapters , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[183] Armand Joulin,et al. Deep Fragment Embeddings for Bidirectional Image Sentence Mapping , 2014, NIPS.
[184] Carina Silberer,et al. Learning Grounded Meaning Representations with Autoencoders , 2014, ACL.
[185] Stéphane Ayache,et al. Majority Vote of Diverse Classifiers for Late Fusion , 2014, S+SSPR.
[186] Alan W. Black,et al. Unit selection in a concatenative speech synthesis system using a large speech database , 1996, 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings.
[187] Liang Lin,et al. I2T: Image Parsing to Text Description , 2010, Proceedings of the IEEE.
[188] Quoc V. Le,et al. Sequence to Sequence Learning with Neural Networks , 2014, NIPS.
[189] Jianping Yin,et al. Multiple Kernel Learning in the Primal for Multimodal Alzheimer’s Disease Classification , 2013, IEEE Journal of Biomedical and Health Informatics.
[190] Ali Farhadi,et al. Describing objects by their attributes , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.
[191] Behrooz Mahasseni,et al. Regularizing Long Short Term Memory with 3D Human-Skeleton Sequences for Action Recognition , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[192] Wei Xu,et al. Video Paragraph Captioning Using Hierarchical Recurrent Neural Networks , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[193] Nitish Srivastava,et al. Learning Representations for Multimodal Data with Deep Belief Nets , 2012 .
[194] Fernando De la Torre,et al. Canonical Time Warping for Alignment of Human Behavior , 2009, NIPS.
[195] Trevor Darrell,et al. Multi-View Learning in the Presence of View Disagreement , 2008, UAI 2008.
[196] Stephen Clark,et al. Grounding Semantics in Olfactory Perception , 2015, ACL.
[197] Haohan Wang,et al. Multimodal Transfer Deep Learning for Audio Visual Recognition , 2014, ArXiv.
[198] Sidney K. D'Mello,et al. A Review and Meta-Analysis of Multimodal Affect Detection Systems , 2015, ACM Comput. Surv..
[199] Jason Weston,et al. Web Scale Image Annotation: Learning to Rank with Joint Word-Image Embeddings , 2010, ECML.
[200] Yale Song,et al. Multimodal human behavior analysis: learning correlation and interaction across modalities , 2012, ICMI '12.
[201] David A. Forsyth,et al. Matching Words and Pictures , 2003, J. Mach. Learn. Res..
[202] Jeff A. Bilmes,et al. Deep Canonical Correlation Analysis , 2013, ICML.
[203] Qi Tian,et al. A survey of recent advances in visual feature detection , 2015, Neurocomputing.
[204] Andrew McCallum,et al. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data , 2001, ICML.
[205] Nikos Paragios,et al. Data fusion through cross-modality metric learning using similarity-sensitive hashing , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.
[206] Koray Kavukcuoglu,et al. Pixel Recurrent Neural Networks , 2016, ICML.
[207] Marcel Worring,et al. Multimodal Video Indexing: A Review of the State-of-the-art , 2001 .
[208] Eduard H. Hovy,et al. Automatic Evaluation of Summaries Using N-gram Co-occurrence Statistics , 2003, NAACL.
[209] Tao Mei,et al. Jointly Modeling Embedding and Translation to Bridge Video and Language , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[210] Louis-Philippe Morency,et al. Deep multimodal fusion for persuasiveness prediction , 2016, ICMI.
[211] Quoc V. Le,et al. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition , 2015, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[212] Tobias Scheffer,et al. Multi-Relational Learning, Text Mining, and Semi-Supervised Learning for Functional Genomics , 2004, Machine Learning.
[213] Andrew Zisserman,et al. Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.
[214] Sanja Fidler,et al. Aligning Books and Movies: Towards Story-Like Visual Explanations by Watching Movies and Reading Books , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).
[215] Michael I. Jordan,et al. Factorial Hidden Markov Models , 1995, Machine Learning.
[216] Ben Taskar,et al. Movie/Script: Alignment and Parsing of Video and Text Transcription , 2008, ECCV.
[217] Yiannis Aloimonos,et al. Corpus-Guided Sentence Generation of Natural Images , 2011, EMNLP.
[218] Adwait Ratnaparkhi,et al. Trainable Methods for Surface Natural Language Generation , 2000, ANLP.
[219] Yu-Chiang Frank Wang,et al. A Novel Multiple Kernel Learning Framework for Heterogeneous Feature Fusion and Variable Selection , 2012, IEEE Transactions on Multimedia.
[220] Margaret Mitchell,et al. VQA: Visual Question Answering , 2015, International Journal of Computer Vision.
[221] Xiaogang Wang,et al. Multi-source Deep Learning for Human Pose Estimation , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.
[222] Yoshua Bengio,et al. Understanding the difficulty of training deep feedforward neural networks , 2010, AISTATS.
[223] Christopher Potts,et al. Text to 3D Scene Generation with Rich Lexical Grounding , 2015, ACL.
[224] Louis-Philippe Morency,et al. Modeling Latent Discriminative Dynamic of Multi-dimensional Affective Signals , 2011, ACII.
[225] Gwen Littlewort,et al. Multiple kernel learning for emotion recognition in the wild , 2013, ICMI '13.
[226] Fei-Fei Li,et al. Deep visual-semantic alignments for generating image descriptions , 2015, CVPR.
[227] Nazli Ikizler-Cinbis,et al. Automatic Description Generation from Images: A Survey of Models, Datasets, and Evaluation Measures , 2016, J. Artif. Intell. Res..
[228] Pushpak Bhattacharyya,et al. Everybody loves a rich cousin: An empirical study of transliteration through bridge languages , 2010, NAACL.
[229] Andrew Y. Ng,et al. Zero-Shot Learning Through Cross-Modal Transfer , 2013, NIPS.
[230] Ethem Alpaydin,et al. Multiple Kernel Learning Algorithms , 2011, J. Mach. Learn. Res..
[231] Marc'Aurelio Ranzato,et al. DeViSE: A Deep Visual-Semantic Embedding Model , 2013, NIPS.
[232] Fei-Fei Li,et al. Connecting modalities: Semi-supervised segmentation and annotation of images using unaligned text corpora , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.
[233] Trevor Darrell,et al. Deep Compositional Captioning: Describing Novel Object Categories without Paired Training Data , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[234] Frank K. Soong,et al. TTS synthesis with bidirectional LSTM based recurrent neural networks , 2014, INTERSPEECH.
[235] Colin Fyfe,et al. Kernel and Nonlinear Canonical Correlation Analysis , 2000, IJCNN.
[236] Elia Bruni,et al. Multimodal Distributional Semantics , 2014, J. Artif. Intell. Res..
[237] Eric P. Xing,et al. Learning Concept Taxonomies from Multi-modal Data , 2016, ACL.
[238] Yin Li,et al. Learning Deep Structure-Preserving Image-Text Embeddings , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[239] Robert L. Mercer,et al. The Mathematics of Statistical Machine Translation: Parameter Estimation , 1993, CL.
[240] Cordelia Schmid,et al. Weakly-Supervised Alignment of Video with Text , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).
[241] Anoop Sarkar,et al. Applying Co-Training Methods to Statistical Parsing , 2001, NAACL.
[242] Sanja Fidler,et al. A Sentence Is Worth a Thousand Pixels , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.
[243] Björn W. Schuller,et al. AVEC 2013: the continuous audio/visual emotion and depression recognition challenge , 2013, AVEC@ACM Multimedia.
[244] Nitish Srivastava,et al. Multimodal learning with deep Boltzmann machines , 2012, J. Mach. Learn. Res..
[245] Peter Young,et al. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions , 2014, TACL.
[246] Jun Wang,et al. Exploring Inter-feature and Inter-class Relationships with Deep Neural Networks for Video Classification , 2014, ACM Multimedia.
[247] Jason Weston,et al. Large scale image annotation: learning to rank with joint word-image embeddings , 2010, Machine Learning.
[248] Zheru Chi,et al. Emotion Recognition in the Wild with Feature Fusion and Multiple Kernel Learning , 2014, ICMI.
[249] Wei Xu,et al. Are You Talking to a Machine? Dataset and Methods for Multilingual Image Question , 2015, NIPS.
[250] Heiga Zen,et al. Statistical Parametric Speech Synthesis , 2007, IEEE International Conference on Acoustics, Speech, and Signal Processing.
[251] Gert R. G. Lanckriet,et al. Learning Multi-modal Similarity , 2010, J. Mach. Learn. Res..
[252] Max M. Louwerse,et al. Symbol Interdependency in Symbolic and Embodied Cognition , 2011, Top. Cogn. Sci..
[253] Yueting Zhuang,et al. The classification of multi-modal data with hidden conditional random field , 2015, Pattern Recognit. Lett..
[254] Radu Horaud,et al. Audio-Visual Speaker Diarization Based on Spatiotemporal Bayesian Fusion , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.
[255] Gwenn Englebienne,et al. Multimodal Speaker Diarization , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.
[256] Ronan Collobert,et al. Phrase-based Image Captioning , 2015, ICML.
[257] Jeffrey P. Bigham,et al. VizWiz: nearly real-time answers to visual questions , 2010, W4A.
[258] Cyrus Rashtchian,et al. Every Picture Tells a Story: Generating Sentences from Images , 2010, ECCV.
[259] Ruifan Li,et al. Cross-modal Retrieval with Correspondence Autoencoder , 2014, ACM Multimedia.
[260] Vladimir Pavlovic,et al. Boosted learning in dynamic Bayesian networks for multimodal speaker detection , 2003, Proc. IEEE.
[261] C. V. Jawahar,et al. Choosing Linguistics over Vision to Describe Images , 2012, AAAI.
[262] J. Kruskal. An Overview of Sequence Comparison: Time Warps, String Edits, and Macromolecules , 1983 .