Multimodal Sparse Representation Learning and Cross-Modal Synthesis

Humans have a natural ability to process and relate concurrent sensations in different sensory modalities such as vision, hearing, smell, and taste. In order for artificial intelligence to be more human-like in their capabilities, it needs to be able to interpret and translate multimodal information. However, multimodal data are heterogenous, and the relationship between modalities is often complex. For example, there exist a number of correct ways to draw an image given a text description. Similarly, many text descriptions can be valid for an image. Summarizing and translating multimodal data is therefore challenging. In this thesis, I describe multimodal sparse coding schemes that can learn to represent multiple data modalities jointly. A key premise behind joint sparse coding is the representational power that captures complementary information while reducing statistical redundancy. As a result, my schemes can improve the performance of classification and retrieval tasks involving co-occurring data modalities. Building on the deep learning framework, I also present probabilistic generative models that produce new data conditioned on an input from another data modality. Specifically, I develop text-to-image synthesis models based on generative adversarial networks (GAN). To improve the visual realism and the diversity of generated images, I propose additional objective functions and a new GAN architecture. Furthermore, I propose a novel sampling strategy for training data that promotes output diversity under adversarial setting.

[1]  Pierre Vandergheynst,et al.  Learning Bimodal Structure in Audio–Visual Data , 2009, IEEE Transactions on Neural Networks.

[2]  Thomas F. Quatieri,et al.  Detecting Depression using Vocal, Facial and Semantic Communication Cues , 2016, AVEC@ACM Multimedia.

[3]  H. Hotelling Relations Between Two Sets of Variates , 1936 .

[4]  Philip S. Yu,et al.  Partially Supervised Classification of Text Documents , 2002, ICML.

[5]  D.P. Skinner,et al.  The cepstrum: A guide to processing , 1977, Proceedings of the IEEE.

[6]  Bernt Schiele,et al.  Learning What and Where to Draw , 2016, NIPS.

[7]  Bernhard Schölkopf,et al.  Estimating the Support of a High-Dimensional Distribution , 2001, Neural Computation.

[8]  Rongrong Ji,et al.  Large-scale visual sentiment ontology and detectors using adjective noun pairs , 2013, ACM Multimedia.

[9]  Dumitru Erhan,et al.  Going deeper with convolutions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[10]  Yoshua Bengio,et al.  ChatPainter: Improving Text to Image Generation using Dialogue , 2018, ICLR.

[11]  Sridhar Krishna Nemala,et al.  Sparse coding for speech recognition , 2010, 2010 IEEE International Conference on Acoustics, Speech and Signal Processing.

[12]  Yueting Zhuang,et al.  Supervised Coupled Dictionary Learning with Group Structures for Multi-modal Retrieval , 2013, AAAI.

[13]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[14]  Bernt Schiele,et al.  Learning Deep Representations of Fine-Grained Visual Descriptions , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[15]  Alex Krizhevsky,et al.  Learning Multiple Layers of Features from Tiny Images , 2009 .

[16]  Wojciech Zaremba,et al.  Improved Techniques for Training GANs , 2016, NIPS.

[17]  David J. Field,et al.  Sparse coding with an overcomplete basis set: A strategy employed by V1? , 1997, Vision Research.

[18]  Yike Guo,et al.  I2T2I: Learning text to image synthesis with textual data augmentation , 2017, 2017 IEEE International Conference on Image Processing (ICIP).

[19]  Andrew Zisserman,et al.  Automated Flower Classification over a Large Number of Classes , 2008, 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing.

[20]  Alberto Del Bimbo,et al.  A multimodal feature learning approach for sentiment analysis of social network multimedia , 2016, Multimedia Tools and Applications.

[21]  Yoshua Bengio,et al.  Extracting and composing robust features with denoising autoencoders , 2008, ICML '08.

[22]  Xiaoli Li,et al.  Learning to Classify Texts Using Positive and Unlabeled Data , 2003, IJCAI.

[23]  John Shawe-Taylor,et al.  Canonical Correlation Analysis: An Overview with Application to Learning Methods , 2004, Neural Computation.

[24]  Sandeep Subramanian,et al.  Adversarial Generation of Natural Language , 2017, Rep4NLP@ACL.

[25]  Thomas S. Huang,et al.  Image Super-Resolution Via Sparse Representation , 2010, IEEE Transactions on Image Processing.

[26]  Sergey Ioffe,et al.  Rethinking the Inception Architecture for Computer Vision , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[27]  Alexei A. Efros,et al.  Toward Multimodal Image-to-Image Translation , 2017, NIPS.

[28]  Douglas E. Sturim,et al.  Language Recognition via Sparse Coding , 2016, INTERSPEECH.

[29]  Matthijs C. Dorst Distinctive Image Features from Scale-Invariant Keypoints , 2011 .

[30]  Michael S. Bernstein,et al.  ImageNet Large Scale Visual Recognition Challenge , 2014, International Journal of Computer Vision.

[31]  Marcus Liwicki,et al.  TAC-GAN - Text Conditioned Auxiliary Classifier Generative Adversarial Network , 2017, ArXiv.

[32]  Dieter Fox,et al.  Multipath Sparse Coding Using Hierarchical Matching Pursuit , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[33]  Jonathon Shlens,et al.  Conditional Image Synthesis with Auxiliary Classifier GANs , 2016, ICML.

[34]  Thomas S. Huang,et al.  Image super-resolution as sparse representation of raw image patches , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[35]  John D. Lafferty,et al.  Learning image representations from the pixel level via hierarchical sparse coding , 2011, CVPR 2011.

[36]  Dean P. Foster,et al.  Finding Linear Structure in Large Datasets with Scalable Canonical Correlation Analysis , 2015, ICML.

[37]  Simon Osindero,et al.  Conditional Generative Adversarial Nets , 2014, ArXiv.

[38]  Leon A. Gatys,et al.  A Neural Algorithm of Artistic Style , 2015, ArXiv.

[39]  William M. Campbell,et al.  Multi-Modal Audio, Video and Physiological Sensor Learning for Continuous Emotion Prediction , 2016, AVEC@ACM Multimedia.

[40]  Yoshua Bengio,et al.  Generative Adversarial Nets , 2014, NIPS.

[41]  Alexei A. Efros,et al.  Image-to-Image Translation with Conditional Adversarial Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[42]  Michael Elad,et al.  Image Denoising Via Sparse and Redundant Representations Over Learned Dictionaries , 2006, IEEE Transactions on Image Processing.

[43]  Trevor Darrell,et al.  Factorized Latent Spaces with Structured Sparsity , 2010, NIPS.

[44]  Krystian Mikolajczyk,et al.  Deep correlation for matching images and text , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[45]  Malik Yousef,et al.  One-Class SVMs for Document Classification , 2002, J. Mach. Learn. Res..

[46]  Jeff A. Bilmes,et al.  Deep Canonical Correlation Analysis , 2013, ICML.

[47]  Seunghoon Hong,et al.  Inferring Semantic Layout for Hierarchical Text-to-Image Synthesis , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[48]  Soumith Chintala,et al.  Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks , 2015, ICLR.

[49]  Jon Gauthier Conditional generative adversarial nets for convolutional face generation , 2015 .

[50]  Li Fei-Fei,et al.  Perceptual Losses for Real-Time Style Transfer and Super-Resolution , 2016, ECCV.

[51]  Christian Ledig,et al.  Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[52]  Zhe Gan,et al.  AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[53]  Leon A. Gatys,et al.  Texture Synthesis Using Convolutional Neural Networks , 2015, NIPS.

[54]  Chong-sun Kim Canonical Analysis of Several Sets of Variables , 1973 .

[55]  Bernt Schiele,et al.  Generative Adversarial Text to Image Synthesis , 2016, ICML.

[56]  Rob Fergus,et al.  Deep Generative Image Models using a Laplacian Pyramid of Adversarial Networks , 2015, NIPS.

[57]  Mike Thelwall,et al.  Sentiment in short strength detection informal text , 2010 .

[58]  Juhan Nam,et al.  Multimodal Deep Learning , 2011, ICML.

[59]  Allen Y. Yang,et al.  Robust Face Recognition via Sparse Representation , 2009, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[60]  H. T. Kung,et al.  Twitter Geolocation and Regional Classification via Sparse Coding , 2015, ICWSM.

[61]  Guillermo Sapiro,et al.  Online dictionary learning for sparse coding , 2009, ICML '09.

[62]  Aaron C. Courville,et al.  Improved Training of Wasserstein GANs , 2017, NIPS.

[63]  R. Tibshirani Regression Shrinkage and Selection via the Lasso , 1996 .

[64]  Lior Wolf,et al.  Language Generation with Recurrent Generative Adversarial Networks without Pre-training , 2017, ArXiv.

[65]  Colin Fyfe,et al.  Kernel and Nonlinear Canonical Correlation Analysis , 2000, IJCNN.

[66]  Lin Yang,et al.  Photographic Text-to-Image Synthesis with a Hierarchically-Nested Adversarial Network , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[67]  H. T. Kung,et al.  Adversarial nets with perceptual losses for text-to-image synthesis , 2017, 2017 IEEE 27th International Workshop on Machine Learning for Signal Processing (MLSP).

[68]  Alexei A. Efros,et al.  Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[69]  Yin Li,et al.  Learning Deep Structure-Preserving Image-Text Embeddings , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[70]  Guillermo Sapiro,et al.  Sparse Representation for Computer Vision and Pattern Recognition , 2010, Proceedings of the IEEE.

[71]  David Pfau,et al.  Unrolled Generative Adversarial Networks , 2016, ICLR.

[72]  Albert Brown Lyons Plant names, scientific and popular , 2013 .

[73]  Xiaodong Liu,et al.  Language-Based Image Editing with Recurrent Attentive Models , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[74]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[75]  Nitish Srivastava,et al.  Multimodal learning with deep Boltzmann machines , 2012, J. Mach. Learn. Res..

[76]  Joel A. Tropp,et al.  Signal Recovery From Random Measurements Via Orthogonal Matching Pursuit , 2007, IEEE Transactions on Information Theory.

[77]  Michael I. Jordan,et al.  Latent Dirichlet Allocation , 2001, J. Mach. Learn. Res..

[78]  Jaakko Lehtinen,et al.  Progressive Growing of GANs for Improved Quality, Stability, and Variation , 2017, ICLR.

[79]  Gang Hua,et al.  CVAE-GAN: Fine-Grained Image Generation through Asymmetric Training , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[80]  Dimitris N. Metaxas,et al.  StackGAN: Text to Photo-Realistic Image Synthesis with Stacked Generative Adversarial Networks , 2016, 2017 IEEE International Conference on Computer Vision (ICCV).

[81]  Patrick Poirson,et al.  Multimodal Stacked Denoising Autoencoders , 2013 .

[82]  Zhou Wang,et al.  Multiscale structural similarity for image quality assessment , 2003, The Thrity-Seventh Asilomar Conference on Signals, Systems & Computers, 2003.

[83]  Douglas E. Sturim,et al.  Support vector machines using GMM supervectors for speaker verification , 2006, IEEE Signal Processing Letters.