Representation Learning with Contrastive Predictive Coding

While supervised learning has enabled great progress in many applications, unsupervised learning has not seen such widespread adoption and remains an important and challenging endeavor for artificial intelligence. In this work, we propose a universal unsupervised learning approach to extract useful representations from high-dimensional data, which we call Contrastive Predictive Coding. The key insight of our model is to learn such representations by predicting the future in latent space using powerful autoregressive models. We use a probabilistic contrastive loss that induces the latent space to capture information that is maximally useful for predicting future samples; negative sampling also keeps the model tractable. While most prior work has focused on evaluating representations for a particular modality, we demonstrate that our approach learns useful representations that achieve strong performance across four distinct domains: speech, images, text and reinforcement learning in 3D environments.
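
The sketch below is a minimal NumPy illustration of the contrastive, negative-sampling objective the abstract describes: a context vector from the autoregressive model scores the true future latent against a set of negative samples, and the loss is the cross-entropy of identifying the positive. The function name `info_nce_loss` and the bilinear predictor `W_k` are illustrative assumptions, not necessarily the paper's exact parameterization.

```python
import numpy as np

def info_nce_loss(c_t, z_pos, z_negs, W_k):
    """Contrastive loss for a single prediction step k (illustrative sketch).

    c_t    : (d,)    context vector from the autoregressive model
    z_pos  : (d,)    latent encoding of the true sample k steps ahead
    z_negs : (N, d)  latent encodings of N negative samples
    W_k    : (d, d)  hypothetical bilinear weights for step k (assumed form)
    """
    pred = W_k @ c_t                                          # predict the future latent from context
    scores = np.concatenate(([z_pos @ pred], z_negs @ pred))  # positive score first, then negatives
    scores -= scores.max()                                    # numerical stability for the softmax
    log_probs = scores - np.log(np.exp(scores).sum())         # log-softmax over 1 + N candidates
    return -log_probs[0]                                      # cross-entropy of picking the positive

# Toy usage with random vectors standing in for encoder and AR outputs.
rng = np.random.default_rng(0)
d, N = 8, 16
loss = info_nce_loss(rng.normal(size=d), rng.normal(size=d),
                     rng.normal(size=(N, d)), rng.normal(size=(d, d)))
```

Because the softmax runs over only one positive and a handful of sampled negatives, the normalizer never needs to sum over the full data distribution, which is what makes the objective tractable in the sense described above.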
