Natural representation of composite data with replicated autoencoders

Generative processes in biology and other fields often produce data that can be regarded as resulting from a composition of basic features. Here we present an unsupervised method based on autoencoders for inferring these basic features of data. The main novelty in our approach is that the training is based on the optimization of the ‘local entropy’ rather than the standard loss, resulting in a more robust inference and considerably enhancing the performance on this type of data. Algorithmically, this is realized by training an interacting system of replicated autoencoders. We apply this method to synthetic and protein sequence data, and show that it is able to infer a hidden representation that correlates well with the underlying generative process, without requiring any prior knowledge.

AUTHOR SUMMARY

Extracting compositional features from noisy data and identifying the corresponding generative models is a fundamental challenge across the sciences. The composition of elementary features can have highly non-linear effects, which makes the features very hard to identify from experimental data. In biology, for instance, one challenge is to identify the key steps or components of molecular and cellular processes. Representative examples are the modeling of protein sequences as the composition of patterns influenced by phylogeny, or the identification of gene clusters in which the presence of specific genes depends on the evolutionary history of the cell. Here we present an unsupervised machine learning technique for the analysis of compositional data which is based on entropic neural autoencoders. Our approach aims at finding deep autoencoders that are highly invariant with respect to perturbations in the inputs and in the parameters.
The procedure is efficient to implement, and we have validated it on both synthetic and protein sequence data, where we show that the latent variables of the autoencoders are non-trivially correlated with the true underlying generative processes. Our results suggest that the local entropy approach is a valuable general tool for the extraction of compositional features in hard unsupervised learning problems.
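
As a rough illustration of what "an interacting system of replicated autoencoders" can look like in practice, the following minimal sketch trains several copies of a small autoencoder, each following the gradient of its own reconstruction loss plus an elastic term that pulls it toward the replica average. This is only one common way of approximating a local-entropy objective and is not taken from the paper: the one-hidden-layer tanh architecture, the fixed coupling strength `gamma`, plain full-batch gradient descent, and all function and parameter names are assumptions made for the example.

```python
import numpy as np


def reconstruction_loss_and_grads(W_enc, W_dec, X):
    """Squared reconstruction error of a one-hidden-layer tanh autoencoder.

    Returns the loss averaged over samples and its gradients with respect
    to the encoder and decoder weights.
    """
    n = X.shape[0]
    H = np.tanh(X @ W_enc)                # hidden representation, shape (n, h)
    R = H @ W_dec - X                     # reconstruction residual, shape (n, d)
    loss = 0.5 * np.mean(np.sum(R ** 2, axis=1))
    g_dec = H.T @ R / n                   # gradient w.r.t. decoder weights
    dH = (R @ W_dec.T) * (1.0 - H ** 2)   # backpropagate through tanh
    g_enc = X.T @ dH / n                  # gradient w.r.t. encoder weights
    return loss, g_enc, g_dec


def train_replicated(X, n_hidden, n_replicas=3, gamma=0.1, lr=0.05,
                     steps=200, seed=0):
    """Train coupled replicas (illustrative sketch, hypothetical names).

    Each replica a minimizes L(theta_a) + (gamma / 2) * ||theta_a - mean||^2,
    where mean is the average of the parameters over all replicas.
    """
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    # Assumption: replicas start as small perturbations of a shared
    # initialization, so that averaging their weights is meaningful.
    base_enc = rng.normal(scale=0.1, size=(d, n_hidden))
    base_dec = rng.normal(scale=0.1, size=(n_hidden, d))
    enc = base_enc + 0.01 * rng.normal(size=(n_replicas, d, n_hidden))
    dec = base_dec + 0.01 * rng.normal(size=(n_replicas, n_hidden, d))

    history = []
    for _ in range(steps):
        mean_enc = enc.mean(axis=0)
        mean_dec = dec.mean(axis=0)
        losses = []
        for a in range(n_replicas):
            loss, g_enc, g_dec = reconstruction_loss_and_grads(enc[a], dec[a], X)
            losses.append(loss)
            # Loss gradient plus elastic pull toward the replica average.
            enc[a] -= lr * (g_enc + gamma * (enc[a] - mean_enc))
            dec[a] -= lr * (g_dec + gamma * (dec[a] - mean_dec))
        history.append(float(np.mean(losses)))

    # The averaged replica serves as the final model in this sketch.
    return enc.mean(axis=0), dec.mean(axis=0), history
```

Calling `train_replicated(X, n_hidden=4)` on a data matrix `X` with one sample per row returns the averaged encoder and decoder weights together with the per-step mean reconstruction loss. Related schemes often increase the coupling gradually during training so that the replicas eventually collapse onto a single configuration; the fixed `gamma` here is a simplification.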
