Optimal transport-based machine learning to match specific expression patterns in omics data

We present two algorithms designed to learn a pattern of correspondence between two data sets in situations where it is desirable to match elements whose relationship belongs to a known parametric model. In the motivating case study, the challenge is to better understand micro-RNA (miRNA) regulation in the striatum of Huntington’s disease (HD) model mice. The two data sets contain miRNA and messenger-RNA (mRNA) data, respectively, with each data point consisting of a multi-dimensional profile. The biological hypothesis is that if a miRNA induces the degradation of a target mRNA, blocks its translation into proteins, or both, then the profile of the miRNA should be similar to the negative of the profile of the mRNA (a particular form of affine relationship). The algorithms unfold in two stages. During the first stage, an optimal transport plan P and an optimal affine transformation are learned, using the Sinkhorn-Knopp algorithm and mini-batch gradient descent. During the second stage, P is exploited to derive either several co-clusters or several sets of matched elements. We share code that implements our algorithms. A simulation study illustrates how they work and how well they perform. A brief summary of the real data application in the motivating case study further illustrates the applicability and interest of the algorithms.
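The first stage can be illustrated with a minimal sketch; this is not the authors' implementation. It assumes uniform marginals, a squared Euclidean cost, and an affine map fixed to scale -1 and shift 0 (the "negative profile" hypothesis) rather than one learned by mini-batch gradient descent. The function names `sinkhorn_knopp` and `affine_cost`, the regularization value, and the toy data are all hypothetical choices made for illustration.

```python
import numpy as np

def sinkhorn_knopp(cost, reg=0.05, n_iter=200):
    """Entropy-regularized transport plan between uniform marginals."""
    n, m = cost.shape
    a = np.full(n, 1.0 / n)          # row marginal (assumed uniform)
    b = np.full(m, 1.0 / m)          # column marginal (assumed uniform)
    K = np.exp(-cost / reg)          # Gibbs kernel
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iter):
        u = a / (K @ v)              # rescale rows
        v = b / (K.T @ u)            # rescale columns
    return u[:, None] * K * v[None, :]

def affine_cost(X, Y, scale, shift):
    """Squared distance between each row of X and each row of scale * Y + shift."""
    T = scale * Y + shift
    diff = X[:, None, :] - T[None, :, :]
    return (diff ** 2).sum(axis=2)

# Toy data (hypothetical): six "miRNA" profiles, each peaking in a different
# condition, and six "mRNA" profiles that are shuffled, noisy negatives of them.
rng = np.random.default_rng(0)
n = 6
X = np.eye(n)
perm = rng.permutation(n)
Y = -X[perm] + 0.01 * rng.normal(size=(n, n))

# Affine map fixed here to scale = -1, shift = 0 (an assumption of this sketch).
cost = affine_cost(X, Y, scale=-1.0, shift=0.0)
cost = cost / cost.max()             # normalize so reg has a comparable scale
P = sinkhorn_knopp(cost)

# For each column (mRNA), the row (miRNA) that receives the most mass.
matched = P.argmax(axis=0)
```

On this toy example the plan P concentrates its mass on the pairs built to be negatives of each other, so `matched` recovers `perm`. In the setting described above, the scale and shift would instead be updated by gradient steps alternating with the Sinkhorn-Knopp iterations, and P would then feed the second stage (co-clustering or matching).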
