The Gene Mover's Distance: Single-cell similarity via Optimal Transport

This paper introduces the Gene Mover’s Distance, a measure of similarity between a pair of cells based on their gene expression profiles obtained via single-cell RNA sequencing. The underlying idea is to interpret the gene expression array of a single cell as a discrete probability measure. The distance between two cells is then computed by solving an Optimal Transport problem between the two corresponding discrete measures. In the Optimal Transport model, we use two types of cost function to measure the distance between a pair of genes. The first cost function exploits a gene embedding, called gene2vec, which maps each gene to a high-dimensional vector: the cost of moving a unit of mass of gene expression from one gene to another is set to the Euclidean distance between the corresponding embedded vectors. The second cost function is based on a Pearson distance between pairs of genes. With both cost functions, the more correlated two genes are, the lower their distance. We use the Gene Mover’s Distance to solve two classification problems: the classification of cells according to their condition and according to their type. To assess the impact of the new metric, we compare the performance of a k-Nearest Neighbor classifier under different distances. The computational results show that the Gene Mover’s Distance is competitive with the state-of-the-art distances used in the literature.
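
The construction described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`to_measure`, `embedding_cost`, `pearson_cost`, `gene_movers_distance`) are hypothetical, the Pearson distance is assumed to be one minus the Pearson correlation, the gene embeddings are toy vectors rather than real gene2vec output, and the transport problem is solved with a generic linear programming solver rather than a specialized minimum cost flow algorithm.

```python
import numpy as np
from scipy.optimize import linprog
from scipy.spatial.distance import cdist


def to_measure(expression):
    """Normalize a non-negative expression vector so that it sums to 1."""
    x = np.asarray(expression, dtype=float)
    return x / x.sum()


def embedding_cost(embeddings):
    """Euclidean distance between every pair of embedded gene vectors."""
    return cdist(embeddings, embeddings, metric="euclidean")


def pearson_cost(expression_matrix):
    """Assumed Pearson distance: 1 - correlation between genes.

    Rows are cells, columns are genes. Genes with constant expression
    would yield NaN and should be filtered out beforehand.
    """
    return 1.0 - np.corrcoef(expression_matrix, rowvar=False)


def gene_movers_distance(cell_a, cell_b, cost):
    """Optimal transport cost between two cells, solved as a linear program."""
    mu, nu = to_measure(cell_a), to_measure(cell_b)
    n = len(mu)
    # The transport plan is flattened row-major: entry i * n + j is the
    # mass moved from gene i (in cell_a) to gene j (in cell_b).
    # Row sums: mass leaving gene i must equal mu[i].
    row_sums = np.kron(np.eye(n), np.ones((1, n)))
    # Column sums: mass arriving at gene j must equal nu[j].
    col_sums = np.kron(np.ones((1, n)), np.eye(n))
    result = linprog(
        np.asarray(cost, dtype=float).ravel(),
        A_eq=np.vstack([row_sums, col_sums]),
        b_eq=np.concatenate([mu, nu]),
        bounds=(0, None),
        method="highs",
    )
    return result.fun


# Toy example: three genes embedded on a line at positions 0, 1 and 2.
toy_embeddings = np.array([[0.0], [1.0], [2.0]])
toy_cost = embedding_cost(toy_embeddings)

# All expression sits on gene 0 in one cell and on gene 2 in the other,
# so the whole unit of mass travels a distance of 2.
toy_distance = gene_movers_distance([5, 0, 0], [0, 0, 3], toy_cost)
```

Here `toy_distance` is approximately 2.0, and the distance between a cell and itself is 0. The dense linear program has one variable per pair of genes, so this sketch is only practical for small gene sets. A matrix of such pairwise distances between cells can then be handed to any k-Nearest Neighbor implementation that accepts precomputed distances, for example scikit-learn's `KNeighborsClassifier` with `metric="precomputed"`.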
