MREC: a fast and versatile framework for aligning and matching data with applications to single cell molecular data

Comparing and aligning large data sets is a pervasive problem across many knowledge domains. We introduce and study MREC, a recursive decomposition algorithm for computing matchings between data sets. The basic idea is to partition the data, match the partitions, and then recursively match the points within each pair of matched partitions. The matching itself is done using black-box matching procedures that are too expensive to run on the entire data set. Because the framework uses an absolute measure of matching quality, it supports optimization over parameters, including the partitioning procedure and the matching algorithm. By design, MREC can be applied to extremely large data sets. We analyze the procedure to describe when we can expect it to work well, and we demonstrate its flexibility and power by applying it to a number of alignment problems arising in the analysis of single cell molecular data.
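The partition, match, and recurse idea can be illustrated with a minimal Python sketch. It is not the authors' implementation, and every name in it (`greedy_match`, `voronoi_partition`, `mrec`, `mean_distortion`) is hypothetical. It assumes both data sets are lists of coordinate tuples in the same Euclidean space, uses a simple greedy pairing as a stand-in for the expensive black-box matcher, partitions by nearest randomly chosen center, and scores a matching by how well it preserves pairwise distances. The abstract does not specify any of these choices.

```python
import math
import random


def greedy_match(xs, ys):
    """Stand-in black box (assumption): map every point of xs to an index in ys.

    Closest unused pairs are matched first; if ys runs out, the
    remaining points of xs are sent to their nearest point of ys.
    """
    pairs = sorted((math.dist(x, y), i, j)
                   for i, x in enumerate(xs) for j, y in enumerate(ys))
    match, used = {}, set()
    for _, i, j in pairs:
        if i not in match and j not in used:
            match[i] = j
            used.add(j)
    for i, x in enumerate(xs):
        if i not in match:
            match[i] = min(range(len(ys)), key=lambda j: math.dist(x, ys[j]))
    return [match[i] for i in range(len(xs))]


def voronoi_partition(points, idx, k, rng):
    """Pick k random centers from idx and assign each index to its nearest center."""
    centers = rng.sample(idx, k)
    blocks = {c: [] for c in centers}
    for i in idx:
        # A center always stays in its own block, so no block is empty.
        c = i if i in blocks else min(
            centers, key=lambda c: math.dist(points[i], points[c]))
        blocks[c].append(i)
    return centers, [blocks[c] for c in centers]


def mrec(X, Y, idx_x, idx_y, k=4, threshold=8, matcher=greedy_match,
         rng=None, out=None):
    """Recursively match the points X[idx_x] to the points Y[idx_y].

    Requires 2 <= k <= threshold. Returns a dict from X indices to Y indices.
    """
    rng = random.Random(0) if rng is None else rng
    out = {} if out is None else out
    if len(idx_x) <= threshold or len(idx_y) <= threshold:
        # Small enough: run the black-box matcher on the points directly.
        m = matcher([X[i] for i in idx_x], [Y[j] for j in idx_y])
        for i, j in zip(idx_x, m):
            out[i] = idx_y[j]
        return out
    cx, bx = voronoi_partition(X, idx_x, k, rng)
    cy, by = voronoi_partition(Y, idx_y, k, rng)
    # Match the partitions through their centers, then recurse per matched pair.
    m = matcher([X[c] for c in cx], [Y[c] for c in cy])
    for a, b in enumerate(m):
        mrec(X, Y, bx[a], by[b], k, threshold, matcher, rng, out)
    return out


def mean_distortion(X, Y, match):
    """Average |d_X(x, x') - d_Y(f(x), f(x'))| over pairs; 0 means distances are preserved."""
    ids = sorted(match)
    total, count = 0.0, 0
    for a in range(len(ids)):
        for b in range(a + 1, len(ids)):
            i, j = ids[a], ids[b]
            total += abs(math.dist(X[i], X[j])
                         - math.dist(Y[match[i]], Y[match[j]]))
            count += 1
    return total / count if count else 0.0


# Usage: try several partition sizes and keep the matching with the lowest score.
rng = random.Random(1)
X = [(rng.random(), rng.random()) for _ in range(80)]
Y = [(x + 0.01, y - 0.01) for x, y in X]
ids = list(range(80))
best_k = min((2, 4, 8), key=lambda k: mean_distortion(X, Y, mrec(X, Y, ids, ids, k=k)))
print("selected k:", best_k)
```

Because the score is computed from the matching alone, the same loop could in principle compare different partitioning procedures or matchers as well as different values of `k`, which is the kind of parameter optimization the abstract describes.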
