Machine learning based imputation techniques for estimating phylogenetic trees from incomplete distance matrices

Background: Owing to recent advances in sequencing technologies and in species tree estimation methods that account for gene tree discordance, notable progress has been made in constructing large-scale phylogenetic trees from genome-wide data. However, substantial challenges remain in leveraging this huge amount of molecular data. Foremost among these challenges is the need for efficient tools that can handle missing data. Popular distance-based methods such as neighbor joining and UPGMA require that the input distance matrix contain no missing values.

Results: We introduce two highly accurate machine learning-based distance imputation techniques. One approach is based on matrix factorization, and the other is an autoencoder-based deep learning technique. We evaluate these two techniques on a collection of simulated and biological datasets, and show that they are more accurate and robust than the best alternative technique for distance imputation. Moreover, our proposed techniques can handle substantial amounts of missing data, including levels at which the best alternative method fails.

Conclusions: This study shows for the first time the power and feasibility of applying deep learning techniques to the imputation of distance matrices. Our proposed deep learning framework is highly accurate and scalable to large datasets. We have made these techniques freely available as cross-platform software (available at https://github.com/Ananya-Bhattacharjee/ImputeDistances).
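To make the matrix factorization idea concrete, the following is a minimal sketch of how missing entries of a distance matrix might be filled in by fitting a low-rank factorization to the observed entries only. It is an illustration under stated assumptions, not the authors' implementation: the function name `impute_distances`, the rank, the learning rate, the regularization weight, and the plain gradient-descent fitting loop are all hypothetical choices, and missing distances are assumed to be marked as NaN in a symmetric matrix.

```python
import numpy as np


def impute_distances(D, rank=2, steps=2000, lr=0.01, reg=0.01, seed=0):
    """Fill NaN entries of a symmetric distance matrix (illustrative sketch).

    Fits D ~ U @ V.T on the observed entries by gradient descent, then
    uses the fitted product to fill the missing entries. All
    hyperparameters here are assumptions made for illustration.
    """
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    observed = ~np.isnan(D)               # mask of known distances
    target = np.where(observed, D, 0.0)   # placeholder zeros where missing

    rng = np.random.default_rng(seed)
    U = rng.normal(scale=0.1, size=(n, rank))
    V = rng.normal(scale=0.1, size=(n, rank))

    for _ in range(steps):
        # Reconstruction error, counted only on observed entries.
        err = (U @ V.T - target) * observed
        grad_U = err @ V + reg * U
        grad_V = err.T @ U + reg * V
        U -= lr * grad_U
        V -= lr * grad_V

    approx = U @ V.T
    approx = (approx + approx.T) / 2.0    # enforce symmetry
    # Keep observed distances; use non-negative estimates elsewhere.
    filled = np.where(observed, D, np.maximum(approx, 0.0))
    np.fill_diagonal(filled, 0.0)
    return filled


if __name__ == "__main__":
    # Toy 4-taxon matrix with one missing pair (taxa 2 and 3).
    toy = np.array([
        [0.0, 0.3, 0.5, 0.6],
        [0.3, 0.0, 0.6, 0.7],
        [0.5, 0.6, 0.0, np.nan],
        [0.6, 0.7, np.nan, 0.0],
    ])
    print(impute_distances(toy))
```

The completed matrix can then be passed to a distance-based method such as neighbor joining or UPGMA, which, as noted above, require a matrix without missing values. The quality of the imputed values depends heavily on the chosen rank and on how much of the matrix is observed, so this sketch should not be read as indicative of the accuracy reported in the paper.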
