Reconstructing evolutionary trees in parallel for massive sequences

Background
Building evolutionary trees for massive unaligned DNA sequences is both crucial and challenging. Reconstructing an evolutionary tree for ultra-large sequence sets is hard, and massive multiple sequence alignment is itself demanding in both time and space. The recently developed Hadoop and Spark platforms offer new opportunities for such classical computational biology problems. In this paper, we address multiple sequence alignment and evolutionary tree reconstruction in parallel.

Results
HPTree, the tool developed in this paper, can process large DNA sequence files quickly. It works well on files larger than 1 GB and performs better than other evolutionary reconstruction tools. Users can run HPTree to reconstruct evolutionary trees on computer clusters or cloud platforms (e.g., Amazon Cloud). HPTree can support population evolution research and metagenomics analysis.

Conclusions
In this paper, we employ the Hadoop and Spark platforms to design an evolutionary tree reconstruction software tool for massive unaligned DNA sequences. Clustering and multiple sequence alignment are done in parallel, and the neighbour-joining model is employed for building the evolutionary tree. The software and its source code are available at http://lab.malab.cn/soft/HPtree/.
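For readers unfamiliar with the neighbour-joining model mentioned above, the following is a minimal single-machine sketch of the classical algorithm in Python. It is not HPTree's implementation: the function name `neighbor_joining`, the `frozenset`-keyed distance dictionary, and the `node0`, `node1`, ... labels for internal nodes are all assumptions made for illustration, and the sketch assumes at least two taxa and a precomputed pairwise distance matrix.

```python
from itertools import combinations


def neighbor_joining(names, dist):
    """Build an unrooted tree with classical neighbour joining.

    names: list of taxon labels (at least two; hypothetical input format).
    dist:  dict mapping frozenset({x, y}) -> pairwise distance.
    Returns a list of (node, node, branch_length) edges.
    """
    d = dict(dist)        # work on a copy so the caller's matrix is untouched
    nodes = list(names)
    edges = []
    next_id = 0

    while len(nodes) > 2:
        n = len(nodes)
        # Net divergence: sum of distances from each node to all others.
        r = {i: sum(d[frozenset((i, k))] for k in nodes if k != i)
             for i in nodes}
        # Join the pair minimising Q(i, j) = (n - 2) * d(i, j) - r_i - r_j.
        i, j = min(
            combinations(nodes, 2),
            key=lambda p: (n - 2) * d[frozenset(p)] - r[p[0]] - r[p[1]],
        )
        dij = d[frozenset((i, j))]
        u = f"node{next_id}"          # label for the new internal node
        next_id += 1
        # Branch lengths from the joined pair to the new node.
        li = 0.5 * dij + (r[i] - r[j]) / (2 * (n - 2))
        lj = dij - li
        edges.append((i, u, li))
        edges.append((j, u, lj))
        # Distances from the new node to every remaining node.
        for k in nodes:
            if k not in (i, j):
                d[frozenset((u, k))] = 0.5 * (
                    d[frozenset((i, k))] + d[frozenset((j, k))] - dij
                )
        nodes = [k for k in nodes if k not in (i, j)] + [u]

    # Connect the last two nodes directly.
    a, b = nodes
    edges.append((a, b, d[frozenset((a, b))]))
    return edges
```

Each iteration recomputes the full Q criterion, so this sketch takes cubic time in the number of taxa. That cost is one reason a tool aimed at massive inputs would cluster the sequences first and build trees per cluster, as the abstract describes.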
