Benchmarking of alignment-free sequence comparison methods

Background: Alignment-free (AF) sequence comparison is attracting persistent interest driven by data-intensive applications. Hence, many AF procedures have been proposed in recent years, but a lack of a clearly defined benchmarking consensus hampers their performance assessment.

Results: Here, we present a community resource (http://afproject.org) to establish standards for comparing alignment-free approaches across different areas of sequence-based research. We characterize 74 AF methods available in 24 software tools for five research applications, namely, protein sequence classification, gene tree inference, regulatory element detection, genome-based phylogenetic inference, and reconstruction of species trees under horizontal gene transfer and recombination events.

Conclusion: The interactive web service allows researchers to explore the performance of alignment-free tools relevant to their data types and analytical goals. It also allows method developers to assess their own algorithms and compare them with current state-of-the-art tools, accelerating the development of new, more accurate AF solutions.
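For readers new to the area, the following is a minimal sketch of what one common family of AF measures computes: a distance between the k-mer (word) frequency profiles of two sequences. It is an illustration only and is not taken from the benchmarked tools or from the AFproject service. The function names `kmer_frequencies` and `af_distance`, the default word length `k=3`, and the choice of Euclidean distance are all assumptions made for this example.

```python
from collections import Counter
from math import sqrt


def kmer_frequencies(seq, k):
    """Return the relative frequency of every overlapping k-mer in seq.

    Hypothetical helper for illustration; real tools differ in how they
    handle ambiguous characters, reverse complements, and normalization.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    if total == 0:
        # Sequence shorter than k: no words to count.
        return {}
    return {word: n / total for word, n in counts.items()}


def af_distance(seq_a, seq_b, k=3):
    """Euclidean distance between two k-mer frequency profiles.

    No alignment is computed: only word composition is compared.
    """
    freq_a = kmer_frequencies(seq_a, k)
    freq_b = kmer_frequencies(seq_b, k)
    words = set(freq_a) | set(freq_b)
    return sqrt(sum((freq_a.get(w, 0.0) - freq_b.get(w, 0.0)) ** 2
                    for w in words))


if __name__ == "__main__":
    # Toy sequences, invented for this example.
    print(af_distance("ACGTACGTAC", "ACGTTCGTAC", k=3))
```

A pairwise distance matrix built this way could then be passed to a tree-building or classification step, which is the kind of downstream task the five benchmark applications evaluate.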
