Benchmarking of alignment-free sequence comparison methods

Alignment-free (AF) sequence comparison is attracting persistent interest, driven by data-intensive applications. Many AF procedures have therefore been proposed in recent years, but the lack of a clearly defined benchmarking consensus hampers assessment of their performance. Here, we present a community resource (http://afproject.org) to establish standards for comparing alignment-free approaches across different areas of sequence-based research. We characterize 74 AF methods available in 24 software tools for five research applications: protein sequence classification, gene tree inference, regulatory element detection, genome-based phylogenetic inference, and reconstruction of species trees under horizontal gene transfer and recombination events. The interactive web service allows researchers to explore the performance of alignment-free tools relevant to their data types and analytical goals. It also allows method developers to assess their own algorithms and compare them with current state-of-the-art tools, accelerating the development of new, more accurate AF solutions.
