Computational complexity of algorithms for sequence comparison, short-read assembly and genome alignment

Many algorithms for sequence comparison, short-read assembly and whole-genome alignment have been developed in molecular biology to support high-throughput sequencing technology, applications in genome biology and fundamental research in comparative genomics. The computational complexity of these algorithms has been reported in the original research papers, yet this often neglected property has not been reviewed systematically or for a wider audience. We review the space and time complexity of key sequence analysis algorithms and summarize their properties, in order to identify opportunities for further research on algorithm and data structure optimization. Complexity is poised to become pivotal as genomic data continue to grow in scale and complexity to unprecedented levels, and as robust biological simulation at the cell level and above becomes a reality.
