A survey of genome sequence assembly techniques and algorithms using high-performance computing

Genome assembly has been an area of active research since the structure of DNA was discovered, and it gathered further momentum after the Human Genome Project was launched. A large number of genomes have been assembled and many more are in the pipeline. A number of full-scale assemblers and special-purpose modules have been reported. Because the volume of data involved in genome assembly is extraordinarily large and demands substantial computational power and processing time, many assemblers use parallel computing to reconstruct the DNA sequence faster and more efficiently. Genome assembly is a multi-step process whose components may be partly or fully parallelized. Several assemblers and individual modules that perform tasks such as pairwise alignment, multiple sequence alignment, and repeat finding have been analyzed and documented before. This paper instead provides a holistic view of the assembly process in the realm of parallel and distributed computing. It organizes the individual tasks related to, but not limited to, whole-genome shotgun sequencing into a sequence of loosely coupled stages, where each stage consumes the output of the preceding stage and passes its results to the next. Many of these tasks are essential to current and next-generation sequence assemblers. The paper walks through the entire streamlined process, describing, analyzing, and commenting on the algorithms and techniques that have been designed and implemented for each stage. Where applicable, it suggests improvements that may form the basis of new research.
