Multiple Sequence Alignment Based on a Suffix Tree and Center-Star Strategy: A Linear Method for Multiple Nucleotide Sequence Alignment on Spark Parallel Framework

Multiple sequence alignment (MSA) is an essential prerequisite and dominant method to deduce the biological facts from a set of molecular biological sequences. It refers to a series of algorithmic solutions for the alignment of evolutionarily related sequences while taking into account evolutionary events such as mutations, insertions, deletions, and rearrangements under certain conditions. These methods can be applied to DNA, RNA, or protein sequences. In this work, we take advantage of a center-star strategy to reduce the MSA problem to pairwise alignments, and we use a suffix tree to match identical substrings between two pairwise sequences. Multiple sequence alignment based on a suffix tree and center-star strategy (MASC) can accomplish MSA in O(mn), which is linear time complexity, where m is the number of sequences and n is the average length of sequences. Furthermore, we execute our method on the Spark-distributed parallel framework to deal with ever-increasing massive data sets. Our method is significantly faster than previous techniques, with no loss in accuracy for highly similar nucleotide sequences like homologous sequences, which we experimentally demonstrate. Comparing with mainstream MSA tools (e.g., MAFFT), MASC could finish the alignment of 67,200 sequences, longer than 10,000 bps, in 9 minutes, which takes MAFFT >3.5 days.

[1]  Hidetoshi Shimodaira,et al.  Mitochondrial genome variation in eastern Asia and the peopling of Japan. , 2004, Genome research.

[2]  Cédric Notredame,et al.  Multiple sequence alignment modeling: methods and applications , 2016, Briefings Bioinform..

[3]  Michael J. Keiser,et al.  Large Scale Prediction and Testing of Drug Activity on Side-Effect Targets , 2012, Nature.

[4]  Tao Jiang,et al.  On the Complexity of Multiple Sequence Alignment , 1994, J. Comput. Biol..

[5]  P. Hogeweg,et al.  The alignment of sets of sequences and the construction of phyletic trees: An integrated method , 2005, Journal of Molecular Evolution.

[6]  D. Higgins,et al.  T-Coffee: A novel method for fast and accurate multiple sequence alignment. , 2000, Journal of molecular biology.

[7]  Desmond G. Higgins,et al.  Evaluation of iterative alignment algorithms for multiple alignment , 2005, Bioinform..

[8]  Alfred V. Aho,et al.  Efficient string matching , 1975, Commun. ACM.

[9]  Qinghua Hu,et al.  HAlign: Fast multiple similar DNA/RNA sequence alignment based on the centre star strategy , 2015, Bioinform..

[10]  Richard E. Overill,et al.  Heterogeneous Computing Machines and Amdahl's Law , 1996, Parallel Comput..

[11]  M S Waterman,et al.  Identification of common molecular subsequences. , 1981, Journal of molecular biology.

[12]  S. B. Needleman,et al.  A general method applicable to the search for similarities in the amino acid sequence of two proteins. , 1970, Journal of molecular biology.

[13]  Sanjay Ghemawat,et al.  MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.

[14]  Hiroshi Kimura,et al.  The functional organization of mitochondrial genomes in human cells , 2004, BMC Biology.

[15]  Zhang Tao-tao,et al.  An Algorithm for DNA Multiple Sequence Alignment Based on Center Star Method and Keyword Tree , 2009 .

[16]  J. Thompson,et al.  CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. , 1994, Nucleic acids research.

[17]  Gaston H. Gonnet,et al.  Fast text searching for regular expressions or automaton searching on tries , 1996, JACM.

[18]  Scott Shenker,et al.  Spark: Cluster Computing with Working Sets , 2010, HotCloud.

[19]  O. Gotoh Significant improvement in accuracy of multiple protein sequence alignments by iterative refinement as assessed by reference to structural alignments. , 1996, Journal of molecular biology.

[20]  Esko Ukkonen,et al.  On-line construction of suffix trees , 1995, Algorithmica.

[21]  W R Taylor,et al.  Hierarchical method to align large numbers of biological sequences. , 1990, Methods in enzymology.

[22]  Christus,et al.  A General Method Applicable to the Search for Similarities in the Amino Acid Sequence of Two Proteins , 2022 .

[23]  Yi Jiang,et al.  A Novel Center Star Multiple Sequence Alignment Algorithm Based on Affine Gap Penalty and K-Band , 2012 .

[24]  Robert C. Edgar,et al.  MUSCLE: a multiple sequence alignment method with reduced time and space complexity , 2004, BMC Bioinformatics.

[25]  Alex Thomo,et al.  A new method for indexing genomes using on-disk suffix trees , 2008, CIKM '08.

[26]  Olivier Poch,et al.  BAliBASE 3.0: Latest developments of the multiple sequence alignment benchmark , 2005, Proteins.

[27]  S. Salzberg,et al.  Alignment of whole genomes. , 1999, Nucleic acids research.

[28]  Muthu Dayalan,et al.  MapReduce : Simplified Data Processing on Large Cluster , 2018 .