Sequence Similarity Searching

Sequence similarity searching has become part of the daily routine of molecular biologists, bioinformaticians and biophysicists. As sequence databanks grow rapidly, this computational approach is commonly applied to determine the functions and structures of unannotated sequences, to investigate relationships between sequences, and to construct phylogenetic trees. We introduce the BLAST-based family of sequence similarity search tools, arguably the most popular such family. We explain basic concepts of sequence alignment and demonstrate how to search the current databanks using the web versions of BLASTP, PSI-BLAST and BLASTN. We also describe the standalone BLAST+ tool, and we discuss the inputs, parameter settings and outputs of these tools. Lastly, we cover recent advances in sequence similarity searching, focusing on the fast MMseqs2 method. © 2018 by John Wiley & Sons, Inc.
