Paper information - Robust sequence alignment using evolutionary rates coupled with an amino acid substitution matrix

Background: Selective pressures at the DNA level shape genes into profiles consisting of patterns of rapidly evolving sites and sites that withstand change. These profiles remain detectable even when protein sequences have diverged extensively. A common task in molecular biology is to infer functional, structural or evolutionary relationships by querying a database using an algorithm. However, problems arise when sequence similarity is low. This study presents an algorithm that uses the evolutionary rate at codon sites, the dN/dS (ω) parameter, coupled to a substitution matrix as an alignment metric for detecting distantly related proteins. The algorithm, called BLOSUM-FIRE, couples a newer and improved version of the original FIRE (Functional Inference using Rates of Evolution) algorithm with an amino acid substitution matrix in a dynamic scoring function. The enigmatic hepatitis B virus X protein was used as a test case for BLOSUM-FIRE and its associated database, EvoDB.

Results: The evolutionary rate based approach was coupled with a conventional BLOSUM substitution matrix. The two approaches are combined in a dynamic scoring function, which uses the selective pressure to score aligned residues. The dynamic scoring function is based on a coupled additive approach that scores aligned sites according to the level of conservation inferred from the ω values. Evaluation of this new implementation, BLOSUM-FIRE, using MAFFT alignments as reference alignments showed that it is more accurate than its predecessor, FIRE. Comparison of alignment quality with widely used algorithms (MUSCLE, T-COFFEE, and CLUSTAL Omega) revealed that BLOSUM-FIRE performs as well as conventional algorithms. Its main strength is its greater potential for aligning divergent sequences; it also addresses the problem of low specificity inherent in the original FIRE algorithm. The utility of this algorithm is demonstrated using the hepatitis B virus X (HBx) protein, a protein of unknown function, as a test case.

Conclusion: This study describes the utility of an evolutionary rate based approach coupled to the BLOSUM62 amino acid substitution matrix in inferring protein domain function. We demonstrate that such an approach is robust and performs as well as an array of conventional algorithms.
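The abstract does not give the form of the dynamic scoring function, so the following Python sketch only illustrates the general idea of a coupled additive score: a substitution matrix term plus a term derived from per-site ω values. Everything here is an assumption made for illustration, not the published BLOSUM-FIRE method: the function names, the `rate_weight` and `gap` parameters, the way ω similarity is converted to a score, and the linear gap penalty. Only a handful of BLOSUM62 entries are hard-coded to keep the example self-contained.

```python
# Hypothetical illustration only: NOT the published BLOSUM-FIRE scoring function.

# A few BLOSUM62 entries (the matrix is symmetric); a real implementation
# would load the full matrix instead of this hand-typed subset.
BLOSUM62_SUBSET = {
    ("A", "A"): 4, ("L", "L"): 4, ("I", "I"): 4, ("W", "W"): 11,
    ("A", "L"): -1, ("A", "I"): -1, ("A", "W"): -3,
    ("I", "L"): 2, ("L", "W"): -2, ("I", "W"): -3,
}


def matrix_score(a, b):
    """Look up the substitution score for residues a and b (order-independent)."""
    key = (a, b) if (a, b) in BLOSUM62_SUBSET else (b, a)
    return BLOSUM62_SUBSET[key]


def site_score(a, b, omega_a, omega_b, rate_weight=4.0):
    """Additive score: matrix term plus an assumed omega-similarity term.

    The rate term ranges from -rate_weight (omega values differ by 1 or more)
    to +rate_weight (identical omega values). This mapping is an assumption.
    """
    similarity = 1.0 - min(abs(omega_a - omega_b), 1.0)
    return matrix_score(a, b) + rate_weight * (2.0 * similarity - 1.0)


def global_alignment_score(seq_a, omegas_a, seq_b, omegas_b, gap=-4.0):
    """Needleman-Wunsch global alignment score using site_score and a linear gap cost."""
    n, m = len(seq_a), len(seq_b)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            aligned = dp[i - 1][j - 1] + site_score(
                seq_a[i - 1], seq_b[j - 1], omegas_a[i - 1], omegas_b[j - 1]
            )
            dp[i][j] = max(aligned, dp[i - 1][j] + gap, dp[i][j - 1] + gap)
    return dp[n][m]
```

In this sketch, two sites with the same residues can score differently depending on whether their ω values agree, which is the sense in which the rate information can help when residue identity alone is uninformative. How the actual algorithm weights the two terms is not stated in the abstract.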

Scott Hazelhurst | Andrew Ndhlovu | Pierre M. Durand
