Aligning amino acid sequences: Comparison of commonly used methods

Summary. We examined two extensive families of protein sequences using four different alignment schemes that employ various degrees of “weighting” in order to determine which approach is most sensitive in establishing relationships. All alignments used a similarity approach based on a general algorithm devised by Needleman and Wunsch. The approaches included a simple program, UM (unitary matrix), in which only identities are scored; a scheme in which the genetic code is used as a basis for weighting (GC); another that employs a matrix based on the structural similarity of amino acids taken together with the genetic basis of mutation (SG); and a fourth that uses the empirical log-odds matrix (LOM) developed by Dayhoff on the basis of observed amino acid replacements. The two sequence families examined were (a) nine different globins and (b) nine different tyrosine kinase-like proteins. It was assumed a priori that all members of a family share common ancestry. In cases where two sequences were more than 30% identical, alignments by all four methods were almost always the same. In cases where the percentage identity was less than 20%, however, there were often significant differences in the alignments. On average, the Dayhoff LOM approach was the most effective in verifying distant relationships, as judged by an empirical “jumbling test.” This was not universally the case, however, and in some instances the simple UM was actually as good or better. Trees constructed on the basis of the various alignments differed with regard to their limb lengths, but had essentially the same branching orders. We suggest some reasons for the differing effectiveness of the four approaches in the two sequence settings, and offer some rules of thumb for assessing the significance of sequence relationships.
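The kind of comparison described above can be illustrated with a minimal Python sketch. It is not the authors' program: it assumes a Needleman-Wunsch-style global alignment with a simple linear gap penalty, uses only the unitary matrix (UM) scoring in which identities count 1 and everything else 0, and assumes that the "jumbling test" means shuffling one sequence repeatedly and expressing the real score in standard deviations above the mean of the shuffled scores. The names `nw_score`, `unitary`, and `jumbling_test`, the gap penalty of -1, and the number of trials are all hypothetical choices made for illustration.

```python
import math
import random
import statistics


def nw_score(a, b, score, gap=-1):
    """Best global alignment score by dynamic programming.

    Assumption: a linear gap penalty (`gap` per gapped position); the
    abstract does not state how gaps were penalized.
    """
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            cur.append(max(
                prev[j - 1] + score(a[i - 1], b[j - 1]),  # align two residues
                prev[j] + gap,                            # gap in b
                cur[j - 1] + gap,                         # gap in a
            ))
        prev = cur
    return prev[-1]


def unitary(x, y):
    """Unitary matrix (UM): only identities are scored."""
    return 1 if x == y else 0


def jumbling_test(a, b, score, gap=-1, trials=100, seed=0):
    """Compare the real score with scores obtained after shuffling b.

    Returns (real, mean, sd, z), where z is the real score expressed in
    standard deviations above the mean of the shuffled comparisons.
    """
    rng = random.Random(seed)
    real = nw_score(a, b, score, gap)
    residues = list(b)
    shuffled = []
    for _ in range(trials):
        rng.shuffle(residues)  # same composition, scrambled order
        shuffled.append(nw_score(a, "".join(residues), score, gap))
    mean = statistics.mean(shuffled)
    sd = statistics.stdev(shuffled)
    z = (real - mean) / sd if sd > 0 else math.nan
    return real, mean, sd, z
```

A weighted scheme such as GC, SG, or LOM would be obtained by passing a different `score` function, for example a lookup into a 20 x 20 table; the tables themselves are not reproduced here. Calling `jumbling_test(seq_a, seq_b, unitary)` on two sequences returns the real score together with the shuffled mean, standard deviation, and the resulting number of standard deviations.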

[1] G. Braunitzer, et al. Hämoglobine, XXVII. Die Sequenz des monomeren Hämoglobins III von Myxine glutinosa L.: Ein neuer Hämkomplex: E7 Glutamin, E11 Isoleucin, 1979.

[2] F. Galibert, et al. Nucleotide sequences of feline retroviral oncogenes (v-fes) provide evidence for a family of tyrosine-specific protein kinase genes, 1982, Cell.

[3] W. A. Beyer, et al. Some Biological Sequence Metrics, 1976.

[5] A. McLachlan. Tests for comparing related amino-acid sequences. Cytochrome c and cytochrome c 551, 1971, Journal of molecular biology.

[6] M. O. Dayhoff, et al. Viral src gene products are related to the catalytic chain of mammalian cAMP-dependent protein kinase, 1982, Proceedings of the National Academy of Sciences of the United States of America.

[7] IMPLICATIONS OF MINIMAL LENGTH TREES, 1982.

[8] J. R. Fresco, et al. Nucleotide Sequence, 2020, Definitions.

[9] G. Moore, et al. The phylogeny of human globin genes investigated by the maximum parsimony method, 1974, Journal of Molecular Evolution.

[10] M. Yoshida, et al. Avian sarcoma virus Y73 genome sequence and structural similarity of its transforming gene product to that of Rous sarcoma virus, 1982, Nature.

[11] E. Reddy, et al. Nucleotide sequence of Abelson murine leukemia virus genome: structural similarity of its transforming gene product to other onc gene products with tyrosine-specific kinase activity, 1983, Proceedings of the National Academy of Sciences of the United States of America.

[12] W. Fitch, et al. An examination of the expected degree of sequence similarity that might arise in proteins that have converged to similar conformational states. The impact of such expectations on the search for homology between the structurally similar domains of rhodanese, 1981, Journal of molecular biology.

[13] W. Fitch, et al. Construction of phylogenetic trees, 1967, Science.

[14] J. Haber, et al. An evaluation of the relatedness of proteins based on comparison of amino acid sequences, 1970, Journal of molecular biology.

[15] O. Gotoh. An improved algorithm for matching biological sequences, 1982, Journal of molecular biology.

[16] W. Gilbert, et al. Nucleotide sequence of Rous sarcoma virus, 1983, Cell.

[17] M. Shibuya, et al. Nucleotide sequence of Fujinami sarcoma virus: evolutionary relationship of its transforming gene with transforming genes of other sarcoma viruses, 1982, Cell.

[18] A. Riggs, et al. The amino acid sequence of a major polypeptide chain of earthworm hemoglobin, 1982, The Journal of biological chemistry.

[19] W. Hol, et al. The covalent and tertiary structure of bovine liver rhodanese, 1978, Nature.

[20] T. Takagi, et al. Amino acid sequence of dimeric myoglobin from Cerithidea rhizophorarum, 1983, Biochimica et biophysica acta.

[21] Brian W. Kernighan, et al. The C Programming Language, 1978.

[22] M. O. Dayhoff, et al. Establishing homologies in protein sequences, 1983, Methods in enzymology.

[23] F. H. Reynolds, et al. Structure and biological activity of v-raf, a unique oncogene transduced by a retrovirus, 1983, Proceedings of the National Academy of Sciences of the United States of America.

[24] R. Doolittle. Similar amino acid sequences: chance or common ancestry?, 1981, Science.

[25] Russell F. Doolittle, et al. Nucleotide sequence and formation of the transforming gene of a mouse sarcoma virus, 1981, Nature.

[26] S. B. Needleman, et al. A general method applicable to the search for similarities in the amino acid sequence of two proteins, 1970, Journal of molecular biology.

[27] M. Waterman, et al. Comparative biosequence metrics, 2005, Journal of Molecular Evolution.

[28] A. McLachlan, et al. Repeating sequences and gene duplication in proteins, 1972, Journal of molecular biology.

[29] R. D. Wade, et al. Complete amino acid sequence of the catalytic subunit of bovine cardiac muscle cyclic AMP-dependent protein kinase, 1981, Proceedings of the National Academy of Sciences of the United States of America.

[30] W. Fitch. An improved method of testing for evolutionary homology, 1966, Journal of molecular biology.

[31] S. Reed, et al. Primary structure homology between the product of yeast cell division control gene CDC28 and vertebrate oncogenes, 1984, Nature.

[32] T. Takagi, et al. Amino acid sequence of the smallest polypeptide chain containing heme of extracellular hemoglobin from the polychaete Tylorrhynchus heterochaetus, 1982.

[33] P. Sellers. On the Theory and Computation of Evolutionary Distances, 1974.

[34] F. Galibert, et al. Nucleotide sequence of the feline retroviral oncogene v-fms shows unexpected homology with oncogenes encoding tyrosine-specific protein kinases, 1984, Proceedings of the National Academy of Sciences of the United States of America.

[35] R. Stephens, et al. Nucleotide sequence of v-rel: the oncogene of reticuloendotheliosis virus, 1983, Proceedings of the National Academy of Sciences of the United States of America.