Defining a similarity threshold for a functional protein sequence pattern: The signal peptide cleavage site

When preparing data sets of amino acid or nucleotide sequences, it is necessary to exclude redundant or homologous sequences in order to avoid overestimating the predictive performance of an algorithm. Methods for doing this have been available for some time in the area of protein structure prediction. We have developed a similar procedure, based on pairwise alignments, for sequences with functional sites. We show how a correlation coefficient between sequence similarity and functional homology can be used to compare the efficiency of different similarity measures and to choose a nonarbitrary threshold value for excluding redundant sequences. The impact of the choice of scoring matrix used in the alignments is examined. We demonstrate that the parameter determining the quality of the correlation is the relative entropy of the matrix, rather than the assumed (PAM or identity) substitution model. Results are presented for the case of prediction of cleavage sites in signal peptides. Inspection of the false positives revealed several errors in the database. The procedure presented may be used as a general outline for finding a problem‐specific similarity measure and threshold value for the analysis of other functional amino acid or nucleotide sequence patterns.
