Performance comparison of generalized PSSM in signal peptide cleavage site and disulfide bond recognition

We generalize the familiar position-specific score matrix (PSSM), also known as a weight matrix, by considering a log-odds score for (nonadjacent) k-tuple frequencies, where each k-tuple score is weighted by the product of its mutual information and its statistical significance, the latter measured by a point estimator for the p-value of the mutual information. Performance of this new approach, along with other variants of generalized PSSM and profile methods, is measured by receiver operating characteristic (ROC) curves for the specific problem of signal peptide cleavage site recognition. We additionally compare Vert's recent support vector machine string kernel, Brown's joint probability approximation algorithm, and the weight array method (WAM). Similar, though less extensive, algorithm comparisons are made for disulfide bond recognition. For signal peptide cleavage site recognition, the monoresidue PSSM is essentially competitive, within the limits of statistical significance, even against Vert's support vector machine kernel. For disulfide bond recognition, by contrast, diresidue and triresidue PSSM methods display improved performance over the monoresidue PSSM.
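
To make the weighting idea concrete, the following is a minimal Python sketch of a diresidue (k = 2) variant, not the authors' implementation. It assumes aligned, equal-length training windows, a uniform background distribution, add-one pseudocounts, and log base 2; none of these choices are stated in the abstract. The function names (pair_freqs, mutual_information, train_dipssm, score) are hypothetical, and the significance factor is left as an optional user-supplied callable because the abstract does not give the form of the p-value point estimator.

```python
import math
from collections import Counter
from itertools import combinations

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def pair_freqs(seqs, i, j, pseudo=1.0):
    """Pseudocount-smoothed joint frequencies of residues at positions i, j."""
    counts = Counter((s[i], s[j]) for s in seqs)
    total = len(seqs) + pseudo * len(ALPHABET) ** 2
    return {(a, b): (counts[(a, b)] + pseudo) / total
            for a in ALPHABET for b in ALPHABET}


def mutual_information(seqs, i, j):
    """Empirical mutual information (bits) between positions i and j."""
    n = len(seqs)
    ci = Counter(s[i] for s in seqs)
    cj = Counter(s[j] for s in seqs)
    cij = Counter((s[i], s[j]) for s in seqs)
    # p_ab * log2(p_ab / (p_a * p_b)), written in terms of raw counts
    return sum((c / n) * math.log2(c * n / (ci[a] * cj[b]))
               for (a, b), c in cij.items())


def train_dipssm(seqs, background, significance=None):
    """Build weighted log-odds tables for every position pair (i, j).

    `significance` is a hypothetical hook: a callable (seqs, i, j) -> float
    standing in for the statistical-significance factor; if None, the
    weight is the mutual information alone.
    """
    model = {}
    for i, j in combinations(range(len(seqs[0])), 2):
        weight = mutual_information(seqs, i, j)
        if significance is not None:
            weight *= significance(seqs, i, j)
        table = {ab: math.log2(f / (background[ab[0]] * background[ab[1]]))
                 for ab, f in pair_freqs(seqs, i, j).items()}
        model[(i, j)] = (weight, table)
    return model


def score(model, window):
    """Sum of weighted pair log-odds scores for one candidate window."""
    return sum(w * table[(window[i], window[j])]
               for (i, j), (w, table) in model.items())


# Toy data (made up for illustration): positions 0 and 1 co-vary perfectly.
toy_seqs = ["AKL", "AKL", "DEL", "DEL"]
uniform = {a: 1.0 / len(ALPHABET) for a in ALPHABET}
toy_model = train_dipssm(toy_seqs, uniform)
```

In this toy model only the pair (0, 1) carries nonzero mutual information, so a window such as "AKL" that reproduces the observed co-variation scores higher than "AEL", which mixes residues from the two training patterns. A monoresidue PSSM, by construction, cannot capture that distinction.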
