Automated motif discovery in protein structure prediction

The protein structure prediction problem (PSP) is one of the central problems in molecular and structural biology. A computational method that could produce a correct, detailed three-dimensional structural model of a protein from its linear sequence of amino acids would greatly accelerate progress in the biomedical sciences and industries. This thesis presents PSP as a combinatorial optimization problem, the most straightforward formulations of which require searching an exponentially large conformation space and are known to be NP-hard. In practice, this otherwise intractable search can be reduced or eliminated through the discovery and use of motifs. Motifs are abstractions of observed patterns that encode structurally important relationships among the constituent parts of a complex object such as a protein tertiary structure. Motif discovery is accomplished by particular combinatorial search and statistical estimation methods. This thesis explores two motif discovery subproblems in detail and discusses how their solutions can be applied to the overall structure prediction problem.

(1) For a complex multi-stage prediction task, what makes a good intermediate representation language? We address this question by presenting and analyzing methods for discovering protein secondary structure classes that are more predictable from amino acid sequence than the standard classes of $\alpha$-helix, $\beta$-sheet, and "random coil".

(2) Given a database of $M$ objects, each characterized by values $a_{ij} \in \mathcal{A}_j$ for each of $N$ discrete variables $\{c_j\}_{j=1}^{N}$, return the list of "most interesting" higher-order features $\gamma_l$, i.e., the sets of $k_l$ variables with the highest estimated correlation, for any $2 \le k_l \le N$. In the PSP context, this amounts to detecting correlations between amino acid residues in an aligned set of evolutionarily related protein sequences. We present and analyze a fast procedure, based on multinomial sampling and a novel coding scheme, that avoids the exhaustive search, the prior limits on the order $k$, and the exponentially large parameter space of other methods.

The focus of this thesis is PSP, but the techniques and analysis are also aimed at wider application to other hard, multi-stage prediction problems.
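
To make subproblem (2) concrete, the following is a minimal sketch of the simplest baseline for it: exhaustively scoring every pair of alignment columns ($k = 2$) by a plug-in estimate of their mutual information. It is not the sampling-based procedure of the thesis; it is the kind of exhaustive, fixed-order search that procedure is designed to avoid, and it is shown only to illustrate what "correlated columns" means. The function names, the choice of mutual information as the correlation measure, and the toy alignment are all assumptions made for this illustration.

```python
from collections import Counter
from itertools import combinations
from math import log2


def column(alignment, j):
    """Return column j of an alignment given as equal-length strings."""
    return [row[j] for row in alignment]


def mutual_information(xs, ys):
    """Plug-in estimate (in bits) of the mutual information of two columns."""
    n = len(xs)
    px = Counter(xs)            # marginal counts for the first column
    py = Counter(ys)            # marginal counts for the second column
    pxy = Counter(zip(xs, ys))  # joint counts
    mi = 0.0
    for (x, y), c in pxy.items():
        # p(x,y) * log2(p(x,y) / (p(x) * p(y))), written in terms of counts
        mi += (c / n) * log2(c * n / (px[x] * py[y]))
    return mi


def top_correlated_pairs(alignment, top=3):
    """Exhaustively score all column pairs (k = 2); return the best `top`."""
    n_cols = len(alignment[0])
    scored = [
        (mutual_information(column(alignment, i), column(alignment, j)), i, j)
        for i, j in combinations(range(n_cols), 2)
    ]
    # Highest score first; ties broken by column indices for determinism.
    scored.sort(key=lambda t: (-t[0], t[1], t[2]))
    return scored[:top]


# Hypothetical toy alignment (M = 6 sequences, N = 4 columns).
# Columns 0 and 3 covary: A always pairs with L, G always pairs with V.
toy_alignment = [
    "AKCL",
    "AKDL",
    "GKCV",
    "GRDV",
    "ARCL",
    "GKDV",
]

if __name__ == "__main__":
    for score, i, j in top_correlated_pairs(toy_alignment):
        print(f"columns ({i}, {j}): MI = {score:.3f} bits")
```

This baseline evaluates all $\binom{N}{2}$ column pairs, and extending it to sets of $k$ columns would require $\binom{N}{k}$ evaluations with joint tables that grow exponentially in $k$. That cost is the motivation for a procedure that places no prior limit on $k$ and does not search exhaustively.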
