A primer in macromolecular linguistics

Polymeric macromolecules, when viewed abstractly as strings of symbols, can be treated in terms of formal language theory, which provides a mathematical foundation for characterizing such strings both as collections and in terms of their individual structures. In addition, this approach offers a framework for analyzing macromolecules with tools and conventions widely used in computational linguistics. This article introduces the ways that linguistics can be, and has been, applied to molecular biology, covering the relevant formal language theory at a relatively nontechnical level. Analogies between macromolecules and human natural language are used to provide intuitive insights into the relevance of grammars, parsing, and analysis of language complexity to biology. © 2012 Wiley Periodicals, Inc. Biopolymers 99: 203–217, 2013.
