Computational Biology and Language

Current scientific research is characterized by increasing specialization, and knowledge accumulates rapidly through parallel advances in a multitude of sub-disciplines. Recent estimates suggest that human knowledge doubles every two to three years, and with advances in information and communication technologies this wide body of scientific knowledge is available to anyone, anywhere, anytime. This situation may be referred to as ambient intelligence: an environment characterized by plentiful and available knowledge. The bottleneck in utilizing this knowledge is not accessing the information but assimilating it and transforming it to suit the needs of a specific application. The increasingly specialized areas of scientific research often share the goal of converting data into insight, allowing solutions to scientific problems to be identified. Because of this common goal, there are strong parallels between different areas of application that can be exploited to cross-fertilize disciplines. For example, the same fundamental statistical methods are used extensively in speech and language processing, materials science, visual processing and biomedicine. Each sub-discipline has developed its own specialized methodologies that make these statistical methods successful for the given application. Unifying specialized areas is possible because many different problems share strong analogies, so that theories developed for one problem are applicable to other areas of research. The goal of this paper is to demonstrate the utility of merging two disparate areas of application to advance scientific research. The merging process requires cross-disciplinary collaboration so that advances in one sub-discipline can be exploited fully in another. We demonstrate this general concept with the specific example of merging language technologies and computational biology.
