Data Mining Tools for Biological Sequences

We describe a methodology, as well as some related data mining tools, for analyzing sequence data. The methodology comprises three steps: (a) generating candidate features from the sequences, (b) selecting relevant features from the candidates, and (c) integrating the selected features to build a system to recognize specific properties in sequence data. We also give relevant techniques for each of these three steps. For generating candidate features, we present various types of features based on the idea of k-grams. For selecting relevant features, we discuss signal-to-noise, t-statistics, and entropy measures, as well as a correlation-based feature selection method. For integrating selected features, we use machine learning methods, including C4.5, SVM, and Naive Bayes. We illustrate this methodology on the problem of recognizing translation initiation sites. We discuss how to generate and select features that are useful for understanding the distinction between ATG sites that are translation initiation sites and those that are not. We also discuss how to use such features to build reliable systems for recognizing translation initiation sites in DNA sequences.
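As a minimal sketch of steps (a) and (b), the following Python code counts k-grams in fixed-width windows upstream and downstream of each ATG and ranks the resulting features by a signal-to-noise score. The function names, the window parameter `flank`, and the `up_`/`down_` feature prefixes are illustrative assumptions rather than the authors' implementation. The score uses one common definition of signal-to-noise, the absolute difference of class means divided by the sum of class standard deviations, and the paper's exact formula may differ.

```python
from collections import Counter
from itertools import product
from statistics import mean, pstdev

ALPHABET = "ACGT"  # assumed DNA alphabet; k-grams with other symbols are ignored


def kgram_counts(seq, k):
    """Count overlapping k-grams in seq, returning a count for every k-gram over ACGT."""
    seen = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return {"".join(g): seen.get("".join(g), 0) for g in product(ALPHABET, repeat=k)}


def atg_windows(seq, flank):
    """Yield (position, upstream, downstream) for each ATG that has `flank` bases on both sides."""
    start = seq.find("ATG")
    while start != -1:
        if start >= flank and start + 3 + flank <= len(seq):
            yield start, seq[start - flank:start], seq[start + 3:start + 3 + flank]
        start = seq.find("ATG", start + 1)


def candidate_features(upstream, downstream, k):
    """Step (a): build one feature per k-gram and side (hypothetical naming scheme)."""
    feats = {f"up_{g}": c for g, c in kgram_counts(upstream, k).items()}
    feats.update({f"down_{g}": c for g, c in kgram_counts(downstream, k).items()})
    return feats


def signal_to_noise(pos_values, neg_values):
    """|mean(pos) - mean(neg)| / (std(pos) + std(neg)), with a guard for zero spread."""
    diff = abs(mean(pos_values) - mean(neg_values))
    spread = pstdev(pos_values) + pstdev(neg_values)
    if spread == 0:
        return float("inf") if diff > 0 else 0.0
    return diff / spread


def rank_features(pos_rows, neg_rows):
    """Step (b): order feature names from most to least discriminating."""
    names = list(pos_rows[0])
    scores = {
        n: signal_to_noise([r[n] for r in pos_rows], [r[n] for r in neg_rows])
        for n in names
    }
    return sorted(names, key=lambda n: scores[n], reverse=True)
```

The top-ranked features from `rank_features` would then be passed to a classifier such as C4.5, SVM, or Naive Bayes for step (c). Labeling each ATG window as a true or false translation initiation site requires annotated sequences, which this sketch does not supply.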
