Machine learning methods for transcription data integration

Gene expression is modulated by transcription factors (TFs), proteins that generally bind to DNA adjacent to coding regions and initiate transcription. Each target gene can be regulated by more than one TF, and each TF can regulate many targets. A complete molecular understanding of transcriptional regulation therefore requires first associating each TF with the set of genes that it regulates. Here we summarize completed work on associating 104 TFs with their binding sites using support vector machines (SVMs), classification algorithms grounded in statistical learning theory. We train classifiers on several types of genomic data to predict TF binding in the yeast genome: motif matches, subsequence counts, motif conservation, functional annotation, and expression profiles. A simple weighting scheme varies the contribution of each data type when building a final SVM classifier, which we evaluate against known binding sites published in the literature and in online databases. The SVM algorithm performs best when all datasets are combined, covering 73% of known interactions with a prediction accuracy of almost 0.9. We also discuss new ideas and preliminary work for improving SVM classification of biological data.
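The abstract does not spell out how the weighting scheme is implemented. One common way to vary the contribution of each data type, assumed here purely for illustration, is to compute one kernel matrix per data type and take a weighted sum, which could then be handed to any SVM implementation that accepts a precomputed kernel. The feature tables, the weights, and all function names below are hypothetical toy values, not the authors' data or code.

```python
from math import sqrt

def linear_kernel(features):
    """Gram matrix of dot products between the rows of a feature table."""
    return [[sum(a * b for a, b in zip(x, y)) for y in features]
            for x in features]

def normalize_kernel(K):
    """Rescale K so every diagonal entry is 1, putting data types on a
    comparable scale. Assumes no all-zero feature row (nonzero diagonal)."""
    d = [sqrt(K[i][i]) for i in range(len(K))]
    n = len(K)
    return [[K[i][j] / (d[i] * d[j]) for j in range(n)] for i in range(n)]

def combine_kernels(kernels, weights):
    """Weighted sum of kernel matrices. Non-negative weights keep the
    result a valid (positive semidefinite) kernel."""
    if len(kernels) != len(weights):
        raise ValueError("need exactly one weight per kernel")
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    n = len(kernels[0])
    return [[sum(w * K[i][j] for K, w in zip(kernels, weights))
             for j in range(n)] for i in range(n)]

# Hypothetical toy data: three genes described by two data types.
motif_matches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
expression = [[0.5, 2.0, 1.0], [1.0, 0.0, 1.0], [0.0, 1.0, 3.0]]

kernels = [normalize_kernel(linear_kernel(f))
           for f in (motif_matches, expression)]
# Hypothetical weights; in practice they would be tuned, e.g. by cross-validation.
combined = combine_kernels(kernels, weights=[0.7, 0.3])
```

Setting a weight to zero removes that data type from the classifier entirely, so the same routine can compare single-dataset and combined-dataset classifiers.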
