KIRMES: kernel-based identification of regulatory modules in euchromatic sequences

Motivation: Understanding transcriptional regulation is one of the main challenges in computational biology. An important problem is the identification of transcription factor (TF) binding sites in the promoter regions of potential TF target genes. It is typically approached with position weight matrix-based motif identification algorithms that use Gibbs sampling or heuristics to extend seed oligos. Such algorithms succeed in identifying single, relatively well-conserved binding sites, but tend to fail at identifying combinations of several degenerate binding sites, such as those often found in cis-regulatory modules.

Results: We propose a new algorithm that combines the benefits of existing motif finding with those of support vector machines (SVMs) to find degenerate motifs and thereby improve the modeling of regulatory modules. In experiments on microarray data from Arabidopsis thaliana, we show that the newly developed strategy significantly improves the recognition of TF targets.

Availability: The Python source code (open source, licensed under the GPL), the data for the experiments and a Galaxy-based web service are available at http://www.fml.mpg.de/raetsch/suppl/kirmes/

Contact: sebi@tuebingen.mpg.de

Supplementary information: Supplementary data are available at Bioinformatics online.
