Searching for transcription factor binding sites in vector spaces

Background: Computational approaches to transcription factor (TF) binding site identification have been actively researched over the past decade. By learning from known binding sites, new binding sites of a TF can be identified in unannotated sequences. A number of search methods have been introduced over the years. However, a single method rarely performs best on all TFs. Instead, to identify the best method for a particular TF, one usually has to compare a handful of methods. Hence, it is highly desirable for a method to perform automatic optimization for individual TFs.

Results: We proposed to search for TF binding sites in vector spaces. This framework allows us to identify the best method for each individual TF. We further introduced two novel methods, the negative-to-positive vector (NPV) and optimal discriminating vector (ODV) methods, which construct query vectors to search for binding sites in vector spaces. Extensive cross-validation experiments showed that the proposed methods significantly outperformed both the ungapped likelihood under positional background method, a state-of-the-art method, and the widely used position-specific scoring matrix method. We further demonstrated that motif subtypes of a TF can be readily identified in this framework, and that two variants, the kNPV and kODV methods, benefited significantly from motif subtype identification. Finally, independent validation on ChIP-seq data showed that the ODV and NPV methods significantly outperformed the other compared methods.

Conclusions: We conclude that the proposed framework is highly flexible. It enables the two novel methods to automatically identify a TF-specific subspace in which to search for binding sites. Implementations are available as source code at: http://biogrid.engr.uconn.edu/tfbs_search/.
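The idea of searching with a query vector can be illustrated with a minimal sketch. The abstract does not define the embedding or how the NPV and ODV query vectors are constructed, so everything below is an assumption made for illustration: sequences are one-hot encoded, the query is taken to be the unit vector pointing from the centroid of known non-sites to the centroid of known sites, and candidate windows are ranked by inner product with that query. The function names `embed`, `query_vector`, and `scan`, and the toy sequences, are hypothetical and are not taken from the authors' implementation.

```python
import numpy as np

BASES = "ACGT"

def embed(site):
    """One-hot encode a DNA string into a flat vector of length 4 * len(site)."""
    vec = np.zeros((len(site), 4))
    for pos, base in enumerate(site.upper()):
        vec[pos, BASES.index(base)] = 1.0
    return vec.ravel()

def query_vector(positives, negatives):
    """Assumed construction: unit vector from the negative centroid to the positive centroid."""
    pos_centroid = np.mean([embed(s) for s in positives], axis=0)
    neg_centroid = np.mean([embed(s) for s in negatives], axis=0)
    direction = pos_centroid - neg_centroid
    return direction / np.linalg.norm(direction)

def scan(sequence, query, width):
    """Score every window by its inner product with the query; return (start, score) pairs, best first."""
    hits = [(i, float(embed(sequence[i:i + width]) @ query))
            for i in range(len(sequence) - width + 1)]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)

# Toy data, invented for this example only.
positives = ["TATAAT", "TATAAT", "TATGAT", "TACAAT"]
negatives = ["GCGCGC", "CCGGCC", "GGCCGG", "CGCGCG"]

query = query_vector(positives, negatives)
ranked = scan("GCGCTATAATGCGC", query, width=6)
best_start, best_score = ranked[0]
```

In this toy example the top-ranked window starts at index 4, which is the embedded TATAAT. A position-specific scoring matrix can be viewed as one particular choice of query vector in the same one-hot space, which is one way to read the claim that the framework covers several search methods at once.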
