Using a Solver Over the String Pattern Domain to Analyze Gene Promoter Sequences

This chapter illustrates how inductive querying techniques can support knowledge discovery from genomic data. More precisely, it presents a data mining scenario for discovering putative transcription factor binding sites in gene promoter sequences. We do not give technical details of the constraint-based data mining algorithms involved, which have been described previously. Our contribution is an abstract description of the scenario, its concrete instantiation, and a typical execution on real data. Our main extraction algorithm is a complete solver dedicated to the string pattern domain: it computes all string patterns that satisfy a given conjunction of primitive constraints. We also discuss the processing steps needed to turn the solver into a useful tool. In particular, we introduce a parameter tuning strategy, an appropriate measure for ranking the patterns, and the post-processing approaches that can be, and have been, applied.
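To make the idea of a conjunction of primitive constraints concrete, here is a minimal sketch in Python. It is not the solver used in the chapter, whose algorithms are described elsewhere; it is a brute-force illustration under stated assumptions: patterns are matched exactly, and the constraint names (`min_len`, `max_len`, `min_freq_pos`, `max_freq_neg`) and the split into a positive and a negative set of sequences are hypothetical choices made for the example.

```python
def frequency(pattern, sequences):
    """Count the sequences that contain the pattern at least once (exact match)."""
    return sum(1 for s in sequences if pattern in s)


def solve(positives, negatives, min_len, max_len, min_freq_pos, max_freq_neg):
    """Return every string pattern that satisfies all constraints at once.

    Assumed (hypothetical) primitive constraints:
      - min_len <= len(pattern) <= max_len
      - frequency in `positives` >= min_freq_pos
      - frequency in `negatives` <= max_freq_neg

    With min_freq_pos >= 1, any solution must occur in some positive
    sequence, so enumerating substrings of the positives is complete.
    """
    if min_freq_pos < 1:
        raise ValueError("this sketch assumes min_freq_pos >= 1")

    # Candidate generation: all substrings of the positives within the length bounds.
    candidates = set()
    for s in positives:
        for start in range(len(s)):
            for length in range(min_len, max_len + 1):
                if start + length <= len(s):
                    candidates.add(s[start:start + length])

    # Keep only candidates that satisfy the conjunction of frequency constraints.
    return sorted(
        p for p in candidates
        if frequency(p, positives) >= min_freq_pos
        and frequency(p, negatives) <= max_freq_neg
    )
```

For instance, with the toy sequences `["ACGTGA", "TTACGT", "GACGTC"]` as positives and `["ACGAAA", "TTTTGA"]` as negatives, asking for patterns of length 4 that occur in all three positives and in no negative returns the single pattern `ACGT`. A practical solver would prune the search space using the monotonicity properties of the constraints rather than enumerating every candidate.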
