Computational Discovery of Motifs Using Hierarchical Clustering Techniques

Motif discovery plays a key role in understanding gene regulation in organisms. Existing motif discovery tools show weaknesses in reliability and scalability, so more advanced algorithms are needed. This paper develops data mining techniques for discovering motifs. We propose a mismatch-based hierarchical clustering algorithm that employs three heuristic rules for classifying clusters and a post-processing step for ranking and refining them. The algorithm is evaluated on two sets of DNA sequences and compared with existing tools. Results demonstrate that the proposed techniques outperform MEME, AlignACE and SOMBRERO on most of the test datasets.
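
As a rough illustration of what mismatch-based hierarchical clustering of sequence fragments can look like, the following is a minimal sketch and not the algorithm proposed in this paper. It assumes that "mismatch" means Hamming distance between equal-length k-mers and uses plain complete-linkage agglomerative merging with a mismatch cap. The function names (`hamming`, `extract_kmers`, `cluster_kmers`) and the `max_mismatches` parameter are hypothetical, and the paper's three heuristic rules and its ranking and refinement post-processing are not described in the abstract, so they are not modelled here.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of mismatched positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("k-mers must have equal length")
    return sum(x != y for x, y in zip(a, b))


def extract_kmers(sequences, k):
    """Collect every length-k substring from each sequence, in order."""
    return [s[i:i + k] for s in sequences for i in range(len(s) - k + 1)]


def cluster_kmers(kmers, max_mismatches):
    """Complete-linkage agglomerative clustering under a mismatch cap.

    Assumption: two clusters may merge only if every cross-cluster pair
    differs in at most max_mismatches positions. This stands in for the
    paper's unspecified heuristic rules.
    """
    clusters = [[kmer] for kmer in kmers]
    while len(clusters) > 1:
        best = None
        # Find the pair of clusters whose farthest members are closest.
        for i, j in combinations(range(len(clusters)), 2):
            d = max(hamming(a, b) for a in clusters[i] for b in clusters[j])
            if best is None or d < best[0]:
                best = (d, i, j)
        d, i, j = best
        if d > max_mismatches:
            break  # no remaining pair satisfies the mismatch cap
        clusters[i].extend(clusters[j])
        del clusters[j]  # safe: j > i, so index i is unaffected
    return clusters


if __name__ == "__main__":
    kmers = extract_kmers(["ACGT", "ACGA", "TTTT"], k=4)
    print(cluster_kmers(kmers, max_mismatches=1))
```

This exhaustive pairwise search is only practical for small inputs. It is meant to show the merging criterion, not to address the scalability concerns raised above.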
