Learning Information Patterns in Biological Databases - Stochastic Data Mining

This chapter develops a computational theory for modeling patterns and their hierarchical coordination within biological sequences. With the exception of promoters and enhancers, the functional significance of non-coding DNA is not well understood. Scientists are now discovering that specific regions of non-coding DNA interact with the cellular machinery and help bring about the expression of genes. Our premise is that the arrangement of patterns in biological sequences can be studied with machine learning algorithms. As biological databases continue their exponential growth, it becomes feasible to apply in silico data mining algorithms to discover interesting patterns of motif arrangement and the frequency with which they recur. A systematic procedure for achieving this goal is presented.
