An Iterative Learning Algorithm for Deciphering Stegoscripts: a Grammatical Approach for Motif Discovery

Steganography, or information hiding, conceals the existence of messages in order to protect their confidentiality. We consider the problem of deciphering a stegoscript, a text in which secret messages are embedded within a covertext, and of identifying the vocabulary used in the messages, with no prior knowledge of the vocabulary or the grammar in which the script was written. Our research was motivated by the problem of identifying conserved non-coding functional elements (motifs) in regulatory regions of genome sequences, which we view as stegoscripts constructed by nature according to a statistical model consisting of a dictionary and a grammar. We develop an iterative learning algorithm, WordSpy, to learn such a model from a stegoscript. The model can then be applied to identify the embedded secret messages, i.e., the functional motifs. Our algorithm can successfully recover the most possible text of the first ten chapters of a novel embedded in a stegoscript and identify the transcription factor binding motifs in the upstream regions of approximately 800 yeast genes.
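
To make the notion of a stegoscript concrete, the following toy sketch builds one and then looks for over-represented words. It is not the WordSpy algorithm: it only illustrates the general idea that embedded words occur more often than a background model predicts. The function names `embed_words` and `overrepresented`, the uniform letter background, and the `min_count` threshold are all assumptions made for this illustration.

```python
import random
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def embed_words(words, n_cover, seed=0):
    """Build a toy stegoscript: each secret word is preceded by
    n_cover random cover letters (hypothetical construction)."""
    rng = random.Random(seed)
    pieces = []
    for word in words:
        pieces.append("".join(rng.choice(ALPHABET) for _ in range(n_cover)))
        pieces.append(word)
    return "".join(pieces)


def overrepresented(text, k, min_count=3):
    """Return (k-mer, observed/expected) pairs, highest ratio first.

    The expected count assumes a uniform, independent letter background,
    which is a deliberate simplification for this sketch.
    """
    n_positions = len(text) - k + 1
    counts = Counter(text[i:i + k] for i in range(n_positions))
    expected = n_positions / len(ALPHABET) ** k
    scored = [(w, c / expected) for w, c in counts.items() if c >= min_count]
    return sorted(scored, key=lambda pair: -pair[1])


# Usage: hide two words many times and check which 5-letter string stands out.
script = embed_words(["whale"] * 20 + ["ship"] * 20, n_cover=8)
top = overrepresented(script, k=5)
print(top[0][0])
```

A dictionary-and-grammar model of the kind described in the abstract would go further than this counting step, for example by re-estimating word probabilities iteratively, but the over-representation signal is the starting intuition.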
