GenRate: A Generative Model that Reveals Novel Transcripts in Genome-Tiling Microarray Data

Genome-wide microarray designs containing millions to hundreds of millions of probes are available for a variety of mammals, including mouse and human. These genome tiling arrays can potentially lead to significant advances in science and medicine, for example by revealing new genes and alternative primary and secondary transcripts. While bottom-up pattern matching techniques (e.g., hierarchical clustering) can be used to find gene structures in microarray data, we believe that the many interacting hidden variables and complex noise patterns call more naturally for an analysis based on generative models. We describe a generative model of tiling data and show how the sum-product algorithm can be used to infer hybridization noise, probe sensitivity, new transcripts, and alternative transcripts. The method, called GenRate, maximizes a global scoring function that enables multiple transcripts to compete for ownership of putative probes. We apply GenRate to a new exon tiling dataset from mouse chromosome 4 and show that it makes significantly more predictions than a previously described hierarchical clustering method at the same false positive rate. GenRate correctly predicts many known genes and also predicts new gene structures. As new problems arise, additional hidden variables can be incorporated into the model in a principled fashion, so we believe that GenRate will prove to be a useful tool in the new era of genome-wide tiling microarray analysis.
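To make the idea of sum-product inference over tiled probes concrete, the following is a minimal sketch and not the GenRate model itself. It assumes a toy two-state chain in which each probe is either background or part of a transcript, with Gaussian intensities and a fixed probability of keeping the same state between neighboring probes. The function name `probe_posteriors` and the parameters `mu`, `sigma`, and `stay` are hypothetical, chosen only for illustration. On a chain, the sum-product algorithm reduces to the forward-backward recursions shown here.

```python
import numpy as np

def probe_posteriors(intensities, mu=(0.0, 2.0), sigma=1.0, stay=0.9):
    """Toy sum-product on a chain: P(probe i is in a transcript | all intensities).

    Assumed model (illustrative only, not the GenRate model):
      state 0 = background, state 1 = transcript,
      intensity ~ Normal(mu[state], sigma),
      neighboring probes keep the same state with probability `stay`.
    """
    x = np.asarray(intensities, dtype=float)
    n = x.size
    if n == 0:
        return np.zeros(0)

    means = np.asarray(mu, dtype=float)
    # Local evidence for each probe under each state. The Gaussian
    # normalizing constant is identical for both states, so it is dropped.
    lik = np.exp(-0.5 * ((x[:, None] - means[None, :]) / sigma) ** 2)
    trans = np.array([[stay, 1.0 - stay],
                      [1.0 - stay, stay]])
    prior = np.array([0.5, 0.5])

    # Forward messages, rescaled at each step for numerical stability.
    fwd = np.zeros((n, 2))
    fwd[0] = prior * lik[0]
    fwd[0] /= fwd[0].sum()
    for i in range(1, n):
        fwd[i] = lik[i] * (fwd[i - 1] @ trans)
        fwd[i] /= fwd[i].sum()

    # Backward messages, rescaled in the same way.
    bwd = np.ones((n, 2))
    for i in range(n - 2, -1, -1):
        bwd[i] = trans @ (lik[i + 1] * bwd[i + 1])
        bwd[i] /= bwd[i].sum()

    # Posterior marginal = normalized product of the two incoming messages.
    post = fwd * bwd
    post /= post.sum(axis=1, keepdims=True)
    return post[:, 1]

# Example: three quiet probes, three bright probes, two quiet probes.
example = probe_posteriors([0.1, -0.2, 0.0, 2.1, 1.9, 2.2, 0.1, -0.1])
```

In this toy setting a bright probe flanked by bright neighbors receives a high posterior, while an isolated bright probe is discounted by the `stay` prior. According to the abstract, the model itself goes further: it also infers hybridization noise and probe sensitivity and lets multiple transcripts compete for probes, none of which this sketch attempts.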
