GenRate: A Generative Model That Finds and Scores New Genes and Exons in Genomic Microarray Data

Recently, researchers have made progress in using microarrays to validate predicted exons in genome sequence and to find new gene structures. However, current methods make separate threshold-based decisions on intensity of expression, similarity of expression profiles, and arrangement of exons in the genome. We have taken a Bayesian approach and developed GenRate, a generative model that accounts for both genome-wide expression data taken from multiple conditions (e.g., tissues) and the co-location and density of probes in DNA sequence data. GenRate balances probabilistic evidence derived from these different sources and outputs a score (log-likelihood) for each gene model, enabling the estimation of false-positive and false-negative rates. The model has a number of local minima that is exponential in the length of the DNA sequence data, so direct application of the EM learning algorithm produces poor results. We describe a novel way of parameterizing the model using examples from the data set, so that good solutions are found using an efficient algorithm. We apply GenRate to a subset of mouse genome-wide expression data that we have created, and discuss the statistical significance of the genes found by GenRate. Three of the highest-ranking gene structures found by GenRate, each containing thousands of bases from the genome, are confirmed using RT-PCR experiments.
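
To illustrate how per-model scores can support error-rate estimates, here is a minimal sketch. It is not GenRate's actual procedure, which the abstract does not specify. It assumes two hypothetical inputs: scores computed on null data (for example, expression data with the probe order permuted) and scores for gene models known to be real. The function names and the thresholding scheme are illustrative assumptions.

```python
from bisect import bisect_left


def false_positive_rate(null_scores, threshold):
    """Fraction of null scores at or above the threshold.

    Assumption: null_scores come from data with no real gene structure
    (e.g. permuted probes), so any score passing the threshold is a
    false detection.
    """
    ordered = sorted(null_scores)
    n_at_or_above = len(ordered) - bisect_left(ordered, threshold)
    return n_at_or_above / len(ordered)


def false_negative_rate(positive_scores, threshold):
    """Fraction of known-real gene model scores falling below the threshold."""
    below = sum(1 for s in positive_scores if s < threshold)
    return below / len(positive_scores)


# Hypothetical log-likelihood scores, for illustration only.
null_scores = [-5.0, -3.0, -1.0, 0.5]
positive_scores = [2.0, 1.0, -0.5, 3.0]

fpr = false_positive_rate(null_scores, threshold=0.0)
fnr = false_negative_rate(positive_scores, threshold=0.0)
```

Sweeping the threshold over the observed scores would trace out the trade-off between the two rates, which is the practical benefit of having a single score per gene model instead of several independent cutoffs.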
