Finding large domains of similarly expressed genes

This study introduces a new method for finding and delimiting large domains of adjacent genes on a chromosome that share similar expression profiles. The method combines the minimum description length (MDL) principle with a recursive segmentation procedure, for which a new MDL-based stopping criterion is introduced. Together they offer a novel way to view large domains of similarly expressed genes in genome data. Both the genome data and each large domain are described according to the MDL principle, which selects a model by its goodness of fit while penalizing excessive model complexity. The segmentation succeeds because of the observation that the more similar the gene expression profiles within a large domain are, the shorter the description of the data representing that domain. The new recursive segmentation method was applied to microarray measurements of the Drosophila and human genomes to demonstrate its ability to find large domains.
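The idea of recursive segmentation with an MDL stopping rule can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes one expression value per gene (rather than a full profile), a Gaussian model per segment, and a crude two-part code length. The names `code_length`, `segment`, `VAR_FLOOR`, and `min_len` are hypothetical. A segment is split only when describing the two halves, plus the position of the cut, takes fewer bits than describing the whole segment.

```python
import math

VAR_FLOOR = 1e-4  # assumed lower bound on variance; avoids log(0)


def code_length(values):
    """Approximate two-part code length (in bits) of one segment.

    Assumption: values are Gaussian with a mean and variance fitted to the
    segment. The constant quantization term is omitted because it is the
    same for every segmentation of the same data, so lengths may be negative.
    """
    n = len(values)
    mean = sum(values) / n
    var = max(sum((v - mean) ** 2 for v in values) / n, VAR_FLOOR)
    data_bits = 0.5 * n * math.log2(2 * math.pi * math.e * var)
    param_bits = math.log2(n)  # (k / 2) * log2(n) with k = 2 parameters
    return data_bits + param_bits


def segment(values, start=0, min_len=3):
    """Recursively split `values`; return a list of (begin, end) domains."""
    n = len(values)
    best_cut, best_bits = None, code_length(values)
    for cut in range(min_len, n - min_len + 1):
        bits = (code_length(values[:cut]) + code_length(values[cut:])
                + math.log2(n))  # cost of stating where the cut is
        if bits < best_bits:
            best_cut, best_bits = cut, bits
    if best_cut is None:  # MDL stopping criterion: no split shortens the code
        return [(start, start + n)]
    return (segment(values[:best_cut], start, min_len)
            + segment(values[best_cut:], start + best_cut, min_len))


# Two runs of genes with clearly different expression levels.
expression = [0.1, 0.0, -0.1, 0.05, -0.05, 0.0,
              2.0, 2.1, 1.9, 2.05, 1.95, 2.0]
print(segment(expression))  # [(0, 6), (6, 12)]
```

In this toy input the first six and last six values form two homogeneous runs, so a single cut shortens the description and further cuts do not. A sequence that is already homogeneous is returned as one domain.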