PROBER: Segmentation and Differential Analysis Tool for Tiling Microarray Data

Identifying gene structures is key to genome annotation, and analyzing alternative RNA processing is critical to understanding transcriptome diversity and genome coding capacity. The tiling microarray has become an important tool for genome-wide analysis of gene expression because of its high resolution, high density, and high throughput. However, because of the huge volume of data and the limited probe design options, analyzing tiling microarray data is neither easy nor effective. PROBER, a tool for integrated analysis of massive tiling microarray data, was developed to identify gene structures and analyze differential gene expression. In PROBER, the microarray data are first normalized, and a structural change model is then used to identify gene structures. The t-test, wavelet analysis, and other methods are then applied to analyze differential expression of genes under different conditions. By combining tiling microarray technology with computational recognition models and algorithms, we demonstrate applications of PROBER in gene annotation and in the analysis of alternative RNA processing.

The tiling microarray, which developed from traditional gene-chip technology, makes high-throughput, genome-wide detection of gene expression possible. Such large-scale experimental techniques generate huge volumes of data, but the tools and methods available for analyzing these data are far from meeting the needs of genome research. Accurately and effectively mining these data for biologically meaningful information about gene expression, which is also a main focus of current bioinformatics research, has become a hurdle to the further development of tiling microarray technology.

Gene structures are the basis of genome annotation. However, under different conditions or at different developmental stages, a gene may be expressed differently in terms of its exonic and intronic content and its 5' and 3' untranslated regions (UTRs).
Such changes often produce different transcripts (messenger RNAs, in this case), potentially altering the function of the gene. Hence, an accurate gene structure based on the final mRNA is more important than the gene structure at the DNA level. One of the best uses of tiling microarrays is to observe changes in gene structure under different conditions, in which a precursor mRNA may be processed differently and yield different final mRNA products. Such differential gene expression is quite common in higher eukaryotes (references for alternative splicing and polyadenylation), and it is therefore considered to be one of the mechanisms that regulate cellular activities.

Currently, many gene prediction tools are available for extracting and analyzing genome sequences (1-3). However, these tools have limitations in practical use: they are restricted to certain species, some of the predicted gene structures are uncertain, prediction accuracy is low for some newly discovered genes, and they are overly sensitive to sequence noise. Identifying gene structures, and especially analyzing differential gene expression, has an important impact on the study of gene expression regulation as well as of growth and development. However, current tools fail to meet the needs of biologists, who urgently need tools for genome-wide mining of biological information.

We have developed a tool called PROBER that combines several algorithms and mathematical models for mining massive tiling microarray data. In PROBER, the data are first normalized; a segmentation model is then built to identify gene structures and annotate genes based on the experimental data; finally, different models for analyzing differential gene expression can be built from the transcripts, exons, and introns obtained from the gene structures.
PROBER will provide biologists with genomic resources such as the relevant algorithms, models, and gene chips. PROBER can thus help to uncover the regulation of gene expression, and it can be applied to biological systems other than the Arabidopsis gene expression data tested here.
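
The segmentation and differential-analysis steps described above can be illustrated with a small sketch. The text does not give PROBER's actual implementation, so everything here is an assumption: the function names (`segment`, `welch_t`, `differential_segments`) are hypothetical, the segmentation is a plain least-squares piecewise-constant fit solved by dynamic programming (one simple form of structural change model), the number of segments is supplied by hand, and the differential step reports only a Welch t statistic per segment, without p-values or multiple-testing correction. Normalization and wavelet analysis are not shown.

```python
from math import sqrt
from statistics import mean, variance


def _segment_cost(signal):
    """Return cost(i, j): squared error of signal[i:j] around its own mean."""
    s1, s2 = [0.0], [0.0]
    for x in signal:
        s1.append(s1[-1] + x)
        s2.append(s2[-1] + x * x)

    def cost(i, j):
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    return cost


def segment(signal, n_segments):
    """Fit n_segments constant pieces by least squares (dynamic programming).

    Returns the sorted start indices of segments 2..n_segments.
    Hypothetical helper: not PROBER's own code.
    """
    n = len(signal)
    cost = _segment_cost(signal)
    inf = float("inf")
    # best[k][j]: minimal cost of fitting signal[:j] with k segments.
    best = [[inf] * (n + 1) for _ in range(n_segments + 1)]
    back = [[0] * (n + 1) for _ in range(n_segments + 1)]
    best[0][0] = 0.0
    for k in range(1, n_segments + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):
                c = best[k - 1][i] + cost(i, j)
                if c < best[k][j]:
                    best[k][j], back[k][j] = c, i
    # Trace the optimal segment starts back from the end of the signal.
    starts, j = [], n
    for k in range(n_segments, 0, -1):
        j = back[k][j]
        starts.append(j)
    return sorted(starts)[1:]  # drop the leading 0


def welch_t(a, b):
    """Welch's t statistic; each sample needs >= 2 values and nonzero variance."""
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))


def differential_segments(cond_a, cond_b, breaks):
    """Compare two conditions segment by segment; returns (start, end, t) tuples."""
    bounds = [0] + list(breaks) + [len(cond_a)]
    return [(lo, hi, welch_t(cond_a[lo:hi], cond_b[lo:hi]))
            for lo, hi in zip(bounds, bounds[1:])]


# Toy probe intensities along one locus (illustrative values, not real data).
cond_a = [1.0, 1.1, 0.9, 1.0, 5.0, 5.1, 4.9, 5.0, 1.0, 0.9, 1.1, 1.0]
cond_b = [1.0, 1.1, 0.9, 1.0, 1.0, 0.9, 1.1, 1.0, 1.0, 0.9, 1.1, 1.0]
breaks = segment(cond_a, 3)
results = differential_segments(cond_a, cond_b, breaks)
```

In this toy example the middle segment is expressed in `cond_a` only, so it is the one with a large t statistic. The dynamic program is cubic in the number of probes for a fixed number of segments, so a real tiling array would need the search restricted to windows or a faster change-point method.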

[1] P. Perron, et al. Computation and analysis of multiple structural, 2000.

[2] Wolfgang Huber, et al. Transcript mapping with high-density oligonucleotide tiling arrays, 2006, Bioinform.

[3] Franck Picard, et al. A statistical approach for array CGH data analysis, 2005, BMC Bioinformatics.

[4] Hongwei Zhao, et al. Arabidopsis PCFS4, a homologue of yeast polyadenylation factor Pcf11p, regulates FCA alternative processing and promotes flowering time, 2008, The Plant journal: for cell and molecular biology.

[5] Charles L. Kooperberg, et al. Improved Background Correction for Spotted DNA Microarrays, 2002, J. Comput. Biol.

[6] B. Tian, et al. Bioinformatic identification of candidate cis-regulatory elements involved in human mRNA polyadenylation, 2005, RNA.

[7] Xiaohui Wu, et al. Predictive modeling of plant messenger RNA polyadenylation sites, 2007, BMC Bioinformatics.

[8] Achim Zeileis, et al. Validating multiple structural change models: A case study, 2005.

[9] Robert M. Miura, et al. Prediction of mRNA polyadenylation sites by support vector machine, 2006, Bioinform.

[10] Rafael A Irizarry, et al. Exploration, normalization, and summaries of high density oligonucleotide array probe level data, 2003, Biostatistics.

[11] G. Sherlock. Analysis of large-scale gene expression data, 2000, Current opinion in immunology.

[12] Wolfgang Huber, et al. A high-resolution map of transcription in the yeast genome, 2006, Proceedings of the National Academy of Sciences of the United States of America.