MixThres: mixture models to define a hybridization threshold in DNA microarray experiments

Although a major application of two-color DNA microarray hybridization is the detection of differentially expressed genes from intensity log-ratios, single-channel signals also provide useful information as absolute measurements that allow gene expression patterns to be described. In this context, it becomes crucial to determine the set of probes that hybridize, that is, those for which the intensity signal exceeds a hybridization threshold that must be fixed. Existing procedures either apply an arbitrary threshold or require knowledge of a population of non-hybridized probes. In this work we present the MixThres method, which determines an adaptive hybridization threshold from the intensity levels of the complete set of probes hybridized on a chip. The threshold is based on the histogram of the probe intensity values, and the procedure has two steps. First, the intensity distribution is estimated using mixture models. Second, a hybridization threshold is defined from the components of the mixture. We validate our method on DNA tiling array and expression array data. We show that the method has good reproducibility, a specificity greater than 97%, and a precision of 88%. The R package MixThres is available at http://www.agroparistech.fr/mia/outil.html
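The two-step idea can be illustrated with a minimal sketch. It is not the MixThres implementation: the abstract does not specify the mixture family, the number of components, or the rule that turns components into a threshold. The sketch assumes a two-component Gaussian mixture fitted by plain EM on log2 intensities, and a hypothetical rule that takes the 0.95 quantile of the lowest-mean component as the threshold. The function names, the simulated data, and all parameter values are illustrative assumptions.

```python
import math
import random
from statistics import NormalDist


def fit_gaussian_mixture(values, n_components=2, n_iter=200, tol=1e-8):
    """Fit a 1-D Gaussian mixture by plain EM; returns (weights, means, sds)."""
    xs = sorted(values)
    n = len(xs)
    # Initialise each component from an equal-sized block of the sorted data.
    blocks = [xs[k * n // n_components:(k + 1) * n // n_components]
              for k in range(n_components)]
    weights = [1.0 / n_components] * n_components
    means = [sum(b) / len(b) for b in blocks]
    sds = [max(1e-3, math.sqrt(sum((x - m) ** 2 for x in b) / len(b)))
           for b, m in zip(blocks, means)]
    prev_ll = -math.inf
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each value.
        comps = [NormalDist(m, s) for m, s in zip(means, sds)]
        resp, ll = [], 0.0
        for x in xs:
            dens = [w * c.pdf(x) for w, c in zip(weights, comps)]
            total = sum(dens) or 1e-300
            ll += math.log(total)
            resp.append([d / total for d in dens])
        # M-step: update weights, means and standard deviations.
        for k in range(n_components):
            nk = max(sum(r[k] for r in resp), 1e-12)
            weights[k] = nk / n
            means[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, xs)) / nk
            sds[k] = max(1e-3, math.sqrt(var))
        # Stop once the log-likelihood no longer improves.
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    return weights, means, sds


def hybridization_threshold(weights, means, sds, quantile=0.95):
    """Assumed rule (not from the paper): upper quantile of the
    lowest-mean component, treated here as the non-hybridized background."""
    k = min(range(len(means)), key=means.__getitem__)
    return NormalDist(means[k], sds[k]).inv_cdf(quantile)


# Simulated log2 intensities (illustrative only): background plus signal.
rng = random.Random(0)
intensities = ([rng.gauss(7.0, 0.5) for _ in range(600)]
               + [rng.gauss(11.0, 1.0) for _ in range(400)])

weights, means, sds = fit_gaussian_mixture(intensities)
threshold = hybridization_threshold(weights, means, sds)
hybridized = [x for x in intensities if x > threshold]
print(f"threshold = {threshold:.2f}, probes called hybridized = {len(hybridized)}")
```

Under this assumed rule the threshold adapts to each chip, because it is derived from the fitted background component rather than from a fixed cutoff. On real data the choice of mixture family, number of components, and quantile would all need to be justified.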
