Assessment of reliability of microarray data and estimation of signal thresholds using mixture modeling.

The DNA microarray is an important tool for studying gene activity, but the resulting data, which consist of thousands of points, are error-prone. A serious limitation in microarray analysis is the unreliability of data generated from low signal intensities. Such data may produce erroneous gene expression ratios and lead to unnecessary validation or post-analysis follow-up work. In this study, we describe an approach based on normal mixture modeling for determining optimal signal intensity thresholds, so that reliable measurements of the microarray elements can be identified and false expression ratios subsequently eliminated. We used univariate and bivariate mixture modeling to separate the microarray data into two classes, a low signal intensity population and a reliable signal intensity population, and applied Bayesian decision theory to find the optimal signal thresholds. The bivariate approach was found to be more accurate than the univariate approach, and both were superior to a conventional method when validated against a reference set of biological data consisting of true and false gene expression data. Eliminating unreliable signal intensities should improve the quality of microarray data, including the reproducibility and reliability of gene expression ratios.
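The univariate variant of this idea can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that signal intensities are log-transformed, that a two-component normal mixture is fitted by expectation-maximization, and that both kinds of misclassification carry equal cost, so the Bayes threshold is the point where the two weighted component densities are equal. The function names (`fit_two_component_mixture`, `bayes_threshold`) and the initialization at the median are hypothetical choices made for illustration only.

```python
import numpy as np


def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))


def fit_two_component_mixture(x, n_iter=200, tol=1e-8):
    """Fit a two-component univariate normal mixture by EM.

    Returns (weights, means, sds), ordered so that index 0 is the
    low-intensity component and index 1 the higher-intensity one.
    """
    x = np.asarray(x, dtype=float)
    # Assumption: crude initialization by splitting the data at the median.
    med = np.median(x)
    lower, upper = x[x <= med], x[x > med]
    w = np.array([0.5, 0.5])
    mu = np.array([lower.mean(), upper.mean()])
    sd = np.array([lower.std(), upper.std()]) + 1e-6
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each point.
        dens = np.stack([w[k] * normal_pdf(x, mu[k], sd[k]) for k in range(2)])
        total = np.maximum(dens.sum(axis=0), 1e-300)
        resp = dens / total
        # M-step: update weights, means and standard deviations.
        nk = resp.sum(axis=1)
        w = nk / x.size
        mu = (resp * x).sum(axis=1) / nk
        sd = np.sqrt((resp * (x - mu[:, None]) ** 2).sum(axis=1) / nk) + 1e-6
        ll = np.log(total).sum()
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    order = np.argsort(mu)
    return w[order], mu[order], sd[order]


def bayes_threshold(w, mu, sd, n_grid=10001):
    """Smallest grid point between the two means at which the weighted
    density of the higher component is at least that of the lower one
    (equal misclassification costs assumed)."""
    grid = np.linspace(mu[0], mu[1], n_grid)
    low = w[0] * normal_pdf(grid, mu[0], sd[0])
    high = w[1] * normal_pdf(grid, mu[1], sd[1])
    crossed = high >= low
    if not crossed.any():
        # No crossing between the means: fall back to the upper mean.
        return grid[-1]
    return grid[np.argmax(crossed)]


# Usage on synthetic log2 intensities (made-up values, not data from the study).
rng = np.random.default_rng(0)
signal = np.concatenate([rng.normal(6.0, 0.5, 3000),    # low-intensity population
                         rng.normal(10.0, 1.0, 2000)])  # reliable population
weights, means, sds = fit_two_component_mixture(signal)
threshold = bayes_threshold(weights, means, sds)
reliable = signal >= threshold  # spots retained for ratio calculation
```

A bivariate version would model the two channels of each spot jointly with two-dimensional normal components; the decision rule is the same comparison of weighted densities, but the boundary becomes a curve rather than a single cut-off.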
