A mixture model approach to the tests of concordance and discordance between two large-scale experiments with two-sample groups

MOTIVATION: Advances in experimental technologies such as microarrays, mass spectrometry and nuclear magnetic resonance have made it feasible to collect large-scale data sets in which a large number of features are measured simultaneously. However, because these experiments are relatively costly, sample sizes are usually small. This raises the issue of concordance among different data sets collected for the same study: a feature should behave consistently across data sets. Rigorous statistical methods for evaluating this concordance or discordance are lacking.

METHODS: Based on a three-component normal-mixture model, we propose two likelihood ratio tests for evaluating the concordance and discordance between two large-scale data sets with two sample groups. Parameters are estimated with the expectation-maximization (EM) algorithm. A normal-distribution-quantile-based method is used for data transformation.

RESULTS: Simulation studies suggested that the proposed tests perform satisfactorily. We then applied the tests to three SELDI-MS data sets with replicates: one with replicates from different platforms and two with replicates from the same platform. Data generated by SELDI-MS showed satisfactory concordance between replicates from the same platform but unsatisfactory concordance between replicates from different platforms.

AVAILABILITY: The R code is freely available at http://home.gwu.edu/~ylai/research/Concordance.
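
As a rough illustration of the two computational ingredients named in the abstract, the sketch below shows a rank-based normal-quantile transformation and an EM fit of a three-component normal mixture to a single vector of scores. It is not the authors' R code, and the abstract does not specify the model in this detail: the univariate setting, the function names (`normal_quantile_transform`, `em_three_component`), the starting values and the stopping rule are all assumptions made for illustration, and the likelihood ratio tests themselves are not implemented.

```python
import numpy as np
from statistics import NormalDist


def normal_quantile_transform(x):
    """Map values to standard-normal quantiles by rank (assumed variant)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    ranks = np.argsort(np.argsort(x)) + 1      # ranks 1..n, ties broken by order
    probs = (ranks - 0.5) / n                  # keep probabilities strictly in (0, 1)
    inv_cdf = NormalDist().inv_cdf
    return np.array([inv_cdf(p) for p in probs])


def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2), broadcast over arrays."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))


def em_three_component(z, n_iter=200, tol=1e-8):
    """Fit a three-component univariate normal mixture by EM.

    Returns (weights, means, sds, loglik), where loglik is the log-likelihood
    evaluated at the parameters used in the last E-step.
    """
    z = np.asarray(z, dtype=float)
    # Assumed starting values: low, middle and high components.
    mu = np.array([np.quantile(z, 0.1), np.median(z), np.quantile(z, 0.9)])
    sigma = np.full(3, z.std())
    weights = np.full(3, 1.0 / 3.0)
    prev = -np.inf
    loglik = prev
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each score.
        dens = weights * normal_pdf(z[:, None], mu, sigma)   # shape (n, 3)
        total = dens.sum(axis=1, keepdims=True)
        loglik = float(np.log(total).sum())
        resp = dens / total
        # M-step: weighted updates of mixing proportions, means and SDs.
        nk = resp.sum(axis=0)
        weights = nk / z.size
        mu = (resp * z[:, None]).sum(axis=0) / nk
        var = (resp * (z[:, None] - mu) ** 2).sum(axis=0) / nk
        sigma = np.maximum(np.sqrt(var), 1e-6)  # guard against collapsing components
        if loglik - prev < tol:
            break
        prev = loglik
    return weights, mu, sigma, loglik


if __name__ == "__main__":
    # Simulated scores: a large null component plus two shifted components.
    rng = np.random.default_rng(0)
    z = np.concatenate([
        rng.normal(-4.0, 1.0, 200),
        rng.normal(0.0, 1.0, 600),
        rng.normal(4.0, 1.0, 200),
    ])
    weights, mu, sigma, loglik = em_three_component(z)
    print("weights:", weights)
    print("means:", mu)
    print("sds:", sigma)
```

A likelihood ratio test would compare the maximized log-likelihood of such a fit under a constrained model and under an unconstrained one; how the constraints encode concordance or discordance is defined in the paper, not here.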
