Estimating an Optimal Correlation Structure from Replicated Molecular Profiling Data Using Finite Mixture Models

Estimating the correlation structure of a gene set is a ubiquitous problem in many pattern analyses of replicated molecular profiling data. However, the commonly used Maximum Likelihood Estimate (MLE) approaches do not automatically accommodate replicated measurements. Often an ad hoc preprocessing step is needed, e.g. averaging the replicates (weighted, unweighted, or something in between), which might wipe out important patterns of low magnitude and/or cancel out patterns of similar magnitude. We treat each replicate individually as a random variable and design a finite mixture model to estimate an optimal correlation structure from replicated molecular profiling data. Assuming that the measurements are independent and identically distributed (i.i.d.) samples from a mixture of two multivariate normal distributions, one with a constrained set of parameters and the other with an unconstrained parameter structure, we employ an Expectation-Maximization (EM) algorithm to estimate the component parameters. We carry out a comparative study, including both simulations and real-world data analysis, to assess the estimation of correlation structure using the proposed mixture model and the constrained model given by the first component of the mixture alone. We further test the two models for their performance in clustering real-world data. The mixture model approach shows better overall performance.
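
The EM procedure described above can be illustrated with a minimal sketch. The abstract does not state what constraint the first component carries, so the code below assumes, purely for illustration, that the constrained component has a diagonal covariance while the second component has a full covariance. The function names (`fit_two_component_mixture`, `cov_to_corr`), the initialization, and the small `ridge` term added for numerical stability are all hypothetical choices and not the authors' implementation.

```python
import numpy as np


def log_mvn_pdf(x, mean, cov):
    """Log-density of a multivariate normal evaluated at each row of x."""
    d = x.shape[1]
    diff = x - mean
    chol = np.linalg.cholesky(cov)
    # Solve L z = diff^T; the column-wise squared norm of z is the
    # Mahalanobis distance of each sample.
    z = np.linalg.solve(chol, diff.T)
    maha = np.sum(z ** 2, axis=0)
    log_det = 2.0 * np.sum(np.log(np.diag(chol)))
    return -0.5 * (d * np.log(2.0 * np.pi) + log_det + maha)


def fit_two_component_mixture(x, n_iter=100, ridge=1e-6, seed=0):
    """EM for a two-component Gaussian mixture.

    Component 0 is constrained (ASSUMPTION: diagonal covariance);
    component 1 is unconstrained (full covariance).
    """
    rng = np.random.default_rng(seed)
    n, d = x.shape
    eye = np.eye(d)
    weights = np.array([0.5, 0.5])
    # Initialize the means at two distinct, randomly chosen samples.
    means = x[rng.choice(n, size=2, replace=False)].astype(float)
    covs = np.stack([
        np.diag(np.var(x, axis=0)) + ridge * eye,
        np.cov(x, rowvar=False) + ridge * eye,
    ])
    log_liks = []
    for _ in range(n_iter):
        # E-step: posterior probability of each component per sample.
        log_joint = np.stack(
            [np.log(weights[k]) + log_mvn_pdf(x, means[k], covs[k])
             for k in range(2)],
            axis=1,
        )
        log_norm = np.logaddexp(log_joint[:, 0], log_joint[:, 1])
        resp = np.exp(log_joint - log_norm[:, None])
        log_liks.append(log_norm.sum())

        # M-step: responsibility-weighted updates.
        nk = resp.sum(axis=0)
        weights = nk / n
        for k in range(2):
            means[k] = resp[:, k] @ x / nk[k]
            diff = x - means[k]
            scatter = (resp[:, k, None] * diff).T @ diff / nk[k]
            if k == 0:
                # Constrained component: keep only the variances.
                scatter = np.diag(np.diag(scatter))
            covs[k] = scatter + ridge * eye
    return weights, means, covs, np.array(log_liks)


def cov_to_corr(cov):
    """Convert a covariance matrix into a correlation matrix."""
    sd = np.sqrt(np.diag(cov))
    return cov / np.outer(sd, sd)
```

In this sketch, each row of `x` would hold one vector of measurements, and `cov_to_corr(covs[1])` gives the correlation structure implied by the unconstrained component. How replicates are arranged into rows, and which component's estimate is reported, are modelling decisions the abstract leaves open.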
