Multivariate Models and Algorithms for Learning Correlation Structures from Replicated Molecular Profiling Data

Advances in high-throughput data acquisition technologies, e.g. microarray and next-generation sequencing, have resulted in the production of a vast amount of molecular profiling data. Consequently, there has been an increasing interest in the development of computational methods to uncover gene association patterns underlying such data, e.g. gene clustering (Medvedovic & Sivaganesan, 2002; Medvedovic et al., 2004), inference of gene association networks (Altay & Emmert-Streib, 2010; Butte & Kohane, 2000; Zhu et al., 2005), sample classification (Yeung & Bumgarner, 2005) and detection of differentially expressed genes (Sartor et al., 2006). However, the outcome of any bioinformatics analysis is directly influenced by the quality of the molecular profiling data, which are often contaminated with excessive noise. Replication is a frequently used strategy to account for the noise introduced at various stages of a biomedical experiment and to achieve a reliable discovery of the underlying biomolecular activities. In particular, estimation of the correlation structure of a gene set arises naturally in many pattern analyses of replicated molecular profiling data. In both supervised and unsupervised learning, the performance of various data analysis methods, e.g. linear and quadratic discriminant analysis (Hastie et al., 2009), correlation-based hierarchical clustering (Eisen et al., 1998; de Hoon et al., 2004; Yeung et al., 2003) and co-expression networking (Basso et al., 2005; Boscolo et al., 2008), relies on an accurate estimate of the true correlation structure. The existing MLE (maximum likelihood estimate) based approaches to the estimation of correlation structure do not automatically accommodate replicated measurements. Often, an ad hoc preprocessing step of averaging (weighted, unweighted or something in between) is used to reduce the multivariate structure of replicated data to a bivariate one (Hughes et al., 2000; Yao et al., 2008; Yeung et al., 2003).
Averaging is not completely satisfactory, as it creates a strong bias while reducing the variance among replicates with diverse magnitudes. Moreover, averaging may lead to a significant loss of information, e.g. it may wipe out important patterns of small magnitudes or cancel out opposite patterns of similar magnitudes. Thus, it is necessary to design multivariate correlation estimators that treat each replicate as a separate random variable. In general, the experimental design that specifies the replication mechanism of a gene set may be unknown
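
The cancellation effect described above can be illustrated with a small numerical sketch. This is not the estimator developed in this chapter; it is a toy example under stated assumptions: the data are hypothetical (two genes, two replicates each, six samples), and plain Pearson correlation from NumPy stands in for both the averaged and the replicate-level analysis.

```python
import numpy as np

# Hypothetical toy data (not from the chapter): 2 genes x 2 replicates x 6 samples.
pattern = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
noise = np.array([0.1, -0.1, 0.05, -0.05, 0.1, -0.1])

# Gene A: the two replicates carry opposite patterns of similar magnitude.
gene_a = np.vstack([pattern, -pattern + noise])
# Gene B: the two replicates agree with each other.
gene_b = np.vstack([pattern + noise, pattern - noise])

# Ad hoc preprocessing: unweighted average over replicates, then a
# single (bivariate) Pearson correlation between the averaged profiles.
avg_a = gene_a.mean(axis=0)
avg_b = gene_b.mean(axis=0)
r_avg = np.corrcoef(avg_a, avg_b)[0, 1]

# Replicate-level view: every replicate is kept as its own variable,
# giving a 4 x 4 correlation matrix (rows: A1, A2, B1, B2).
r_full = np.corrcoef(np.vstack([gene_a, gene_b]))

print("correlation after averaging:", round(r_avg, 3))
print("replicate-level correlation matrix:")
print(np.round(r_full, 3))
```

In this toy setting the averaged profile of gene A is left with little more than noise, so the correlation computed after averaging is weak, whereas the replicate-level matrix still shows strong positive and negative associations between individual replicates of the two genes.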

[1] Ka Yee Yeung, et al. Bayesian mixture model based clustering of replicated microarray data, 2004, Bioinform.

[2] G. Casella, et al. Statistical Inference, 2003, Encyclopedia of Social Network Analysis and Mining.

[3] Hanlee P. Ji, et al. Next-generation DNA sequencing, 2008, Nature Biotechnology.

[4] D. Botstein, et al. Cluster analysis and display of genome-wide expression patterns, 1998, Proceedings of the National Academy of Sciences of the United States of America.

[5] Adam A. Margolin, et al. Reverse engineering of regulatory networks in human B cells, 2005, Nature Genetics.

[6] R. Hathaway. A Constrained Formulation of Maximum-Likelihood Estimation for Normal Mixture Distributions, 1985.

[7] Frank Emmert-Streib, et al. Revealing differences in gene network inference algorithms on the network level by ensemble methods, 2010, Bioinform.

[8] Alfred O. Hero, et al. Bayesian Hierarchical Model for Large-Scale Covariance Matrix Estimation, 2007, J. Comput. Biol.

[9] D. Rubin, et al. Maximum likelihood from incomplete data via the EM algorithm plus discussions on the paper, 1977.

[10] V.P. Roychowdhury, et al. An Information Theoretic Exploratory Method for Learning Patterns of Conditional Gene Coexpression from Microarray Data, 2008, IEEE/ACM Transactions on Computational Biology and Bioinformatics.

[11] Roger E Bumgarner, et al. Correction: Multiclass classification of microarray data with repeated measurements: application to cancer, 2006, Genome Biology.

[12] Yudong D. He, et al. Gene expression profiling predicts clinical outcome of breast cancer, 2002, Nature.

[13] Yudong D. He, et al. Functional Discovery via a Compendium of Expression Profiles, 2000, Cell.

[14] Hui Zhang, et al. A Generalized Multivariate Approach to Pattern Discovery from Replicated and Incomplete Genome-Wide Measurements, 2011, IEEE/ACM Transactions on Computational Biology and Bioinformatics.

[15] Mario Medvedovic, et al. Intensity-based hierarchical Bayes method improves testing for differentially expressed genes in microarray experiments, 2006, BMC Bioinformatics.

[16] E. Rubin, et al. Genome-wide requirements for Mycobacterium tuberculosis adaptation and survival in macrophages, 2005, Proceedings of the National Academy of Sciences of the United States of America.

[17] Roger E Bumgarner, et al. Multiclass classification of microarray data with repeated measurements: application to cancer, 2003, Genome Biology.

[18] Charles Kung, et al. Chemical genomic profiling to identify intracellular targets of a multiplex kinase inhibitor, 2005, Proceedings of the National Academy of Sciences of the United States of America.

[19] Dongxiao Zhu, et al. Estimating an Optimal Correlation Structure from Replicated Molecular Profiling Data Using Finite Mixture Models, 2009, 2009 International Conference on Machine Learning and Applications.

[20] Roger E Bumgarner, et al. Clustering gene-expression data with repeated measurements, 2003, Genome Biology.

[21] Geoffrey J. McLachlan, et al. Finite Mixture Models, 2019, Annual Review of Statistics and Its Application.

[22] I S Kohane, et al. Mutual information relevance networks: functional genomic clustering using pairwise entropy measurements, 1999, Pacific Symposium on Biocomputing.

[23] Salvatore Ingrassia, et al. Constrained monotone EM algorithms for finite mixture of multivariate Gaussians, 2007, Comput. Stat. Data Anal.

[24] G. Churchill, et al. Experimental design for gene expression microarrays, 2001, Biostatistics.

[25] K. Strimmer, et al. A Shrinkage Approach to Large-Scale Covariance Matrix Estimation and Implications for Functional Genomics, 2011, Statistical Applications in Genetics and Molecular Biology.

[26] Hua Li, et al. Multivariate correlation estimator for inferring functional relationships from replicated genome-wide data, 2007, Bioinform.

[27] Yeung Sam Hung, et al. Genome-scale cluster analysis of replicated microarrays using shrinkage correlation coefficient, 2008, BMC Bioinformatics.

[28] Mario Medvedovic, et al. Bayesian infinite mixture model based clustering of gene expression profiles, 2002, Bioinform.

[29] Adrian E. Raftery, et al. Model-Based Clustering, Discriminant Analysis, and Density Estimation, 2002.

[30] S. Ingrassia. A likelihood-based constrained algorithm for multivariate normal mixture models, 2004.

[31] Satoru Miyano, et al. Open source clustering software, 2004.

[32] Monika Milewski, et al. Decoding randomly ordered DNA arrays, 2004, Genome Research.