Supervised clustering of high dimensional data using regularized mixture modeling

Identifying relationships between molecular variations and their clinical presentations is complicated by the heterogeneous causes of a disease. It is imperative to unveil the relationship between high-dimensional molecular manifestations and clinical presentations while accounting for possible heterogeneity among the study subjects. We propose a novel supervised clustering algorithm based on a penalized mixture regression model, called CSMR, to address the challenges of studying heterogeneous relationships between high-dimensional molecular features and a phenotype. The algorithm is adapted from the classification expectation maximization algorithm and offers a supervised solution to the clustering problem, with substantial improvements in both computational efficiency and biological interpretability. Experimental evaluation on simulated benchmark datasets demonstrated that CSMR can accurately identify the subspaces in which subsets of features are explanatory of the response variable, and that it outperformed the baseline methods. Application of CSMR to a drug sensitivity dataset again demonstrated its superior performance over the other methods: CSMR recapitulated the distinct subgroups hidden in the pool of cell lines with regard to their coping mechanisms for different drugs. CSMR is a big data analysis tool with the potential to resolve the complexity of translating the clinical manifestations of a disease into the real causes underpinning it. We believe that it will bring new understanding of the molecular basis of disease and could be of special relevance to the growing field of personalized medicine.
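To make the idea of a classification-EM style fit of a penalized mixture regression concrete, the following is a minimal illustrative sketch and not the authors' CSMR implementation. It assumes a lasso penalty, Gaussian noise within each cluster, a fixed number of clusters, and a random initial partition; the function names (`soft_threshold`, `lasso_cd`, `cem_mixture_regression`) and all parameter values are hypothetical choices made for this example only.

```python
import numpy as np


def soft_threshold(z, t):
    """Soft-thresholding operator used by the lasso update."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)


def lasso_cd(X, y, lam, n_iter=50):
    """Coordinate descent for (1/2n)||y - X b||^2 + lam * ||b||_1 (no intercept)."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    resid = y.astype(float).copy()          # residual for the current beta
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0:
                continue
            resid += X[:, j] * beta[j]      # remove feature j's contribution
            rho = X[:, j] @ resid / n
            beta[j] = soft_threshold(rho, lam) / col_sq[j]
            resid -= X[:, j] * beta[j]      # add the updated contribution back
    return beta


def cem_mixture_regression(X, y, k=2, lam=0.05, n_iter=30, seed=0):
    """Hard-assignment (classification EM style) fit of k sparse linear regressions."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    labels = rng.integers(k, size=n)        # assumption: random initial partition
    betas = np.zeros((k, p))
    for _ in range(n_iter):
        betas = np.zeros((k, p))
        sigmas = np.ones(k)
        pis = np.full(k, 1.0 / k)
        # M-step: fit one lasso regression per current cluster
        for c in range(k):
            idx = labels == c
            if idx.sum() < 2:
                continue                    # keep defaults for a (near-)empty cluster
            betas[c] = lasso_cd(X[idx], y[idx], lam)
            r = y[idx] - X[idx] @ betas[c]
            sigmas[c] = max(r.std(), 1e-3)  # floor to avoid a degenerate variance
            pis[c] = idx.mean()
        # E-step and C-step: assign each sample to its most likely component
        resid = y[:, None] - X @ betas.T    # shape (n, k)
        loglik = np.log(pis) - np.log(sigmas) - 0.5 * (resid / sigmas) ** 2
        new_labels = loglik.argmax(axis=1)
        if np.array_equal(new_labels, labels):
            break                           # partition is stable
        labels = new_labels
    return labels, betas


# Usage on synthetic data: two hidden subgroups with different sparse coefficients.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 20))
truth = rng.integers(2, size=200)
coef = np.zeros((2, 20))
coef[0, :3] = [2.0, -1.5, 1.0]
coef[1, 3:6] = [-2.0, 1.5, 1.0]
y_demo = np.einsum("ij,ij->i", X_demo, coef[truth]) + 0.1 * rng.normal(size=200)
labels_demo, betas_demo = cem_mixture_regression(X_demo, y_demo, k=2)
```

Because hard assignments make the procedure sensitive to the starting partition, a practical version would typically try several random starts and keep the best fit; the penalty strength and number of clusters would also need to be selected rather than fixed as they are here.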
