Identification of biologically relevant subtypes via preweighted sparse clustering

Cluster analysis methods are used to identify homogeneous subgroups in a data set. They are frequently applied to identify biologically interesting subgroups, in particular subgroups that are associated with an outcome of interest. Conventional clustering methods often fail to identify such subgroups, particularly when the data set contains a large number of high-variance features. In that case, conventional methods tend to identify clusters driven by the high-variance features, when what one wishes to obtain are secondary clusters that are more interesting biologically or more strongly associated with the outcome of interest. We describe a modification of the sparse clustering method of Witten and Tibshirani (2010) that can be used to identify such secondary clusters or clusters associated with an outcome of interest. We show that this method correctly identifies such clusters in several simulation scenarios. We also apply the method to a large case-control study of temporomandibular disorder (TMD) and to a leukemia microarray data set.

[1] Robert Tibshirani, et al. Estimating the number of clusters in a data set via the gap statistic, 2000.

[2] Debashis Ghosh, et al. A transcriptional fingerprint of estrogen in human breast cancer predicts patient survival, 2008, Neoplasia.

[3] Roger B. Fillingim, et al. Cluster analysis of multiple experimental pain modalities, 2005, Pain.

[4] L. Staudt, et al. The use of molecular profiling to predict survival after chemotherapy for diffuse large-B-cell lymphoma, 2002, The New England journal of medicine.

[5] R. Tibshirani. Regression Shrinkage and Selection via the Lasso, 1996.

[6] Chitta Baral, et al. Fuzzy C-means Clustering with Prior Biological Knowledge, 2022.

[7] R. Tibshirani, et al. Prediction by Supervised Principal Components, 2006.

[8] R. Ohrbach, et al. Summary of findings from the OPPERA baseline case-control study: implications and future directions, 2011, The journal of pain : official journal of the American Pain Society.

[9] Devin C. Koestler, et al. Semi-supervised recursively partitioned mixture models for identifying cancer subtypes, 2010, Bioinform.

[10] R. Tibshirani, et al. Complementary hierarchical clustering, 2008, Biostatistics.

[11] Robert N. Jamison, et al. Empirically derived Symptom Checklist 90 subgroups of chronic pain patients: A cluster analysis, 1988, Journal of Behavioral Medicine.

[12] Eric Bair, et al. Semi‐supervised clustering methods, 2013, Wiley interdisciplinary reviews. Computational statistics.

[13] Trevor Hastie, et al. Regularization Paths for Generalized Linear Models via Coordinate Descent, 2010, Journal of statistical software.

[14] Miroslav Backonja, et al. Complex regional pain syndrome: are there distinct subtypes and sequential stages of the syndrome?, 2002, Pain.

[15] R. Tibshirani, et al. Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications, 2001, Proceedings of the National Academy of Sciences of the United States of America.

[16] E. Lander, et al. Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses, 2001, Proceedings of the National Academy of Sciences of the United States of America.

[17] R. Tibshirani, et al. Gene expression profiling identifies clinically relevant subtypes of prostate cancer, 2004, Proceedings of the National Academy of Sciences of the United States of America.

[18] Yudong D. He, et al. Gene expression profiling predicts clinical outcome of breast cancer, 2002, Nature.

[19] Arindam Banerjee, et al. Semi-supervised Clustering by Seeding, 2002, ICML.

[20] Eric Bair, et al. Study methods, recruitment, sociodemographic findings, and demographic representativeness in the OPPERA study, 2011, The journal of pain : official journal of the American Pain Society.

[21] Robert Tibshirani, et al. Cluster Validation by Prediction Strength, 2005.

[22] B. Naliboff, et al. Multidimensional subgroups in migraine: differential treatment outcome to a pain medicine program, 2003, Pain medicine.

[23] R. Tibshirani, et al. Diagnosis of multiple cancer types by shrunken centroids of gene expression, 2002, Proceedings of the National Academy of Sciences of the United States of America.

[24] Margaret R. Karagas, et al. Model-based clustering of DNA methylation array data: a recursive-partitioning algorithm for high-dimensional data arising as a mixture of beta distributions, 2008, BMC Bioinformatics.

[25] Wei Pan, et al. Penalized Model-Based Clustering with Application to Variable Selection, 2007, J. Mach. Learn. Res.

[26] W. Maixner, et al. Idiopathic pain disorders – Pathways of vulnerability, 2006, PAIN.

[27] Debashis Ghosh, et al. Mixture modelling of gene expression data from microarray experiments, 2002, Bioinform.

[28] Shizhong Xu, et al. Supervised cluster analysis for microarray data based on multivariate Gaussian mixture, 2004, Bioinform.

[29] Eric Bair, et al. Potential autonomic risk factors for chronic TMD: descriptive data and empirically identified domains from the OPPERA case-control study, 2011, The journal of pain : official journal of the American Pain Society.

[30] Eric Bair, et al. Pain sensitivity risk factors for chronic TMD: descriptive data and empirically identified domains from the OPPERA case control study, 2011, The journal of pain : official journal of the American Pain Society.

[31] Teresa M. Erb, et al. Evaluation and Management of Dysmenorrhea in Adolescents, 2008, Clinical obstetrics and gynecology.

[32] S. Dworkin, et al. Research diagnostic criteria for temporomandibular disorders: review, criteria, examinations and specifications, critique, 1992, Journal of craniomandibular disorders : facial & oral pain.

[33] J. Friedman, et al. Clustering objects on subsets of attributes (with discussion), 2004.

[34] Catherine A. Sugar, et al. Finding the Number of Clusters in a Dataset, 2003.

[35] David E. Misek, et al. Gene-expression profiles predict survival of patients with lung adenocarcinoma, 2002, Nature Medicine.

[36] Robert Tibshirani, et al. A Framework for Feature Selection in Clustering, 2010, Journal of the American Statistical Association.

[37] William Maixner, et al. Orofacial pain prospective evaluation and risk assessment study--the OPPERA study, 2011, The journal of pain : official journal of the American Pain Society.

[38] A. Raftery, et al. Variable Selection for Model-Based Clustering, 2006.

[39] R. Ohrbach, et al. Potential psychosocial risk factors for chronic TMD: descriptive data and empirically identified domains from the OPPERA case-control study, 2011, The journal of pain : official journal of the American Pain Society.

[40] R. Tibshirani, et al. Semi-Supervised Methods to Predict Patient Survival from Gene Expression Data, 2004, PLoS biology.