Semi-supervised learning via penalized mixture model with application to microarray sample classification

MOTIVATION: It is biologically interesting to ask whether human blood outgrowth endothelial cells (BOECs) belong to, or are closer to, large vessel endothelial cells (LVECs) or microvascular endothelial cells (MVECs) on the basis of global expression profiling. An earlier analysis using hierarchical clustering and a small set of genes suggested that BOECs were closer to MVECs. We take a semi-supervised learning approach that exploits the two known classes, LVEC and MVEC, while allowing BOEC samples either to belong to one of the two classes or to form a new class of their own. For high-dimensional data such as those encountered here, we propose a penalized mixture model with a weighted L1 penalty that performs automatic feature selection while the model is being fitted.

RESULTS: We applied the penalized mixture model to a combined dataset containing 27 BOEC, 28 LVEC and 25 MVEC samples. The analysis indicated that the BOEC samples appeared to form their own new class. A simulation study confirmed that, compared with the standard mixture model with or without initial variable selection, the penalized mixture model performed much better in identifying relevant genes and forming the corresponding clusters. The penalized mixture model therefore seems promising for high-dimensional data, offering both novel class discovery and automatic feature selection.
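The abstract does not spell out the fitting algorithm, so the following is only a minimal sketch of how a weighted L1 penalty can yield feature selection inside an EM-style mean update. It assumes a Gaussian mixture with a common diagonal covariance, in which case the penalized M-step for each cluster mean reduces to soft-thresholding the weighted sample mean. The function names (`soft_threshold`, `penalized_mean_update`, `fix_labeled_responsibilities`), the parameter `lam`, the weight matrix `w`, and the convention of marking unlabeled samples with -1 are all hypothetical choices for illustration and are not taken from the paper.

```python
import numpy as np


def soft_threshold(z, t):
    """Shrink z toward zero by t; values with |z| <= t become exactly 0."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)


def penalized_mean_update(X, tau, sigma2, lam, w=None):
    """One L1-penalized update of the cluster means (assumed model).

    X      : (n, p) data, assumed standardized per gene
    tau    : (n, K) posterior membership probabilities
    sigma2 : (p,)   common diagonal variances
    lam    : penalty strength (hypothetical tuning parameter)
    w      : (K, p) penalty weights; all ones gives a plain L1 penalty
    """
    n_k = tau.sum(axis=0)                         # effective cluster sizes
    mu_tilde = (tau.T @ X) / n_k[:, None]         # unpenalized weighted means
    if w is None:
        w = np.ones_like(mu_tilde)
    thresh = lam * w * sigma2[None, :] / n_k[:, None]
    return soft_threshold(mu_tilde, thresh)


def fix_labeled_responsibilities(tau, labels):
    """Semi-supervised step: samples with a known class (label >= 0) get a
    one-hot membership; samples labeled -1 keep their estimated tau."""
    tau = tau.copy()
    idx = np.flatnonzero(labels >= 0)
    tau[idx] = 0.0
    tau[idx, labels[idx]] = 1.0
    return tau


# Toy data: gene 0 separates the two clusters, gene 1 does not.
X = np.array([[2.0, 0.3], [2.2, 0.1], [-2.0, 0.3], [-2.2, 0.1]])
tau = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
mu = penalized_mean_update(X, tau, sigma2=np.ones(2), lam=1.0)
selected = np.any(mu != 0.0, axis=0)   # genes with a nonzero mean somewhere
```

Under these assumptions, a gene whose mean is shrunk to exactly zero in every cluster no longer contributes to separating the clusters, which is one way feature selection can happen while the model is fitted. Allowing more mixture components than there are known classes would let unlabeled samples form a new class.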
