Semi-supervised classification using sparse representation for cancer recurrence prediction

Gene expression profiles have been used to predict cancer recurrence and other clinical outcomes of cancer patients. However, clinical information on cancer patients is often incomplete, which yields many unlabeled samples that cannot be used in supervised learning. In this paper, we develop a novel semi-supervised learning (SSL) method that uses both labeled and unlabeled patient samples to predict cancer recurrence. Our new SSL algorithm employs a sparse representation approach in which a labeled sample is represented as a combination of a small number of properly chosen unlabeled samples. Experiments with a set of gene expression data from patients with colorectal cancer (CRC) demonstrate that our SSL algorithm can improve prediction accuracy compared with two other SSL methods, TSVM and T3VM, and with the traditional support vector machine.
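The sparse representation step described above can be illustrated with a minimal sketch. It assumes an L1-penalized least-squares (lasso-type) formulation solved by iterative soft-thresholding (ISTA); the abstract does not state which objective or solver the method uses, nor how the resulting coefficients feed into the classifier. The function names, the penalty `lam`, and the synthetic data are all hypothetical and chosen only for illustration.

```python
import numpy as np

def soft_threshold(v, t):
    # Elementwise soft-thresholding, the proximal operator of t * ||.||_1
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def sparse_code(x, U, lam=0.05, n_iter=500):
    """Approximately solve min_a 0.5 * ||x - U a||^2 + lam * ||a||_1 with ISTA.

    x : (n_genes,) expression profile of one labeled sample
    U : (n_genes, n_unlabeled) matrix whose columns are unlabeled samples
    """
    # Step size 1/L, where L is the squared spectral norm of U
    step = 1.0 / (np.linalg.norm(U, 2) ** 2)
    a = np.zeros(U.shape[1])
    for _ in range(n_iter):
        grad = U.T @ (U @ a - x)  # gradient of the least-squares term
        a = soft_threshold(a - step * grad, step * lam)
    return a

# Synthetic data (an assumption for illustration, not real expression data)
rng = np.random.default_rng(0)
n_genes, n_unlabeled = 50, 20
U = rng.standard_normal((n_genes, n_unlabeled))
U /= np.linalg.norm(U, axis=0)  # unit-norm columns

# A "labeled" sample built from two unlabeled samples plus small noise
x = 1.5 * U[:, 3] - 1.0 * U[:, 11] + 0.01 * rng.standard_normal(n_genes)

coef = sparse_code(x, U)
top2 = np.argsort(-np.abs(coef))[:2]
print("unlabeled samples with the largest coefficients:", sorted(top2.tolist()))
```

In this toy setting, the L1 penalty drives most coefficients toward zero, so the labeled sample is explained mainly by the few unlabeled samples it was built from. A larger `lam` yields a sparser representation at the cost of a larger reconstruction error.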
