Biomarker Identification and Cancer Classification Based on Microarray Data Using Laplace Naive Bayes Model with Mean Shrinkage

Biomarker identification and cancer classification are two closely related problems. In gene expression data sets, genes that share a biological pathway can be highly correlated. Moreover, gene expression data sets may contain outliers arising from chemical or electrical causes. A good gene selection method should therefore take group effects into account and be robust to outliers. In this paper, we propose a Laplace naive Bayes model with mean shrinkage (LNB-MS). The Laplace distribution, rather than the normal distribution, is used as the conditional distribution of the samples because it is less sensitive to outliers and has been applied in many fields. The key technique is an L_1 penalty imposed on the mean of each class to achieve automatic feature selection. The objective function of the proposed model is piecewise linear in the mean of each class, so its optimal value can be found simply by evaluating it at the breakpoints. An efficient algorithm is designed to estimate the parameters of the model. A new strategy is introduced that uses the number of selected features to control the regularization parameter. Experimental results on simulated data sets and 17 publicly available cancer data sets attest to the accuracy, sparsity, efficiency, and robustness of the proposed algorithm. Many biomarkers identified with our method have been verified in biochemical or biomedical research. An analysis of the biological and functional correlation of the genes, based on Gene Ontology (GO) terms, shows that the proposed method guarantees that highly correlated genes are selected together.
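The breakpoint idea can be illustrated with a minimal sketch. It is not the LNB-MS algorithm itself: it assumes a single feature within a single class, a fixed Laplace scale, and a penalty applied directly to |mu|. The function name `shrunken_laplace_mean` and the parameters `lam` and `scale` are hypothetical. Under these assumptions the penalized negative log-likelihood, sum_i |x_i - mu| / scale + lam * |mu|, is convex and piecewise linear in mu, so a minimizer lies at one of the breakpoints (the samples and zero).

```python
import numpy as np


def shrunken_laplace_mean(x, lam, scale=1.0):
    """Minimize sum_i |x_i - mu| / scale + lam * |mu| over mu.

    Illustrative assumption: one feature, one class, fixed scale.
    The objective is convex and piecewise linear, so it is enough
    to evaluate it at the breakpoints {0} and {x_i}.
    """
    x = np.asarray(x, dtype=float)
    # Zero is listed first so that ties are resolved in favour of sparsity
    # (np.argmin returns the first index of the minimum).
    candidates = np.concatenate(([0.0], x))
    # Row k holds |x_i - candidates[k]| for every sample i.
    residuals = np.abs(x[None, :] - candidates[:, None])
    costs = residuals.sum(axis=1) / scale + lam * np.abs(candidates)
    return candidates[np.argmin(costs)]


if __name__ == "__main__":
    samples = [1.0, 2.0, 3.0]
    # No penalty: the estimate is a median of the samples.
    print(shrunken_laplace_mean(samples, lam=0.0))
    # Heavy penalty: the estimate is shrunk exactly to zero.
    print(shrunken_laplace_mean(samples, lam=10.0))
```

In this sketch, a feature whose estimate is exactly zero would carry no class-specific information, which is how an L_1 penalty on the mean can act as a feature selector; a larger `lam` zeroes out more features.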
