Statistical Applications in Genetics and Molecular Biology Sparse Logistic Regression with Lp Penalty for Biomarker Identification

In this paper, we propose a novel method for sparse logistic regression with non-convex regularization Lp (p <1). Based on smooth approximation, we develop several fast algorithms for learning the classifier that is applicable to high dimensional dataset such as gene expression. To the best of our knowledge, these are the first algorithms to perform sparse logistic regression with an Lp and elastic net (Le) penalty. The regularization parameters are decided through maximizing the area under the ROC curve (AUC) of the test data. Experimental results on methylation and microarray data attest the accuracy, sparsity, and efficiency of the proposed algorithms. Biomarkers identified with our methods are compared with that in the literature. Our computational results show that Lp Logistic regression (p <1) outperforms the L1 logistic regression and SCAD SVM. Software is available upon request from the first author.

[1]  J. Friedman,et al.  A Statistical View of Some Chemometrics Regression Tools , 1993 .

[2]  R. Tibshirani Regression Shrinkage and Selection via the Lasso , 1996 .

[3]  R. Tibshirani The lasso method for variable selection in the Cox model. , 1997, Statistics in medicine.

[4]  Paul S. Bradley,et al.  Feature Selection via Concave Minimization and Support Vector Machines , 1998, ICML.

[5]  Ron Kohavi,et al.  The Wrapper Approach , 1998 .

[6]  Michael A. Saunders,et al.  Atomic Decomposition by Basis Pursuit , 1998, SIAM J. Sci. Comput..

[7]  U. Alon,et al.  Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. , 1999, Proceedings of the National Academy of Sciences of the United States of America.

[8]  Gary A. Churchill,et al.  Analysis of Variance for Gene Expression Microarray Data , 2000, J. Comput. Biol..

[9]  Wenjiang J. Fu,et al.  Asymptotics for lasso-type estimators , 2000 .

[10]  Vladimir N. Vapnik,et al.  The Nature of Statistical Learning Theory , 2000, Statistics for Engineering and Information Science.

[11]  Gérard Dreyfus,et al.  Withdrawing an example from the training set: An analytic estimation of its effect on a non-linear parameterised model , 2000, Neurocomputing.

[12]  Trevor Hastie,et al.  The Elements of Statistical Learning , 2001 .

[13]  Jianqing Fan,et al.  Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties , 2001 .

[14]  Christina Kendziorski,et al.  On Differential Variability of Expression Ratios: Improving Statistical Inference about Gene Expression Changes from Microarray Data , 2001, J. Comput. Biol..

[15]  I. Mian,et al.  Identifying marker genes in transcription profiling data using a mixture of feature relevance experts. , 2001, Physiological genomics.

[16]  George Eastman House,et al.  Sparse Bayesian Learning and the Relevance Vector Machine , 2001 .

[17]  T. H. Bø,et al.  New feature subset selection procedures for classification of expression profiles , 2002, Genome Biology.

[18]  A D Long,et al.  Improved Statistical Inference from DNA Microarray Data Using Analysis of Variance and A Bayesian Statistical Framework , 2001, The Journal of Biological Chemistry.

[19]  William Stafford Noble,et al.  Analysis of strain and regional variation in gene expression in mouse brain , 2001, Genome Biology.

[20]  P. Laird,et al.  Hierarchical clustering of lung cancer cell lines using DNA methylation markers. , 2002, Cancer epidemiology, biomarkers & prevention : a publication of the American Association for Cancer Research, cosponsored by the American Society of Preventive Oncology.

[21]  Iñaki Inza,et al.  Gene selection by sequential search wrapper approaches in microarray cancer class prediction , 2002, J. Intell. Fuzzy Syst..

[22]  Geoffrey J McLachlan,et al.  Selection bias in gene extraction on the basis of microarray gene-expression data , 2002, Proceedings of the National Academy of Sciences of the United States of America.

[23]  D. Donoho,et al.  Maximal Sparsity Representation via l 1 Minimization , 2002 .

[24]  Léon Personnaz,et al.  MLPs (Mono-Layer Polynomials and Multi-Layer Perceptrons) for Nonlinear Modeling , 2003, J. Mach. Learn. Res..

[25]  S. Sathiya Keerthi,et al.  A simple and efficient algorithm for gene selection using sparse logistic regression , 2003, Bioinform..

[26]  C. Ling,et al.  AUC: a Statistically Consistent and more Discriminating Measure than Accuracy , 2003, IJCAI.

[27]  Kam D. Dahlquist,et al.  Regression Approaches for Microarray Data Analysis , 2002, J. Comput. Biol..

[28]  Peter W. Laird,et al.  A comparison of cluster analysis methods using DNA methylation data , 2004, Bioinform..

[29]  Dmitry M. Malioutov,et al.  A sparse signal reconstruction perspective for source localization with sensor arrays , 2005, IEEE Transactions on Signal Processing.

[30]  Jiangsheng Yu,et al.  Bayesian neural network approaches to ovarian cancer identification from high-resolution mass spectrometry data , 2005, ISMB.

[31]  Jian Liu,et al.  Logistic support vector machines and their application to gene expression data , 2005, Int. J. Bioinform. Res. Appl..

[32]  Lawrence Carin,et al.  Sparse multinomial logistic regression: fast algorithms and generalization bounds , 2005, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[33]  H. Zou,et al.  Regularization and variable selection via the elastic net , 2005 .

[34]  Yan Xu,et al.  Transcriptional profiling suggests that Barrett's metaplasia is an early intermediate stage in esophageal adenocarcinogenesis , 2006, Oncogene.

[35]  Xiaodong Lin,et al.  Gene expression Gene selection using support vector machines with non-convex penalty , 2005 .

[36]  Tom Fawcett,et al.  ROC Graphs: Notes and Practical Considerations for Researchers , 2007 .