Regularization and feature selection for networked features

In the standard formalization of supervised learning problems, a datum is represented as a vector of features, with no prior knowledge about relationships among the features. For many real-world problems, however, such prior knowledge about structural relationships among features is available. For instance, in microarray analysis, where the genes are the features, the genes form biological pathways. Such prior knowledge should be incorporated to build a more accurate and interpretable model, especially in applications with high dimensionality and small sample sizes. To incorporate these structural relationships efficiently, we have designed a classification model in which an undirected graph captures the relationships among features. Our method combines L1-norm regularization and Laplacian-based L2-norm regularization with logistic regression. The L1 term enforces model sparsity and the Laplacian term enforces smoothness among connected features, so that the model identifies a small subset of grouped features. We have derived efficient optimization algorithms based on coordinate descent for the new formulation. In a comprehensive experimental study, we demonstrate the effectiveness of the proposed learning methods.
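As an illustration of how an L1 penalty and a graph-Laplacian quadratic penalty can be combined with logistic regression, the following is a minimal sketch, not the authors' implementation. It assumes an unnormalized Laplacian L = D - A, labels in {0, 1}, and an objective of the form mean logistic loss + lam1 * ||w||_1 + (lam2 / 2) * w^T L w; the exact formulation and scaling in the paper may differ. For brevity it uses proximal gradient descent rather than the coordinate descent mentioned above. All function and parameter names (`graph_laplacian`, `fit_network_logreg`, `lam1`, `lam2`) are hypothetical.

```python
import numpy as np


def graph_laplacian(n_features, edges):
    """Unnormalized Laplacian L = D - A of an undirected feature graph."""
    A = np.zeros((n_features, n_features))
    for i, j in edges:
        A[i, j] = A[j, i] = 1.0
    return np.diag(A.sum(axis=1)) - A


def soft_threshold(z, t):
    """Proximal operator of t * ||.||_1 (elementwise shrinkage toward zero)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)


def objective(w, X, y, L, lam1, lam2):
    """Assumed objective: mean logistic loss + L1 term + Laplacian term."""
    z = X @ w
    loss = np.mean(np.logaddexp(0.0, z) - y * z)
    return loss + lam1 * np.abs(w).sum() + 0.5 * lam2 * (w @ L @ w)


def fit_network_logreg(X, y, L, lam1=0.05, lam2=0.1, n_iter=500):
    """Proximal gradient descent on the assumed objective (no intercept)."""
    n, p = X.shape
    w = np.zeros(p)
    # Upper bound on the Lipschitz constant of the smooth part's gradient:
    # ||X||_2^2 / (4n) for the logistic loss plus lam2 * lambda_max(L).
    lip = np.linalg.norm(X, 2) ** 2 / (4 * n) + lam2 * np.linalg.eigvalsh(L)[-1]
    step = 1.0 / lip
    for _ in range(n_iter):
        prob = 1.0 / (1.0 + np.exp(-(X @ w)))
        # Gradient of the smooth part: logistic loss plus (lam2 / 2) * w^T L w.
        grad = X.T @ (prob - y) / n + lam2 * (L @ w)
        # The L1 term is handled by soft-thresholding after the gradient step.
        w = soft_threshold(w - step * grad, step * lam1)
    return w
```

A call might look like `w = fit_network_logreg(X, y, graph_laplacian(p, edges))`, where `edges` lists pairs of feature indices that are linked in the prior network (for example, genes in the same pathway). Larger `lam1` drives more coefficients to exactly zero, while larger `lam2` pulls the coefficients of connected features toward each other.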
