A likelihood-based approach for multivariate categorical response regression in high dimensions

We propose a penalized likelihood method to fit the bivariate categorical response regression model. Our method allows practitioners to estimate which predictors are irrelevant, which predictors affect only the marginal distributions of the bivariate response, and which predictors affect both the marginal distributions and the log odds ratios. To compute our estimator, we propose an efficient first-order algorithm, which we extend to the semi-supervised setting, where some subjects have only one response variable measured. We derive an asymptotic error bound that illustrates the performance of our estimator in high-dimensional settings. Generalizations to the multivariate categorical response regression model are proposed. Finally, simulation studies and an application in pan-cancer risk prediction demonstrate the usefulness of our method in terms of interpretability and prediction accuracy. An R package implementing the proposed method is available for download.
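
To make the distinction between marginal effects and log odds ratio effects concrete, the following is a minimal sketch for the simplest case of two binary responses. It is an illustration only, not the authors' estimator or R package. The function names `log_odds_ratio` and `independent_joint` are hypothetical. When the two responses are independent, the joint table is the outer product of the marginals and the log odds ratio is zero, so a predictor that changes only the marginals leaves it at zero.

```python
import math


def log_odds_ratio(joint):
    """Log odds ratio of a 2x2 table with joint[j][k] = P(Y1 = j, Y2 = k).

    Assumes all four cell probabilities are strictly positive.
    """
    (p00, p01), (p10, p11) = joint
    return math.log(p00 * p11) - math.log(p01 * p10)


def independent_joint(m1, m2):
    """Joint table for independent binary responses (hypothetical helper).

    m1 = P(Y1 = 1) and m2 = P(Y2 = 1) are the marginal probabilities.
    """
    return [
        [(1 - m1) * (1 - m2), (1 - m1) * m2],
        [m1 * (1 - m2), m1 * m2],
    ]


if __name__ == "__main__":
    # Changing the marginals alone keeps the log odds ratio at (numerically) zero.
    print(log_odds_ratio(independent_joint(0.3, 0.6)))
    print(log_odds_ratio(independent_joint(0.7, 0.2)))

    # A table with positive association between the two responses.
    print(log_odds_ratio([[0.4, 0.1], [0.1, 0.4]]))
```

In the general categorical case there is one log odds ratio per pair of adjacent response categories rather than a single number, but the interpretation is the same: a predictor that shifts these quantities changes the association between the responses, not just their marginal distributions.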
