A Large Scale Analysis of Logistic Regression: Asymptotic Performance and New Insights

Logistic regression, one of the most popular machine learning methods for binary classification, has long been believed to be unbiased. In this paper, we consider the "hard" classification problem of separating high-dimensional Gaussian vectors, where the data dimension p and the sample size n are both large. Building on recent advances in random matrix theory (RMT) and high-dimensional statistics, we derive the asymptotic distribution of the logistic regression classifier and, consequently, its associated classification performance. This brings new insight into the internal mechanism of the logistic regression classifier, including a possible bias in the separating hyperplane, as well as into practical issues such as hyperparameter tuning, thereby opening the door to novel RMT-inspired improvements.
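The setting studied above can be illustrated with a minimal numpy-only sketch: two Gaussian classes N(+mu, I_p) and N(-mu, I_p) with p and n both large, and an unregularized logistic regression classifier fitted by gradient descent. All parameter values (p, n, the signal strength, the step size) are illustrative choices, not the paper's experimental configuration.

```python
import numpy as np

# Illustrative sketch (assumed parameters, not the paper's setup):
# two high-dimensional Gaussian classes N(+mu, I_p) and N(-mu, I_p).
rng = np.random.default_rng(0)
p, n = 200, 1000                    # data dimension and sample size, both large
mu = np.full(p, 1.0 / np.sqrt(p))   # class means +/- mu; ||mu|| = 1 keeps the task "hard"

y = rng.integers(0, 2, size=n) * 2 - 1             # labels in {-1, +1}
X = y[:, None] * mu + rng.standard_normal((n, p))  # row i drawn from N(y_i * mu, I_p)

# Plain (unregularized) logistic regression via gradient descent on the
# empirical logistic loss  (1/n) * sum_i log(1 + exp(-y_i * x_i^T beta)).
beta = np.zeros(p)
for _ in range(500):
    margins = np.clip(y * (X @ beta), -30.0, 30.0)  # clip to avoid exp overflow
    weights = 1.0 / (1.0 + np.exp(margins))         # sigma(-y_i * x_i^T beta)
    grad = -(y[:, None] * X).T @ weights / n        # gradient of the logistic loss
    beta -= 1.0 * grad

# Empirical training accuracy of the learned separating hyperplane.
acc = np.mean(np.sign(X @ beta) == y)
```

The decision rule here is sign(x^T beta) with no intercept; probing how this hyperplane deviates from the oracle direction mu, as a function of the ratio p/n, is exactly the kind of question the asymptotic analysis addresses.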
