The convergence of linear classifiers on large sparse data

Abstract: Large sparse datasets are common in product recommendation applications. For scalability, linear classifiers are preferred for classification tasks on such datasets. Our previous work (Li et al., 2016) studied how sparsity affects naive Bayes classification under typical missing-data mechanisms. In this paper, we greatly expand that work to cover all linear classifiers and explore practical strategies for improving classification accuracy on large sparse data. Using real-world and synthetic experiments, we observe different learning-curve behaviors under different missing-data mechanisms, and we study the theoretical reasons for these observations. Our studies provide a practical guideline for deciding whether and when obtaining more data, obtaining missing values in the data, or both is worthwhile. This can be very valuable in recommendation system applications.

[1] Ning Chen, et al. Dropout Training for Support Vector Machines, 2014, AAAI.

[2] Zoubin Ghahramani, et al. Stochastic Inference for Scalable Probabilistic Modeling of Binary Matrices, 2014, ICML.

[3] Ohad Shamir, et al. Efficient Learning with Partially Observed Attributes, 2010, ICML.

[4] Krishna P. Gummadi, et al. A measurement-driven analysis of information propagation in the flickr social network, 2009, WWW '09.

[5] Vladimir Vapnik, et al. Statistical learning theory, 1998.

[6] Ohad Shamir, et al. Learning to classify with missing and corrupted features, 2008, ICML.

[7] Qiang Yang, et al. Transfer learning for collaborative filtering via a rating-matrix generative model, 2009, ICML '09.

[8] Stephen Tyree, et al. Learning with Marginalized Corrupted Features, 2013, ICML.

[9] Roi Livni, et al. Classification with Low Rank and Missing Data, 2015, ICML.

[10] James Bennett, et al. The Netflix Prize, 2007.

[11] Huaimin Wang, et al. The Convergence Behavior of Naive Bayes on Large Sparse Datasets, 2015, IEEE International Conference on Data Mining.

[12] David M. Blei, et al. Bayesian Nonparametric Poisson Factorization for Recommendation Systems, 2014, AISTATS.

[13] Chun Chen, et al. An exploration of improving collaborative recommender systems via user-item subgroups, 2012, WWW.

[14] Lionel Médini, et al. RED: a Rich Epinions Dataset for Recommender Systems, 2012.

[15] Shai Shalev-Shwartz, et al. Stochastic dual coordinate ascent methods for regularized loss, 2012, J. Mach. Learn. Res.

[16] Robert D. Nowak, et al. Transduction with Matrix Completion: Three Birds with One Stone, 2010, NIPS.

[17] Russell Greiner, et al. Learning and Classifying Under Hard Budgets, 2005, ECML.

[18] Elad Hazan, et al. Linear Regression with Limited Observation, 2012, ICML.

[19] Nitish Srivastava, et al. Dropout: a simple way to prevent neural networks from overfitting, 2014, J. Mach. Learn. Res.

[20] D. Rubin. Inference and Missing Data, 1975.

[21] Yoram Singer, et al. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization, 2011, J. Mach. Learn. Res.

[22] Patrick Seemann, et al. Matrix Factorization Techniques for Recommender Systems, 2014.

[23] Roderick J. A. Little, et al. Statistical Analysis with Missing Data, 1988.

[24] Ruslan Salakhutdinov, et al. Probabilistic Matrix Factorization, 2007, NIPS.

[25] D. Rubin, et al. Maximum likelihood from incomplete data via the EM algorithm plus discussions on the paper, 1977.

[26] Sean M. McNee, et al. Improving recommendation lists through topic diversification, 2005, WWW '05.

[27] Russell Greiner, et al. Active Model Selection, 2004, UAI.

[28] Jin Tian, et al. Missing at Random in Graphical Models, 2015, AISTATS.

[29] Roderick J. A. Little. Regression with Missing X's: A Review, 1992.

[30] Ohad Shamir, et al. Attribute Efficient Linear Regression with Distribution-Dependent Sampling, 2015, ICML.

[31] Michael I. Jordan, et al. Large Margin Classifiers: Convex Loss, Low Noise, and Convergence Rates, 2003, NIPS.

[32] Johan A. K. Suykens, et al. Handling missing values in support vector machine classifiers, 2005, Neural Networks.

[33] Foster J. Provost, et al. Predictive Modeling With Big Data: Is Bigger Really Better? 2013, Big Data.

[34] Ruslan Salakhutdinov, et al. Bayesian probabilistic matrix factorization using Markov chain Monte Carlo, 2008, ICML '08.

[35] Charles X. Ling, et al. Clustering-based factorized collaborative filtering, 2013, RecSys.

[36] Zoubin Ghahramani, et al. Probabilistic Matrix Factorization with Non-random Missing Data, 2014, ICML.