Learning Using Large Datasets

This contribution develops a theoretical framework that takes into account the effect of approximate optimization on learning algorithms. The analysis shows distinct tradeoffs for small-scale and large-scale learning problems. Small-scale learning problems are subject to the usual approximation–estimation tradeoff. Large-scale learning problems are subject to a qualitatively different tradeoff that involves the computational complexity of the underlying optimization algorithms in non-trivial ways. For instance, a mediocre optimization algorithm, stochastic gradient descent, is shown to perform very well on large-scale learning problems.
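To make the claim about stochastic gradient descent concrete, the following is a minimal sketch of plain SGD for least-squares linear regression. It is an illustrative assumption, not the paper's experimental setup; the function name, learning rate, and synthetic data are all hypothetical. The key property it exhibits is that each update touches a single example, so per-step cost is independent of the dataset size, which is what makes a "mediocre" optimizer competitive in the large-scale regime.

import numpy as np

def sgd_linear(X, y, lr=0.01, epochs=5, seed=0):
    """Plain SGD for least-squares linear regression (illustrative sketch).

    Each update uses one example, so the cost of a step does not grow
    with the number of training points.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(n):
            # Gradient of the squared loss on a single example.
            grad = (X[i] @ w - y[i]) * X[i]
            w -= lr * grad
    return w

# Usage on synthetic data (hypothetical):
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=1000)
print(sgd_linear(X, y))  # approximately recovers w_true

With a large dataset, running such an optimizer for a single pass already yields a low-precision solution; the paper's point is that in the large-scale regime this coarse optimization accuracy is acceptable because the estimation error, not the optimization error, dominates.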
