Scalable Subspace Logistic Regression Models for High Dimensional Data

Massive, high-dimensional real-world data provide more information for logistic regression classification, but they also make it hard to build models accurately and efficiently. In this paper, we propose a scalable subspace logistic regression algorithm. It combines a random subspace sampling method with the traditional logistic regression algorithm and is designed to handle massive, high-dimensional data effectively. We prove that the algorithm is particularly well suited to distributed computing environments, and we implement it on the Hadoop platform using the MapReduce programming framework. Experiments on real and synthetic datasets show that our algorithm performs better than other logistic regression algorithms.
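The general idea of combining random subspace sampling with logistic regression can be illustrated with a minimal single-machine sketch. This is not the paper's implementation: the function names, the batch gradient descent fit, the averaging of predicted probabilities, and all hyperparameter defaults are assumptions made for illustration, and the Hadoop/MapReduce layer is omitted entirely.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logreg(X, y, features, epochs=200, lr=0.5):
    """Fit logistic regression by batch gradient descent,
    using only the columns listed in `features` (assumed fitting method)."""
    w = [0.0] * len(features)
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(features)
        grad_b = 0.0
        for row, label in zip(X, y):
            p = sigmoid(b + sum(wj * row[f] for wj, f in zip(w, features)))
            err = p - label
            for j, f in enumerate(features):
                grad_w[j] += err * row[f]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return features, w, b


def train_subspace_ensemble(X, y, n_models=5, subspace_size=2, seed=0):
    """Train one model per randomly sampled feature subspace.
    Each model is independent of the others, so in a distributed
    setting each could be trained by a separate worker."""
    rng = random.Random(seed)
    d = len(X[0])
    return [
        train_logreg(X, y, rng.sample(range(d), subspace_size))
        for _ in range(n_models)
    ]


def predict_proba(ensemble, row):
    # Assumed combination rule: average the per-model probabilities.
    probs = [
        sigmoid(b + sum(wj * row[f] for wj, f in zip(w, feats)))
        for feats, w, b in ensemble
    ]
    return sum(probs) / len(probs)
```

Because each subspace model sees only `subspace_size` of the original features, the per-model cost depends on the subspace size rather than the full dimensionality, and the models can be fitted in parallel before their outputs are combined.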
