Sparse Block Regression (SBR) for Big Data with Categorical Variables

Categorical variables are nominal variables that classify observations into groups. The treatment of categorical variables in regression is a well-studied yet vital problem, and the most popular solution is one-hot encoding. However, challenges arise when a categorical variable has millions of levels: the memory needed for the computation can far exceed the total memory available in a given computer system, or even in a computer cluster. It is therefore fair to state that the one-hot encoding approach has its limitations when a categorical variable has a large number of levels. The common workaround is the sparse matrix approach, because it requires far fewer resources to cache the dummy variables. However, existing sparse matrix approaches are still not sufficient for extreme cases in which a categorical variable has millions of levels. For instance, the number of subnets in network traffic analyses can easily exceed tens of millions. In this paper, we propose an innovative approach called sparse block regression (SBR) to address this challenge. SBR constructs a sparse block matrix using sufficient statistics. The benefits include, but are not limited to: 1) it overcomes the memory barrier caused by one-hot encoding; 2) it obtains multiple models with a single scan of data stored in secondary storage; and 3) it updates the models with simple matrix operations. The study compared the proposed SBR against conventional sparse matrix approaches. The experiments showed that SBR can efficiently and accurately solve regression problems with a large number of categories. Compared to the sparse matrix approach, SBR reduced memory usage by 90% during the computation.
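To illustrate the sufficient-statistics idea behind this kind of approach, the sketch below shows how the statistics X^T X and X^T y for a regression with one one-hot-encoded categorical variable can be accumulated in a single pass over data chunks, and then reused to obtain multiple models (e.g., several ridge penalties) without rescanning the data. This is a minimal sketch under assumed inputs, not the authors' SBR implementation; the chunk layout, function names, and the ridge-penalty loop are illustrative assumptions.

```python
# Minimal sketch (not the authors' SBR code): accumulate the sufficient
# statistics X^T X and X^T y for a regression with a single one-hot-encoded
# categorical variable plus one numeric column, in one scan over data chunks.
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import spsolve

def one_hot_chunk(codes, n_levels):
    """Sparse one-hot encoding of integer category codes for one chunk."""
    n = len(codes)
    return sparse.csr_matrix(
        (np.ones(n), (np.arange(n), codes)), shape=(n, n_levels)
    )

def accumulate(chunks, n_levels):
    """Single scan: sum X^T X and X^T y over chunks read from secondary storage."""
    p = n_levels + 1                      # dummy columns + one numeric column
    xtx = sparse.csr_matrix((p, p))
    xty = np.zeros(p)
    for codes, x_num, y in chunks:        # each chunk: (category codes, numeric col, target)
        X = sparse.hstack([one_hot_chunk(codes, n_levels),
                           sparse.csr_matrix(x_num.reshape(-1, 1))]).tocsr()
        xtx += X.T @ X                    # the dummy-vs-dummy block stays sparse (diagonal)
        xty += X.T @ y
    return xtx.tocsc(), xty

def ridge_solutions(xtx, xty, lambdas):
    """Once the statistics are cached, fit many models with simple matrix operations."""
    p = xtx.shape[0]
    return {lam: spsolve(xtx + lam * sparse.eye(p, format="csc"), xty)
            for lam in lambdas}
```

Because each chunk only contributes an additive update to X^T X and X^T y, the models can also be refreshed incrementally as new data arrive, which mirrors the update-by-matrix-operations benefit described above.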
