The biglasso Package: A Memory- and Computation-Efficient Solver for Lasso Model Fitting with Big Data in R

Penalized regression models such as the lasso have been extensively applied to the analysis of high-dimensional data sets. However, due to memory limitations, existing R packages like glmnet and ncvreg cannot fit lasso-type models to the ultrahigh-dimensional, multi-gigabyte data sets increasingly seen in areas such as genetics, genomics, biomedical imaging, and high-frequency finance. In this work, we implement an R package, biglasso, that tackles this challenge. biglasso uses memory-mapped files to store massive data on disk, reading data into memory only when required during model fitting, and thus handles out-of-core computation seamlessly. Moreover, it is equipped with newly proposed, more efficient feature screening rules that substantially accelerate the computation. Benchmarking experiments show that biglasso is considerably more memory- and computation-efficient than existing popular packages such as glmnet. We further analyze a 31 GB real data set on a laptop with only 16 GB of RAM to demonstrate the out-of-core capability of biglasso on massive data sets that existing R packages cannot accommodate.
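To make the intended workflow concrete, below is a minimal sketch of out-of-core lasso fitting with biglasso. The file names ("x.csv", "X.desc") and the randomly generated response are placeholder assumptions for illustration only; setupX(), biglasso(), cv.biglasso(), and bigmemory::attach.big.matrix() are used in line with the packages' documented interfaces.

    library(biglasso)

    # Set up the design matrix as a file-backed (memory-mapped) big.matrix,
    # reading the raw data from disk; "x.csv" is a placeholder file name.
    X <- setupX("x.csv", sep = ",")

    # Alternatively, attach a previously created backing file via its
    # descriptor file without loading the data into RAM.
    # X <- bigmemory::attach.big.matrix("X.desc")

    # Placeholder response vector; in practice y comes from the study data.
    y <- rnorm(nrow(X))

    # Fit the full lasso solution path; blocks of X are read into memory
    # only as needed, so the computation proceeds out of core.
    fit <- biglasso(X, y, family = "gaussian", penalty = "lasso")

    # Cross-validate over the solution path to choose the penalty level.
    cvfit <- cv.biglasso(X, y, family = "gaussian", nfolds = 5)

Because X remains a memory-mapped object throughout, the same script runs whether or not the design matrix fits in RAM, which is what allows the 31 GB example in the abstract to be fitted on a 16 GB laptop.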
