Online censoring for large-scale regressions

With 2.5 quintillion bytes of data generated every day, the era of Big Data is undoubtedly upon us. Nonetheless, a significant percentage of the accrued data can be omitted while still maintaining a prescribed quality of statistical inference under a limited computational budget. In this context, adaptively estimating high-dimensional signals from massive data observed sequentially is challenging yet important in practice. The present paper addresses this challenge with a novel approach that leverages interval censoring for data reduction. An online maximum likelihood, least mean-square (LMS)-type algorithm, and an online support vector regression algorithm are developed for censored data. The proposed algorithms entail simple, low-complexity, closed-form updates, and have provably bounded regret. Simulated tests corroborate their efficacy.
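To make the idea of censoring for data reduction concrete, the following is a minimal sketch of an LMS-type recursion that skips a datum whenever its prediction error falls inside an interval around zero. It is one plausible reading of the abstract, not the paper's algorithm: the censoring rule, the threshold `tau`, the step size `step`, and the helper names `censored_lms` and `make_stream` are all assumptions introduced for illustration.

```python
import random

def censored_lms(stream, dim, step=0.05, tau=0.5):
    """LMS-type recursion that ignores data with small prediction error.

    Assumed censoring rule (not taken from the paper): a datum (x, y) is
    censored when |y - theta^T x| <= tau, in which case no update is made.
    """
    theta = [0.0] * dim
    used = 0  # number of data that actually triggered an update
    for x, y in stream:
        err = y - sum(t * xi for t, xi in zip(theta, x))
        if abs(err) <= tau:
            continue  # censored: treated as uninformative, no computation spent
        # Standard LMS step on the uncensored datum.
        theta = [t + step * err * xi for t, xi in zip(theta, x)]
        used += 1
    return theta, used

def make_stream(theta_true, n, noise=0.1, seed=0):
    """Hypothetical synthetic linear-regression stream used for the demo."""
    rng = random.Random(seed)
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in theta_true]
        y = sum(t * xi for t, xi in zip(theta_true, x)) + rng.gauss(0.0, noise)
        yield x, y

theta_true = [1.0, -2.0, 0.5]
n = 5000
theta_hat, used = censored_lms(make_stream(theta_true, n), dim=len(theta_true))
print(f"updates performed: {used} of {n}")
print("estimate:", [round(t, 2) for t in theta_hat])
```

In this sketch, a larger `tau` discards more data and lowers the number of updates at the cost of a coarser estimate, which is the accuracy-versus-computation trade-off the abstract alludes to.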
