Online sketching of big categorical data with absent features

With the scale of data growing every day, reducing the dimensionality (a.k.a. sketching) of high-dimensional vectors has emerged as a task of increasing importance. Relevant issues to address in this context include the sheer volume of data vectors that may consist of categorical (i.e., finite-alphabet) features, the typically streaming format of data acquisition, and possibly absent features. To cope with these challenges, the present paper brings forth a novel rank-regularized maximum-likelihood approach that models categorical data as quantized values of analog-amplitude features with low intrinsic dimensionality. This model, together with recent advances in online rank regularization, is leveraged to sketch high-dimensional categorical data "on the fly." Simulated tests with synthetic as well as real-world datasets corroborate the merits of the novel scheme relative to state-of-the-art alternatives.
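The abstract does not spell out the update equations, but the model it describes (categorical data viewed as quantized values of a low-rank analog matrix, fit by rank-regularized maximum likelihood one vector at a time, with absent features masked out) can be illustrated with a minimal sketch. The snippet below assumes a Bernoulli/logistic quantization model for binary data, a Frobenius-norm surrogate for the rank penalty, and alternating stochastic-gradient updates; the function name online_categorical_sketch, the step sizes, and the synthetic test are illustrative choices under these assumptions, not the paper's exact algorithm.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def online_categorical_sketch(stream, d, r, lam=0.1, inner_iters=50,
                              step_q=0.1, step_U=0.05, rng=None):
    """Illustrative online sketching of binary (0/1) data with missing entries.

    stream : iterable of (y, mask) pairs; y is a length-d 0/1 vector and
             mask is a boolean vector marking which features are observed.
    d      : ambient (high) dimension.
    r      : target sketch dimension (upper bound on the latent rank).
    lam    : weight of the Frobenius-norm surrogate for the rank penalty.

    Returns the learned dictionary U (d x r) and the list of sketches q_t.
    """
    rng = np.random.default_rng(rng)
    U = 0.1 * rng.standard_normal((d, r))   # low-rank dictionary (subspace)
    sketches = []

    for y, mask in stream:
        # Per-datum step: fit the r-dimensional sketch q by gradient ascent
        # on the Bernoulli log-likelihood restricted to observed entries.
        q = np.zeros(r)
        Uo, yo = U[mask], y[mask]
        for _ in range(inner_iters):
            resid = yo - sigmoid(Uo @ q)            # log-likelihood gradient
            q += step_q * (Uo.T @ resid - lam * q)  # ridge term from rank surrogate
        sketches.append(q)

        # Dictionary step: one stochastic-gradient update of the rows of U
        # corresponding to the observed features only.
        resid = yo - sigmoid(Uo @ q)
        U[mask] += step_U * (np.outer(resid, q) - lam * Uo)

    return U, sketches

if __name__ == "__main__":
    # Synthetic check: binary data quantized from a rank-3 analog matrix,
    # with roughly 30% of the features absent in every incoming vector.
    rng = np.random.default_rng(0)
    d, r_true, T = 100, 3, 500
    A, B = rng.standard_normal((d, r_true)), rng.standard_normal((r_true, T))
    Y = (rng.random((d, T)) < sigmoid(A @ B)).astype(float)
    masks = rng.random((d, T)) > 0.3

    U, Q = online_categorical_sketch(
        ((Y[:, t], masks[:, t]) for t in range(T)), d=d, r=5, rng=1)
    print("dictionary:", U.shape, "  sketch length:", len(Q[0]))
```

The streaming structure is the point: each incoming vector is compressed to its r-dimensional sketch using only its observed entries, and the dictionary is refined on the fly, so no full data matrix is ever stored.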
