Online sketching of big categorical data with absent features

With the scale of data growing every day, reducing the dimensionality (a.k.a. sketching) of high-dimensional vectors has emerged as a task of increasing importance. Relevant issues to address in this context include the sheer volume of data vectors that may consist of categorical (i.e., finite-alphabet) features, the typically streaming format of data acquisition, and possibly absent features. To cope with these challenges, the present paper brings forth a novel rank-regularized maximum-likelihood approach that models categorical data as quantized values of analog-amplitude features with low intrinsic dimensionality. This model, together with recent advances in online rank regularization, is leveraged to sketch high-dimensional categorical data "on the fly." Simulated tests with synthetic as well as real-world datasets corroborate the merits of the novel scheme relative to state-of-the-art alternatives.
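The abstract does not spell out the update equations, but the model it describes (categorical data viewed as quantized values of a low-rank analog matrix, fit by rank-regularized maximum likelihood one vector at a time, with absent features masked out) can be illustrated with a minimal sketch. The snippet below assumes a Bernoulli/logistic quantization model for binary data, a Frobenius-norm surrogate for the rank penalty, and alternating stochastic-gradient updates; the function name online_categorical_sketch, the step sizes, and the synthetic test are illustrative choices under these assumptions, not the paper's exact algorithm.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def online_categorical_sketch(stream, d, r, lam=0.1, inner_iters=50,
                              step_q=0.1, step_U=0.05, rng=None):
    """Illustrative online sketching of binary (0/1) data with missing entries.

    stream : iterable of (y, mask) pairs; y is a length-d 0/1 vector and
             mask is a boolean vector marking which features are observed.
    d      : ambient (high) dimension.
    r      : target sketch dimension (upper bound on the latent rank).
    lam    : weight of the Frobenius-norm surrogate for the rank penalty.

    Returns the learned dictionary U (d x r) and the list of sketches q_t.
    """
    rng = np.random.default_rng(rng)
    U = 0.1 * rng.standard_normal((d, r))   # low-rank dictionary (subspace)
    sketches = []

    for y, mask in stream:
        # Per-datum step: fit the r-dimensional sketch q by gradient ascent
        # on the Bernoulli log-likelihood restricted to observed entries.
        q = np.zeros(r)
        Uo, yo = U[mask], y[mask]
        for _ in range(inner_iters):
            resid = yo - sigmoid(Uo @ q)            # log-likelihood gradient
            q += step_q * (Uo.T @ resid - lam * q)  # ridge term from rank surrogate
        sketches.append(q)

        # Dictionary step: one stochastic-gradient update of the rows of U
        # corresponding to the observed features only.
        resid = yo - sigmoid(Uo @ q)
        U[mask] += step_U * (np.outer(resid, q) - lam * Uo)

    return U, sketches

if __name__ == "__main__":
    # Synthetic check: binary data quantized from a rank-3 analog matrix,
    # with roughly 30% of the features absent in every incoming vector.
    rng = np.random.default_rng(0)
    d, r_true, T = 100, 3, 500
    A, B = rng.standard_normal((d, r_true)), rng.standard_normal((r_true, T))
    Y = (rng.random((d, T)) < sigmoid(A @ B)).astype(float)
    masks = rng.random((d, T)) > 0.3

    U, Q = online_categorical_sketch(
        ((Y[:, t], masks[:, t]) for t in range(T)), d=d, r=5, rng=1)
    print("dictionary:", U.shape, "  sketch length:", len(Q[0]))
```

The streaming structure is the point: each incoming vector is compressed to its r-dimensional sketch using only its observed entries, and the dictionary is refined on the fly, so no full data matrix is ever stored.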
