Sparse Poisson coding for high dimensional document clustering

Document clustering plays an important role in large-scale textual data analysis, which generally faces the challenge of high-dimensional textual data. One remedy is to learn a high-level sparse representation using sparse coding techniques. In contrast to traditional sparse coding methods based on Gaussian noise, in this paper we employ a Poisson distribution model to represent the word-count frequency features of a text for sparse coding. Moreover, a novel sparsity-constrained Poisson regression algorithm is proposed to solve the induced optimization problem. Unlike previous Poisson regression methods, which rely on the family of ℓ1-regularization to encourage sparse solutions, we introduce a sparsity ratio measure that makes use of both the ℓ1-norm and the ℓ2-norm of the learned weights. An important advantage of the sparsity ratio is that it is bounded between 0 and 1, which makes it easy to set in practical applications. To further make the algorithm tractable for high-dimensional textual data, a projected gradient descent algorithm is proposed to solve the regression problem. Extensive experiments show that the proposed approach achieves effective representations for document clustering compared with state-of-the-art regression methods.
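The abstract does not give the formula for the sparsity ratio or for the objective, so the following is only a minimal sketch under stated assumptions. It assumes the ratio takes the form of Hoyer's sparseness measure from the non-negative matrix factorization literature, which combines the ℓ1-norm and ℓ2-norm and lies between 0 and 1, and it assumes the data-fit term is the standard Poisson negative log-likelihood of the word counts. The function names `sparsity_ratio` and `poisson_nll` are hypothetical and are not taken from the paper.

```python
import math


def sparsity_ratio(w):
    """Hoyer-style sparseness of a weight vector (assumed form, not from the paper).

    Returns 1.0 when exactly one entry is non-zero and 0.0 when all
    entries have the same magnitude.
    """
    n = len(w)
    if n < 2:
        raise ValueError("need at least two weights")
    l1 = sum(abs(x) for x in w)
    l2 = math.sqrt(sum(x * x for x in w))
    if l2 == 0.0:
        raise ValueError("undefined for the all-zero vector")
    return (math.sqrt(n) - l1 / l2) / (math.sqrt(n) - 1.0)


def poisson_nll(counts, rates):
    """Poisson negative log-likelihood of word counts given positive rates.

    The constant log(y!) term is dropped because it does not depend on
    the rates (assumed objective, not from the paper).
    """
    return sum(r - y * math.log(r) for y, r in zip(counts, rates))


if __name__ == "__main__":
    print(sparsity_ratio([0.0, 0.0, 3.0, 0.0]))  # one active weight: maximally sparse
    print(sparsity_ratio([1.0, 1.0, 1.0, 1.0]))  # uniform weights: minimally sparse
    print(poisson_nll([2, 0, 1], [2.0, 0.5, 1.0]))
```

Under these assumptions, a target value for the ratio can be chosen on a fixed 0-to-1 scale regardless of the vocabulary size, whereas an ℓ1 penalty weight has no such natural range.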
