Hybrid linear matrix factorization for topic-coherent term clustering

We propose a novel Karhunen-Loeve Transformation (KLT) for dimension reduction. A Karhunen-Loeve expansion based on a Wiener process further optimizes the KLT results. State-of-the-art topic-coherence metrics are used for word clustering and evaluation. Topic-coherent term clustering is the foundation of document organization, corpus summarization, and document classification, and it is especially useful for the emerging problems of big data. However, a term clustering method that can handle high-dimensional data of varying length and topic count while achieving high topic coherence remains an open and challenging research problem. This paper proposes a hybrid linear matrix factorization method that identifies topic-coherent terms in documents and assembles them into a thesaurus for clustering. Starting from an analogous Karhunen-Loeve transformation that maps the full set of PCA scores into the factor-coefficient (loading) space of factor analysis (FA), the dimensionality of the PCA scores is reduced and topic-coherent terms are classified by the main FA factors, which can be interpreted as topics. The Karhunen-Loeve transformation minimizes the total mean square error, which increases topic coherence. The initial transformation is then optimized further through a Karhunen-Loeve expansion based on a stochastic Wiener process, yielding optimal topic-coherent bags of terms that build a more topic-coherent model. The approach is evaluated on the CISI, MedSH, and Tweets datasets of different sizes and numbers of topics, where it outperforms the methods used for comparison.
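To make the pipeline concrete, the sketch below strings together the generic building blocks the abstract names: a PCA reduction of a term-document matrix, a factor-analysis step whose loadings act as topics, assignment of each term to its dominant factor, and a simple UMass-style co-occurrence score as a coherence proxy. It is only a minimal illustration with a toy corpus and scikit-learn estimators; the paper's hybrid mapping of PCA scores into the FA loading space and the Wiener-process Karhunen-Loeve expansion are not reproduced here, and the corpus, component counts, and coherence proxy are assumptions made for this example.

```python
# Minimal sketch of the generic building blocks named in the abstract,
# assuming scikit-learn and a toy corpus. NOT the authors' hybrid
# KLT / Karhunen-Loeve-expansion method; the component count and the
# UMass-style coherence proxy are illustrative assumptions.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA, FactorAnalysis

docs = [
    "heart disease and blood pressure treatment",
    "blood pressure medication for heart failure",
    "information retrieval with vector space models",
    "query expansion improves document retrieval",
]

# Term-document representation (documents x terms).
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs).toarray()
terms = vectorizer.get_feature_names_out()

n_topics = 2  # assumed; the paper derives the factor count from the data

# Step 1: PCA scores give a reduced, decorrelated document representation.
# (In the paper these scores are mapped into the FA loading space; here the
# two decompositions are simply run side by side.)
pca_scores = PCA(n_components=n_topics).fit_transform(X)

# Step 2: factor analysis; its loadings play the role of topics over terms.
fa = FactorAnalysis(n_components=n_topics, random_state=0)
fa.fit(X)
loadings = fa.components_  # shape: (n_topics, n_terms)

# Step 3: cluster each term under the factor on which it loads most strongly.
term_topic = np.abs(loadings).argmax(axis=0)
clusters = {k: [t for t, f in zip(terms, term_topic) if f == k]
            for k in range(n_topics)}

# Step 4: a crude UMass-style coherence proxy based on document co-occurrence.
X_bin = (X > 0).astype(int)
col = {t: i for i, t in enumerate(terms)}

def umass_coherence(cluster_terms):
    score, pairs = 0.0, 0
    for i, wi in enumerate(cluster_terms):
        for wj in cluster_terms[:i]:
            d_i = X_bin[:, col[wi]].sum()
            d_ij = (X_bin[:, col[wi]] & X_bin[:, col[wj]]).sum()
            score += np.log((d_ij + 1.0) / d_i)
            pairs += 1
    return score / pairs if pairs else 0.0

for k, cluster in clusters.items():
    print(f"topic {k}: {cluster}  coherence={umass_coherence(cluster):.3f}")
```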
