Hybrid linear matrix factorization for topic-coherent term clustering

We propose a novel Karhunen-Loeve Transformation (KLT) for dimension reduction. A Karhunen-Loeve expansion based on a Wiener process further optimizes the KLT results. State-of-the-art topic-coherence metrics are used for word clustering and evaluation. Topic-coherent term clustering is the foundation of document organization, corpus summarization, and document classification, and it is especially useful for the emerging problems of big data. However, a term clustering method that can handle high-dimensional data of varying length and topic count while achieving high topic coherence remains an open and challenging research problem. This paper proposes a hybrid linear matrix factorization method that identifies topic-coherent terms in documents and assembles them into a thesaurus for clustering. Starting from an analogous Karhunen-Loeve transformation that maps the full set of PCA scores into the factor-coefficient (loading) space of factor analysis (FA), the dimensionality of the PCA scores is reduced and topic-coherent terms are classified by the main FA factors, which can be interpreted as topics. The Karhunen-Loeve transformation minimizes the total mean square error, which increases topic coherence. The initial transformation is then optimized further through a Karhunen-Loeve expansion based on a stochastic Wiener process, yielding optimal topic-coherent bags of terms that build a more topic-coherent model. The approach is evaluated on the CISI, MedSH, and Tweets datasets of different sizes and numbers of topics, where it outperforms the methods used for comparison.
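To make the pipeline concrete, the sketch below strings together the generic building blocks the abstract names: a PCA reduction of a term-document matrix, a factor-analysis step whose loadings act as topics, assignment of each term to its dominant factor, and a simple UMass-style co-occurrence score as a coherence proxy. It is only a minimal illustration with a toy corpus and scikit-learn estimators; the paper's hybrid mapping of PCA scores into the FA loading space and the Wiener-process Karhunen-Loeve expansion are not reproduced here, and the corpus, component counts, and coherence proxy are assumptions made for this example.

```python
# Minimal sketch of the generic building blocks named in the abstract,
# assuming scikit-learn and a toy corpus. NOT the authors' hybrid
# KLT / Karhunen-Loeve-expansion method; the component count and the
# UMass-style coherence proxy are illustrative assumptions.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA, FactorAnalysis

docs = [
    "heart disease and blood pressure treatment",
    "blood pressure medication for heart failure",
    "information retrieval with vector space models",
    "query expansion improves document retrieval",
]

# Term-document representation (documents x terms).
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs).toarray()
terms = vectorizer.get_feature_names_out()

n_topics = 2  # assumed; the paper derives the factor count from the data

# Step 1: PCA scores give a reduced, decorrelated document representation.
# (In the paper these scores are mapped into the FA loading space; here the
# two decompositions are simply run side by side.)
pca_scores = PCA(n_components=n_topics).fit_transform(X)

# Step 2: factor analysis; its loadings play the role of topics over terms.
fa = FactorAnalysis(n_components=n_topics, random_state=0)
fa.fit(X)
loadings = fa.components_  # shape: (n_topics, n_terms)

# Step 3: cluster each term under the factor on which it loads most strongly.
term_topic = np.abs(loadings).argmax(axis=0)
clusters = {k: [t for t, f in zip(terms, term_topic) if f == k]
            for k in range(n_topics)}

# Step 4: a crude UMass-style coherence proxy based on document co-occurrence.
X_bin = (X > 0).astype(int)
col = {t: i for i, t in enumerate(terms)}

def umass_coherence(cluster_terms):
    score, pairs = 0.0, 0
    for i, wi in enumerate(cluster_terms):
        for wj in cluster_terms[:i]:
            d_i = X_bin[:, col[wi]].sum()
            d_ij = (X_bin[:, col[wi]] & X_bin[:, col[wj]]).sum()
            score += np.log((d_ij + 1.0) / d_i)
            pairs += 1
    return score / pairs if pairs else 0.0

for k, cluster in clusters.items():
    print(f"topic {k}: {cluster}  coherence={umass_coherence(cluster):.3f}")
```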
