Localized user-driven topic discovery via boosted ensemble of nonnegative matrix factorization

Nonnegative matrix factorization (NMF) has been widely used in topic modeling of large-scale document corpora, where a set of underlying topics is extracted as a low-rank factor matrix from NMF. However, the resulting topics often convey only general, and thus redundant, information about the documents rather than minor but potentially meaningful information that users may care about. To address this problem, we present a novel NMF-based ensemble method that discovers meaningful local topics. Our method brings the idea of an ensemble model, which has shown advantages in supervised learning, into an unsupervised topic modeling context. That is, our model successively performs NMF on a residual matrix obtained from the previous stages and generates a sequence of topic sets. The update algorithm we employ is novel in two respects. The first lies in using the residual matrix, inspired by a state-of-the-art gradient boosting model; the second stems from applying a sophisticated local weighting scheme to the given matrix to enhance the locality of topics, which in turn yields high-quality, focused topics of interest to users. We subsequently extend this ensemble model with keyword- and document-based user interaction to support user-driven topic discovery.
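The stagewise idea in the abstract (run NMF, subtract the fit, run NMF again on what remains) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a plain multiplicative-update NMF solver, assumes the residual is clipped at zero so that each stage still receives a nonnegative matrix, and omits the local weighting scheme and the user interaction entirely. The function names `nmf` and `boosted_nmf` and all parameters are hypothetical.

```python
import numpy as np


def nmf(X, k, n_iter=200, seed=0, eps=1e-9):
    """Rank-k NMF, X ~ W @ H, via standard multiplicative updates.

    Assumption: any NMF solver could be used here; multiplicative
    updates are chosen only because they are short to write down.
    """
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W = rng.random((m, k)) + eps
    H = rng.random((k, n)) + eps
    for _ in range(n_iter):
        # Multiplicative updates keep W and H nonnegative by construction.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H


def boosted_nmf(X, k, n_stages):
    """Stagewise NMF on residuals (hypothetical helper).

    Each stage factorizes the current residual R, stores (W, H) as one
    topic set, and removes the part of R that the stage explained.
    Assumption: the residual is clipped at zero to stay nonnegative.
    """
    R = np.asarray(X, dtype=float).copy()
    stages = []
    for s in range(n_stages):
        W, H = nmf(R, k, seed=s)
        stages.append((W, H))
        R = np.maximum(R - W @ H, 0.0)
    return stages, R


if __name__ == "__main__":
    # Toy term-document matrix: 30 terms, 20 documents.
    X = np.random.default_rng(42).random((30, 20))
    stages, R = boosted_nmf(X, k=3, n_stages=4)
    print(len(stages), "topic sets; residual norm:", np.linalg.norm(R))
```

In this sketch, the columns of each `W` play the role of one stage's topics. Because every stage subtracts a nonnegative approximation and then clips at zero, the residual can only shrink entrywise, so later stages are pushed toward structure that earlier stages did not capture.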
