Discriminative Topic Mining via Category-Name Guided Text Embedding

Mining a set of meaningful and distinctive topics automatically from massive text corpora has broad applications. Existing topic models, however, typically work in a purely unsupervised way and often generate topics that do not fit users' particular needs and yield suboptimal performance on downstream tasks. We propose a new task, discriminative topic mining, which leverages a set of user-provided category names to mine discriminative topics from text corpora. This new task not only helps users understand clearly and distinctively the topics they are most interested in, but also directly benefits keyword-driven classification tasks. We develop CatE, a novel category-name guided text embedding method for discriminative topic mining, which effectively leverages minimal user guidance to learn a discriminative embedding space and discover category-representative terms in an iterative manner. We conduct a comprehensive set of experiments to show that CatE mines a high-quality set of topics guided by category names only, and benefits a variety of downstream applications, including weakly-supervised classification and lexical entailment direction identification.
