Discriminative Topic Mining via Category-Name Guided Text Embedding

Mining a set of meaningful and distinctive topics automatically from massive text corpora has broad applications. Existing topic models, however, typically work in a purely unsupervised way and often generate topics that do not fit users' particular needs and yield suboptimal performance on downstream tasks. We propose a new task, discriminative topic mining, which leverages a set of user-provided category names to mine discriminative topics from text corpora. This new task not only helps users understand, clearly and distinctively, the topics they are most interested in, but also directly benefits keyword-driven classification tasks. We develop CatE, a novel category-name guided text embedding method for discriminative topic mining, which effectively leverages minimal user guidance to learn a discriminative embedding space and discover category-representative terms in an iterative manner. We conduct a comprehensive set of experiments to show that CatE mines a high-quality set of topics guided by category names only and benefits a variety of downstream applications, including weakly-supervised classification and lexical entailment direction identification.
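To make the idea of growing category-representative terms from category names alone more concrete, here is a minimal sketch. It is not CatE itself: the abstract does not give the training objective, so this toy version assumes fixed, hand-made 2-D word vectors and uses a simple heuristic (pick the unassigned word whose cosine similarity to its own category centroid most exceeds its similarity to any rival centroid). The function names, the `min_margin` threshold, and the toy vocabulary are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def expand_categories(embeddings, category_names, rounds=2, min_margin=0.05):
    """Grow one term list per category, seeded only with the category name.

    Assumption: embeddings are fixed. The method described in the abstract
    learns the embedding space jointly, which this sketch does not attempt.
    """
    terms = {name: [name] for name in category_names}
    for _ in range(rounds):
        for name in category_names:
            # Recompute centroids so earlier additions in this round count.
            centroids = {
                c: mean_vector([embeddings[t] for t in terms[c]])
                for c in category_names
            }
            taken = {t for ts in terms.values() for t in ts}
            best_word, best_margin = None, min_margin
            for word, vec in embeddings.items():
                if word in taken:
                    continue
                own = cosine(vec, centroids[name])
                rival = max(
                    (cosine(vec, centroids[c]) for c in category_names if c != name),
                    default=0.0,
                )
                # Keep the word that is most distinctive for this category.
                if own - rival > best_margin:
                    best_word, best_margin = word, own - rival
            if best_word is not None:
                terms[name].append(best_word)
    return terms


# Hypothetical toy vectors, chosen by hand for illustration only.
toy = {
    "sports": [1.0, 0.0], "politics": [0.0, 1.0],
    "soccer": [0.9, 0.1], "election": [0.1, 0.9],
    "tennis": [0.8, 0.3], "senate": [0.2, 0.8],
    "news": [0.7, 0.7],
}
result = expand_categories(toy, ["sports", "politics"], rounds=2)
print(result)
```

The margin test is what makes the selection discriminative in this sketch: a generic word such as "news", which sits about equally close to both categories, is never selected, while words clearly tied to one category are.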

Chao Zhang | Zihan Wang | Yu Zhang | Yu Meng | Jiawei Han | Jiaxin Huang | Guangyuan Wang
