Event-Driven News Stream Clustering using Entity-Aware Contextual Embeddings

We propose a method for online news stream clustering that is a variant of the nonparametric streaming K-means algorithm. Our model uses a combination of sparse and dense document representations, aggregates document-cluster similarity along these multiple representations and makes the clustering decision using a neural classifier. The weighted document-cluster similarity model is learned using a novel adaptation of the triplet loss into a linear classification objective. We show that the use of a suitable fine-tuning objective and external knowledge in pre-trained transformer models yields significant improvements in the effectiveness of contextual embeddings for clustering. Our model achieves a new state-of-the-art on a standard stream clustering dataset of English documents.

Muthu Kumar Chandrasekaran | Kailash Karthik Saravanakumar | Miguel Ballesteros | Kathleen McKeown
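
The streaming clustering scheme described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: each document is assumed to arrive as a dictionary of named vectors (here the hypothetical keys "sparse" and "dense"), the per-representation weights are set by hand rather than learned with the triplet-based objective, and a fixed similarity threshold stands in for the neural classifier that makes the clustering decision. All function and parameter names are illustrative.

```python
import math
from dataclasses import dataclass


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


@dataclass
class Cluster:
    sums: dict      # running vector sum per representation name
    size: int = 1   # number of documents assigned so far

    def centroid(self, name):
        return [v / self.size for v in self.sums[name]]

    def add(self, doc):
        for name, vec in doc.items():
            self.sums[name] = [s + v for s, v in zip(self.sums[name], vec)]
        self.size += 1


def cluster_stream(docs, weights, threshold):
    """Assign each incoming document to a cluster, one pass, in arrival order.

    Assumption: a fixed threshold replaces the learned classifier, and
    `weights` (representation name -> weight) is hand-set, not learned.
    """
    clusters, labels = [], []
    for doc in docs:
        best_idx, best_score = None, float("-inf")
        for idx, cluster in enumerate(clusters):
            # Aggregate document-cluster similarity across representations.
            score = sum(
                w * cosine(doc[name], cluster.centroid(name))
                for name, w in weights.items()
            )
            if score > best_score:
                best_idx, best_score = idx, score
        if best_idx is not None and best_score >= threshold:
            clusters[best_idx].add(doc)
            labels.append(best_idx)
        else:
            # Nonparametric step: no cluster is close enough, so open a new one.
            clusters.append(Cluster(sums={n: list(v) for n, v in doc.items()}))
            labels.append(len(clusters) - 1)
    return labels
```

Calling cluster_stream with a list of such documents, a weight per representation, and a threshold returns one cluster index per document. The number of clusters is not fixed in advance, which is the property that makes the scheme nonparametric.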
