Event-Driven News Stream Clustering using Entity-Aware Contextual Embeddings

We propose a method for online news stream clustering that is a variant of the nonparametric streaming K-means algorithm. Our model combines sparse and dense document representations, aggregates document-cluster similarity across these representations, and makes the clustering decision using a neural classifier. The weighted document-cluster similarity model is learned using a novel adaptation of the triplet loss into a linear classification objective. We show that a suitable fine-tuning objective and external knowledge in pre-trained transformer models yield significant improvements in the effectiveness of contextual embeddings for clustering. Our model achieves a new state of the art on a standard stream clustering dataset of English documents.
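To make the clustering loop concrete, the following is a minimal sketch of a nonparametric streaming K-means step with a weighted sum of per-representation similarities. It is an illustration under stated assumptions, not the model described above: the class name `StreamClusterer`, the fixed `weights`, and the scalar `threshold` are all hypothetical, and the fixed threshold stands in for the neural classifier that makes the clustering decision. The weights are supplied by hand here, whereas the abstract describes learning them with a triplet-loss-derived objective.

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


class StreamClusterer:
    """Hypothetical sketch: one pass over the stream, clusters created on demand."""

    def __init__(self, weights, threshold):
        # One weight per document representation (e.g. sparse, dense).
        # Assumption: fixed by hand here, not learned.
        self.weights = np.asarray(weights, dtype=float)
        # Assumption: a scalar cut-off replaces the learned cluster-creation decision.
        self.threshold = threshold
        self.sums = []    # per cluster: running sum vector for each representation
        self.counts = []  # per cluster: number of documents assigned

    def score(self, reps, k):
        """Weighted sum of cosine similarities between a document and centroid k."""
        sims = np.array([
            cosine(r, s / self.counts[k])
            for r, s in zip(reps, self.sums[k])
        ])
        return float(self.weights @ sims)

    def add(self, reps):
        """Assign a document to the best cluster, or open a new one. Returns the cluster id."""
        reps = [np.asarray(r, dtype=float) for r in reps]
        scores = [self.score(reps, k) for k in range(len(self.sums))]
        best = int(np.argmax(scores)) if scores else -1
        if best >= 0 and scores[best] >= self.threshold:
            # Update the running sums so the centroid tracks the cluster mean.
            for s, r in zip(self.sums[best], reps):
                s += r
            self.counts[best] += 1
            return best
        # No existing cluster is similar enough: start a new one.
        self.sums.append([r.copy() for r in reps])
        self.counts.append(1)
        return len(self.sums) - 1
```

Each document is passed as a list of vectors, one per representation, in the same order as `weights`. Because the number of clusters is not fixed in advance, a document that is dissimilar to every existing centroid simply opens a new cluster.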
