SCStory: Self-supervised and Continual Online Story Discovery

We present SCStory, a framework for online story discovery that helps people digest rapidly published news article streams in real time without human annotations. To organize news article streams into stories, existing approaches directly encode the articles and cluster them by representation similarity. However, these methods yield noisy and inaccurate stories because generic article embeddings neither effectively reflect the story-indicative semantics in an article nor adapt to rapidly evolving news article streams. SCStory employs self-supervised and continual learning with a novel idea of story-indicative adaptive modeling of news article streams. With a lightweight hierarchical embedding module that first learns sentence representations and then article representations, SCStory identifies the story-relevant information in news articles and uses it to discover stories. The embedding module is continuously updated to adapt to evolving news streams with a contrastive learning objective, backed by two unique techniques, confidence-aware memory replay and prioritized augmentation, which address the problems of label absence and data scarcity. Thorough experiments on real and latest news data sets demonstrate that SCStory outperforms existing state-of-the-art algorithms for unsupervised online story discovery.
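The two core ingredients named above, a hierarchical embedding module that pools sentence representations into an article representation and a contrastive learning objective, can be sketched in minimal form. The attention pooling with a `story_query` vector and the InfoNCE-style loss below are illustrative stand-ins, not the paper's actual architecture or loss:

```python
import numpy as np

def article_embedding(sentence_embs: np.ndarray, story_query: np.ndarray) -> np.ndarray:
    """Pool sentence embeddings (num_sentences x dim) into one article embedding,
    weighting each sentence by its similarity to a story-indicative query vector
    (a hypothetical stand-in for learned, story-aware attention)."""
    scores = sentence_embs @ story_query          # one relevance score per sentence
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                      # softmax attention weights
    pooled = weights @ sentence_embs              # attention-weighted sum
    return pooled / np.linalg.norm(pooled)        # unit-normalize the article embedding

def contrastive_loss(anchor: np.ndarray, positive: np.ndarray,
                     negatives: list, tau: float = 0.1) -> float:
    """InfoNCE-style objective: pull the anchor toward its positive view
    (e.g., an augmented article) and away from negatives."""
    pos = np.exp(anchor @ positive / tau)
    neg = sum(np.exp(anchor @ n / tau) for n in negatives)
    return float(-np.log(pos / (pos + neg)))
```

Updating such a module continually on the stream, with replayed high-confidence past articles and prioritized augmented views supplying the positive pairs, is the kind of training loop the abstract describes.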
