BATS: A Spectral Biclustering Approach to Single Document Topic Modeling and Segmentation

Existing topic modeling and text segmentation methodologies generally require large datasets for training, which limits their usefulness when only small collections of text are available. In this work, we reexamine the interrelated problems of "topic identification" and "text segmentation" for sparse document learning, where there is a single new text of interest. Developing a methodology for single documents poses two major challenges. The first is sparse information: with access to only one document, we cannot train traditional topic models or deep learning algorithms. The second is significant noise: a considerable portion of the words in any single document contribute only noise and do not help discern topics or segments. To tackle these issues, we design an unsupervised, computationally efficient methodology called Biclustering Approach to Topic modeling and Segmentation (BATS). BATS leverages three key ideas to simultaneously identify topics and segment text: (i) a new mechanism that uses word order information to reduce sample complexity, (ii) a statistically sound graph-based biclustering technique that identifies latent structures of words and sentences, and (iii) a collection of effective heuristics that remove noise words and reward important words to further improve performance. Experiments on six datasets show that our approach outperforms several state-of-the-art baselines on topic coherence, topic diversity, segmentation, and runtime metrics.
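
To make idea (ii) concrete, the following is a minimal sketch of graph-based spectral biclustering on a single document. It is not the BATS algorithm itself: it follows the classic bipartite spectral partitioning recipe (degree-normalize the sentence-word count matrix, take the second singular vectors, split by sign) and omits the word-order mechanism and the noise-word heuristics. The function name `bicluster_two_topics`, the regex tokenizer, the fixed choice of two topics, and the power-iteration solver are all assumptions made for illustration, and every sentence is assumed to contain at least one word.

```python
import math
import re


def bicluster_two_topics(sentences, iterations=200):
    """Jointly split sentences and words into two groups (illustrative only)."""
    # Hypothetical tokenizer: lowercase alphabetic runs.
    tokenized = [re.findall(r"[a-z]+", s.lower()) for s in sentences]
    vocab = sorted({w for toks in tokenized for w in toks})
    col = {w: j for j, w in enumerate(vocab)}
    n, m = len(sentences), len(vocab)

    # Sentence-by-word count matrix A (the bipartite graph's biadjacency).
    A = [[0.0] * m for _ in range(n)]
    for i, toks in enumerate(tokenized):
        for w in toks:
            A[i][col[w]] += 1.0
    row_deg = [sum(row) for row in A]
    col_deg = [sum(A[i][j] for i in range(n)) for j in range(m)]

    # Degree-normalized matrix An = D1^{-1/2} A D2^{-1/2}.
    An = [[A[i][j] / math.sqrt(row_deg[i] * col_deg[j]) for j in range(m)]
          for i in range(n)]

    # The top right singular vector of An is known in closed form:
    # sqrt(col_deg), normalized. It carries no cluster information.
    total = sum(col_deg)
    v1 = [math.sqrt(d / total) for d in col_deg]

    # Power iteration on An^T An, kept orthogonal to v1, converges to the
    # second right singular vector. The start vector is deterministic.
    v = [math.sin(j + 1) for j in range(m)]
    for _ in range(iterations):
        dot = sum(a * b for a, b in zip(v, v1))
        v = [a - dot * b for a, b in zip(v, v1)]
        u = [sum(An[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [sum(An[i][j] * u[i] for i in range(n)) for j in range(m)]
        norm = math.sqrt(sum(a * a for a in v))
        v = [a / norm for a in v]
    u = [sum(An[i][j] * v[j] for j in range(m)) for i in range(n)]

    # Split by sign: sentences and words on the same side share a label.
    sentence_labels = [int(x > 0) for x in u]
    word_labels = {w: int(v[col[w]] > 0) for w in vocab}
    return sentence_labels, word_labels


if __name__ == "__main__":
    doc = ["cat chased mouse", "mouse feared cat",
           "stocks fell sharply", "investors sold stocks"]
    sentence_labels, word_labels = bicluster_two_topics(doc)
    print(sentence_labels)
    print(word_labels)
```

Because sentences and words are embedded by the same singular-vector pair, a sentence and the words that characterize it land on the same side of the split, which is what lets one procedure yield both topics (word groups) and a grouping of sentences. Which group receives label 0 or 1 is arbitrary.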
