BATS: A Spectral Biclustering Approach to Single Document Topic Modeling and Segmentation

Existing topic modeling and text segmentation methodologies generally require large datasets for training, which limits their usefulness when only small collections of text are available. In this work, we reexamine the interrelated problems of “topic identification” and “text segmentation” for sparse document learning, where there is a single new text of interest. In developing a methodology to handle single documents, we face two major challenges. The first is sparse information: with access to only one document, we cannot train traditional topic models or deep learning algorithms. The second is significant noise: a considerable portion of the words in any single document contribute only noise and do not help discern topics or segments. To tackle these issues, we design an unsupervised, computationally efficient methodology called Biclustering Approach to Topic modeling and Segmentation (BATS). BATS leverages three key ideas to simultaneously identify topics and segment text: (i) a new mechanism that uses word order information to reduce sample complexity, (ii) a statistically sound graph-based biclustering technique that identifies latent structures of words and sentences, and (iii) a collection of effective heuristics that remove noise words and reward important words to further improve performance. Experiments on six datasets show that our approach outperforms several state-of-the-art baselines on topic coherence, topic diversity, segmentation, and runtime metrics.
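To make idea (ii) concrete, the following is a minimal sketch of graph-based biclustering on a single document. It is not the BATS algorithm: it swaps in a classical bipartite spectral partitioning scheme, uses a toy sentence-by-word count matrix, and omits the word-order mechanism and the noise-word heuristics. The function names `bipartite_spectral_bicluster` and `segment_boundaries` are hypothetical, and the sketch assumes every sentence and every word has a nonzero count.

```python
import math


def bipartite_spectral_bicluster(counts, iters=200):
    """Split sentences (rows) and words (columns) into two co-clusters.

    Illustrative only: classical bipartite spectral partitioning via the
    second singular vector of the degree-normalized count matrix.
    Assumes all row and column sums are positive.
    """
    n_rows, n_cols = len(counts), len(counts[0])
    row_deg = [sum(row) for row in counts]
    col_deg = [sum(counts[i][j] for i in range(n_rows)) for j in range(n_cols)]

    # Degree-normalized matrix A_n = D_r^{-1/2} A D_c^{-1/2}.
    a_n = [[counts[i][j] / math.sqrt(row_deg[i] * col_deg[j])
            for j in range(n_cols)] for i in range(n_rows)]

    # The top right singular vector of A_n is known in closed form.
    total = math.sqrt(sum(col_deg))
    v1 = [math.sqrt(d) / total for d in col_deg]

    # Power iteration on A_n^T A_n, projecting out v1 at every step,
    # converges to the second right singular vector.
    v = [float(j + 1) for j in range(n_cols)]  # deterministic start
    for _ in range(iters):
        dot = sum(a * b for a, b in zip(v, v1))
        v = [a - dot * b for a, b in zip(v, v1)]
        u = [sum(a_n[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
        v = [sum(a_n[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]

    # Matching left singular direction for the sentences.
    u = [sum(a_n[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]

    # Rescaling by D^{-1/2} keeps signs, so splitting by sign is enough.
    sentence_labels = [0 if x >= 0 else 1 for x in u]
    word_labels = [0 if x >= 0 else 1 for x in v]
    return sentence_labels, word_labels


def segment_boundaries(sentence_labels):
    """Indices where consecutive sentences change cluster."""
    return [i for i in range(1, len(sentence_labels))
            if sentence_labels[i] != sentence_labels[i - 1]]


# Toy document: 4 sentences x 4 words, two topical blocks with weak overlap.
toy_counts = [
    [3, 2, 0, 0],
    [2, 3, 1, 0],
    [0, 0, 3, 2],
    [0, 1, 2, 3],
]
sentence_labels, word_labels = bipartite_spectral_bicluster(toy_counts)
print(sentence_labels, word_labels, segment_boundaries(sentence_labels))
```

In this sketch the word clusters play the role of topics, and a change in the cluster label between consecutive sentences marks a candidate segment boundary. Which cluster receives label 0 is arbitrary, since a singular vector is defined only up to sign.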
