BeamSeg: A Joint Model for Multi-Document Segmentation and Topic Identification

We propose BeamSeg, a joint model for segmentation and topic identification of documents from the same domain. The model assumes that lexical cohesion can be observed across documents, meaning that segments describing the same topic draw on a similar lexical distribution over the vocabulary. The model implements lexical cohesion in an unsupervised Bayesian setting by drawing segments with the same topic from the same language model. Contrary to previous approaches, we assume that language models are not independent, since vocabulary changes across consecutive segments are expected to be smooth rather than abrupt. We achieve this by using a dynamic Dirichlet prior that takes into account data contributions from other topics. BeamSeg also models the segment length properties of documents based on their modality (textbooks, slides, etc.). The evaluation is carried out on three datasets. In two of them, improvements of up to 4.8% and 7.3% are obtained in the segmentation and topic identification tasks, indicating that both tasks should be jointly modeled.
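
The lexical cohesion assumption can be illustrated with a minimal sketch. This is not the BeamSeg implementation: it only scores how much more likely two segments are when they share one language model (integrated out under a symmetric Dirichlet prior) than when each has its own. The function names `dcm_log_likelihood` and `cohesion_gain`, the prior value 0.1, and the toy segments are all assumptions made for illustration; the dynamic prior and the segment length model described above are not reproduced here.

```python
from collections import Counter
from math import lgamma


def dcm_log_likelihood(counts, vocab, prior=0.1):
    """Log marginal likelihood of word counts under a symmetric
    Dirichlet prior, with the language model integrated out.
    Hypothetical helper, not taken from the paper."""
    n = sum(counts[w] for w in vocab)
    v = len(vocab)
    ll = lgamma(v * prior) - lgamma(n + v * prior)
    for w in vocab:
        ll += lgamma(counts[w] + prior) - lgamma(prior)
    return ll


def cohesion_gain(seg_a, seg_b, vocab, prior=0.1):
    """Log-likelihood gained by drawing both segments from one shared
    language model instead of two independent ones."""
    a, b = Counter(seg_a), Counter(seg_b)
    shared = dcm_log_likelihood(a + b, vocab, prior)
    separate = (dcm_log_likelihood(a, vocab, prior)
                + dcm_log_likelihood(b, vocab, prior))
    return shared - separate


# Toy segments (assumed data): two about optimization, one about biology.
slides = "gradient descent updates the weights".split()
textbook = "the gradient updates weights slowly".split()
biology = "cells divide during mitosis phase".split()
vocab = sorted(set(slides) | set(textbook) | set(biology))

same_topic_gain = cohesion_gain(slides, textbook, vocab)
diff_topic_gain = cohesion_gain(slides, biology, vocab)
print(same_topic_gain > diff_topic_gain)
```

In this toy setting, the pair with overlapping vocabulary gains likelihood from sharing a language model, while the lexically disjoint pair loses it. That difference is the kind of signal a cohesion-based joint model can exploit when deciding whether segments from different documents belong to the same topic.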
