Topic segmentation with shared topic detection and alignment of multiple documents

Topic detection and tracking and topic segmentation play an important role in capturing the local and sequential information of documents. Previous work in this area usually focuses on single documents, although multiple similar documents are available in many domains. In this paper, we introduce a novel unsupervised method for shared topic detection and topic segmentation of multiple similar documents based on mutual information (MI) and weighted mutual information (WMI), which combines MI with term weights. The basic idea is that the optimal segmentation maximizes MI (or WMI). Our approach detects shared topics among documents, finds the optimal boundaries within each document, and aligns segments across documents at the same time. It can also handle single-document segmentation as a special case of multi-document segmentation and alignment. By using term weights based on entropy learned from multiple documents, our methods can identify and strengthen cue terms that are useful for segmentation and partially remove stop words. Our experimental results show that our algorithm works well for the tasks of single-document segmentation, shared topic detection, and multi-document segmentation. Utilizing information from multiple documents substantially improves the performance of topic segmentation, and WMI outperforms MI for multi-document segmentation.
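
The idea that the optimal segmentation maximizes MI can be illustrated with a minimal sketch for the single-document case. It assumes that MI is measured between terms and segments using raw term counts, and it finds one boundary by exhaustive search. The function names `mutual_information` and `best_single_boundary` are hypothetical, and the sketch omits the term weights, the multi-document alignment, and the optimization procedure of the method described above.

```python
import math
from collections import Counter


def mutual_information(segments):
    """MI (in bits) between terms and segments.

    `segments` is a list of segments; each segment is a list of
    sentences; each sentence is a list of term strings.
    Assumption: probabilities are estimated from raw term counts.
    """
    # Term counts within each segment.
    counts = [Counter(t for sent in seg for t in sent) for seg in segments]
    total = sum(sum(c.values()) for c in counts)

    # Term counts over the whole document.
    term_totals = Counter()
    for c in counts:
        term_totals.update(c)

    mi = 0.0
    for c in counts:
        p_s = sum(c.values()) / total          # p(segment)
        for term, n in c.items():
            p_ts = n / total                   # p(term, segment)
            p_t = term_totals[term] / total    # p(term)
            mi += p_ts * math.log2(p_ts / (p_t * p_s))
    return mi


def best_single_boundary(sentences):
    """Index b (1 <= b < len(sentences)) of the split that maximizes MI."""
    return max(
        range(1, len(sentences)),
        key=lambda b: mutual_information([sentences[:b], sentences[b:]]),
    )


if __name__ == "__main__":
    doc = [
        ["cat", "dog"],
        ["dog", "cat"],
        ["stock", "bond"],
        ["bond", "stock"],
    ]
    b = best_single_boundary(doc)
    print("boundary after sentence", b)
```

In this toy document the two halves share no vocabulary, so splitting between the second and third sentences makes the segment fully predictable from the term, which is the situation an MI-maximizing criterion rewards.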
