Topic Segmentation with Hybrid Document Indexing

We present a domain-independent unsupervised topic segmentation approach based on hybrid document indexing. Lexical chains have been successfully employed to evaluate lexical cohesion of text segments and to predict topic boundaries. Our approach is based on the notion of semantic cohesion. It uses spectral embedding to estimate semantic association between content nouns over a span of multiple text segments. Our method significantly outperforms the baseline on the topic segmentation task and achieves performance comparable to state-of-the-art methods that incorporate domain-specific information.
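The general idea of scoring cohesion across a span of segments with spectrally embedded terms can be illustrated with a minimal sketch. It is not the method of this paper: it assumes a toy symmetric noun association matrix, embeds nouns with a plain eigendecomposition, and compares the averaged vectors on either side of each candidate boundary by cosine similarity. All names (`term_embeddings`, `segment_vectors`, `cohesion_scores`), the window size, and the data are hypothetical.

```python
import numpy as np

def term_embeddings(assoc, k):
    """Embed terms using the top-k eigenvectors of a symmetric association matrix."""
    vals, vecs = np.linalg.eigh(assoc)            # eigenvalues in ascending order
    top = np.argsort(vals)[::-1][:k]              # indices of the k largest
    return vecs[:, top] * np.sqrt(np.clip(vals[top], 0.0, None))

def segment_vectors(segments, vocab, emb):
    """Represent each segment as the mean embedding of its known nouns."""
    index = {w: i for i, w in enumerate(vocab)}
    out = []
    for seg in segments:
        rows = [emb[index[w]] for w in seg if w in index]
        out.append(np.mean(rows, axis=0) if rows else np.zeros(emb.shape[1]))
    return np.array(out)

def cohesion_scores(seg_vecs, window=2):
    """Cosine similarity between the spans left and right of each gap."""
    scores = []
    for gap in range(1, len(seg_vecs)):
        left = seg_vecs[max(0, gap - window):gap].mean(axis=0)
        right = seg_vecs[gap:gap + window].mean(axis=0)
        denom = np.linalg.norm(left) * np.linalg.norm(right)
        scores.append(float(left @ right / denom) if denom > 0 else 0.0)
    return scores

# Toy data (assumed): two groups of nouns that only associate within their group.
vocab = ["bank", "loan", "rate", "goal", "team", "match"]
block = np.array([[2.0, 1.0, 1.0], [1.0, 2.0, 1.0], [1.0, 1.0, 2.0]])
assoc = np.zeros((6, 6))
assoc[:3, :3] = block
assoc[3:, 3:] = block

segments = [["bank", "loan"], ["rate", "loan"], ["team", "goal"], ["match", "team"]]

emb = term_embeddings(assoc, k=2)
scores = cohesion_scores(segment_vectors(segments, vocab, emb), window=2)

# The gap with the lowest cohesion is the most likely topic boundary;
# `boundary` is the index of the first segment after that gap.
boundary = int(np.argmin(scores)) + 1
print(scores, boundary)
```

Averaging over a window of segments on each side, rather than comparing only the two adjacent segments, is one simple way to reflect association "over a span of multiple text segments"; a real system would estimate the association matrix from a corpus rather than fix it by hand.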
