Proceedings of the ACL Workshop on Feature Engineering for Machine Learning in NLP, pages 32--39,

In this paper we examine topic segmentation of narrative documents, which are characterized by long passages of text with few headings. We first present results suggesting that previous topic segmentation approaches are not appropriate for narrative text. We then present a feature-based method that combines features from diverse sources with learned features. Applied to narrative books and encyclopedia articles, our method produces results significantly better than those of previous segmentation approaches. We also analyze individual features and show the benefit of generalization using outside resources.
