SegGen: A Genetic Algorithm for Linear Text Segmentation

This paper describes SegGen, a new algorithm for linear text segmentation on general corpora. It aims to segment texts into thematically homogeneous parts. Several existing methods address this task by creating boundaries sequentially. Here, we propose to consider all boundaries simultaneously by means of a genetic algorithm. SegGen uses two criteria: maximization of the internal cohesion of the formed segments and minimization of the similarity between adjacent segments. First experimental results are promising, and SegGen appears to be very competitive with existing methods.
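To make the two criteria concrete, the following is a minimal sketch of how a candidate segmentation could be scored. It is not SegGen's actual formulation, which this abstract does not give. Everything here is an assumption made for illustration: the encoding of a segmentation as a binary boundary vector, the bag-of-words cosine similarity, the convention that a single-sentence segment has cohesion 0, and the function names `split_segments`, `internal_cohesion`, `adjacent_similarity`, and `evaluate`.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def split_segments(sentences, boundaries):
    """Assumed encoding: boundaries[i] == 1 puts a boundary after sentence i."""
    segments, current = [], []
    for i, sentence in enumerate(sentences):
        current.append(sentence)
        if i < len(boundaries) and boundaries[i]:
            segments.append(current)
            current = []
    if current:
        segments.append(current)
    return segments


def internal_cohesion(segments):
    """Mean pairwise sentence similarity inside each segment, averaged over segments."""
    scores = []
    for seg in segments:
        vecs = [Counter(s.lower().split()) for s in seg]
        pairs = [(i, j) for i in range(len(vecs)) for j in range(i + 1, len(vecs))]
        if pairs:
            scores.append(sum(cosine(vecs[i], vecs[j]) for i, j in pairs) / len(pairs))
        else:
            # Assumption: a one-sentence segment provides no cohesion evidence.
            scores.append(0.0)
    return sum(scores) / len(scores)


def adjacent_similarity(segments):
    """Mean similarity between each segment and the segment that follows it."""
    vecs = [Counter(w for s in seg for w in s.lower().split()) for seg in segments]
    if len(vecs) < 2:
        return 0.0
    return sum(cosine(vecs[i], vecs[i + 1]) for i in range(len(vecs) - 1)) / (len(vecs) - 1)


def evaluate(sentences, boundaries):
    """Return (cohesion to maximize, adjacent similarity to minimize)."""
    segments = split_segments(sentences, boundaries)
    return internal_cohesion(segments), adjacent_similarity(segments)


sentences = ["cats purr", "cats sleep", "stocks fall", "stocks rise"]
good = evaluate(sentences, [0, 1, 0])  # boundary between the two topics
bad = evaluate(sentences, [1, 0, 0])   # boundary inside the first topic
```

Under these assumptions, a genetic algorithm would treat each boundary vector as an individual and use the returned pair as its two objectives, so all boundaries are assessed together rather than placed one at a time.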
