Topic segmentation using word-level semantic relatedness functions

Semantic relatedness (SR) is the problem of measuring how strongly two words are related to each other. Although a large body of research is devoted to developing new SR measures, their use in topic segmentation has not been explored. This research evaluates the performance of different SR measures on the topic segmentation problem. To this end, two topic segmentation algorithms that use the difference in SR of words are introduced. Our results indicate that an SR measure trained on a general-domain corpus achieves better results than topic segmentation algorithms using WordNet or simple word repetition. Furthermore, when compared with computationally more complex algorithms that perform global analysis, our local analysis, enhanced with general-domain lexical semantic information, achieves comparable results.
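To make the idea of local, SR-based segmentation concrete, the following is a minimal sketch and not the algorithms evaluated in this work. It assumes that SR is approximated by cosine similarity between word vectors, that adjacent sentence blocks are compared by their mean pairwise SR, and that a boundary is placed wherever this score falls below a fixed threshold. The function names, the `window` and `threshold` parameters, and the toy two-dimensional vectors are all hypothetical and chosen only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (assumed SR proxy)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def block_relatedness(left, right, vectors):
    """Mean pairwise SR between the words of two adjacent blocks."""
    scores = [
        cosine(vectors[a], vectors[b])
        for a in left
        for b in right
        if a in vectors and b in vectors  # skip out-of-vocabulary words
    ]
    return sum(scores) / len(scores) if scores else 0.0


def segment(sentences, vectors, window=1, threshold=0.5):
    """Return every index i at which a topic boundary precedes sentence i.

    Only the `window` sentences on each side of a gap are compared,
    so the analysis is local rather than global.
    """
    boundaries = []
    for i in range(1, len(sentences)):
        left = [w for s in sentences[max(0, i - window):i] for w in s]
        right = [w for s in sentences[i:i + window] for w in s]
        if block_relatedness(left, right, vectors) < threshold:
            boundaries.append(i)
    return boundaries


# Hypothetical hand-made vectors; a real system would load vectors
# trained on a general-domain corpus.
TOY_VECTORS = {
    "cat": (1.0, 0.1), "dog": (0.9, 0.2), "pet": (1.0, 0.0),
    "stock": (0.1, 1.0), "market": (0.0, 1.0), "bank": (0.2, 0.9),
}
TOY_SENTENCES = [
    ["cat", "pet"],
    ["dog", "pet"],
    ["stock", "market"],
    ["bank", "market"],
]

if __name__ == "__main__":
    print(segment(TOY_SENTENCES, TOY_VECTORS))
```

In this toy input the first two sentences share one topic and the last two another, so the only gap whose cross-block relatedness is low is the one before the third sentence. A fixed threshold is the simplest possible decision rule; depth-score or local-minimum criteria are common alternatives.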
