Exploring Influence of Topic Segmentation on Information Retrieval Quality

In this paper we examine whether, and to what extent, text segmentation can improve an information retrieval system. We assume that topic segmentation allows one to better model text structure, and therefore language itself, which in turn influences the quality of text representation. We propose a search pipeline based on text segmentation using the BigARTM tool and the TopicTiling algorithm. We test the initial hypothesis by conducting experiments with several baseline models on two text collections. The results are mixed: on one collection segmentation improves retrieval quality, while on the other it has no significant effect.
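
To make the segmentation step more concrete, the following is a minimal sketch of topic-based boundary detection in the spirit of TopicTiling. It is not the pipeline evaluated in the paper. It assumes that each sentence has already been mapped to a topic vector (for example, by a topic model trained with BigARTM); the function names `cosine`, `block_sum`, and `segment_boundaries`, the `window` parameter, and the cut-off of mean plus half a standard deviation of the depth scores are all illustrative assumptions.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def block_sum(vectors: Sequence[Sequence[float]], start: int, end: int) -> List[float]:
    """Sum the topic vectors of sentences start..end-1 into one block vector."""
    n_topics = len(vectors[0])
    return [sum(vec[k] for vec in vectors[start:end]) for k in range(n_topics)]


def segment_boundaries(sentence_topics: Sequence[Sequence[float]],
                       window: int = 2) -> List[int]:
    """Return indices of sentences that start a new segment.

    sentence_topics is a hypothetical input: one topic vector per sentence.
    """
    n = len(sentence_topics)
    if n < 2:
        return []

    # sims[i] compares the block ending at sentence i with the block
    # starting at sentence i + 1.
    sims = []
    for gap in range(1, n):
        left = block_sum(sentence_topics, max(0, gap - window), gap)
        right = block_sum(sentence_topics, gap, min(n, gap + window))
        sims.append(cosine(left, right))

    # Depth score at each local minimum: how far similarity drops relative
    # to the nearest peaks on the left and on the right.
    depths = {}
    for i, s in enumerate(sims):
        left_ok = i == 0 or sims[i - 1] >= s
        right_ok = i == len(sims) - 1 or sims[i + 1] >= s
        if not (left_ok and right_ok):
            continue
        l = i
        while l > 0 and sims[l - 1] >= sims[l]:
            l -= 1
        r = i
        while r < len(sims) - 1 and sims[r + 1] >= sims[r]:
            r += 1
        depths[i] = 0.5 * ((sims[l] - s) + (sims[r] - s))

    if not depths:
        return []

    # Assumed cut-off: mean depth plus half a standard deviation.
    values = list(depths.values())
    mean = sum(values) / len(values)
    std = math.sqrt(sum((d - mean) ** 2 for d in values) / len(values))
    threshold = mean + std / 2

    return [i + 1 for i, d in sorted(depths.items()) if d > 0 and d >= threshold]
```

In a retrieval setting, the returned indices could be used to split each document into segments that are then indexed and ranked separately, which is one way to read the hypothesis tested here.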
