Applying Topic Segmentation to Document-Level Information Retrieval

In this paper we discuss how text segmentation can be applied to information retrieval. We hypothesize that topic-based text segmentation models the structure of a text, and therefore the language itself, more accurately, which in turn improves the quality of text representations. We test this hypothesis by running several baseline models on the arXiv dataset and comparing their retrieval quality on whole texts and on segmented texts. The experiments show that segmentation generally yields a slight improvement in retrieval quality.
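The comparison between whole-text and segmented-text retrieval can be illustrated with a minimal sketch. It is not the experimental setup of the paper: the bag-of-words cosine scorer, the max-over-segments aggregation, and the names `score_whole`, `score_segmented`, and `rank` are all assumptions made for illustration, and the segment boundaries are assumed to be given by some topic segmenter.

```python
import math
from collections import Counter


def bow(text):
    """Lowercased bag-of-words term counts (whitespace tokenization)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def score_whole(query, segments):
    """Score a document as one text: all segments are concatenated."""
    return cosine(bow(query), bow(" ".join(segments)))


def score_segmented(query, segments):
    """Score a document by its best-matching segment (assumed aggregation)."""
    q = bow(query)
    return max((cosine(q, bow(s)) for s in segments), default=0.0)


def rank(query, docs, scorer):
    """Return document ids ordered by descending score."""
    return sorted(docs, key=lambda d: scorer(query, docs[d]), reverse=True)


# Toy collection: each document is a list of pre-computed topic segments.
docs = {
    "a": ["topic models find themes", "gpu search is fast"],
    "b": ["topic drift in speech"],
}

whole_ranking = rank("topic models", docs, score_whole)
segmented_ranking = rank("topic models", docs, score_segmented)
```

In this toy collection the on-topic segment of document "a" matches the query more closely than the whole document does, because the unrelated second segment no longer dilutes the representation. Whether this effect carries over to real collections is exactly what the experiments are meant to test.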
