Training and Domain Adaptation for Supervised Text Segmentation

Unlike traditional unsupervised text segmentation methods, recent supervised segmentation models rely on Wikipedia as the source of large-scale segmentation supervision. These models have, however, been evaluated predominantly on in-domain (Wikipedia-based) test sets, which prevents conclusions about how well they segment text from other domains. In this work, we focus on the domain-transfer performance of supervised neural text segmentation in the educational domain. To this end, we first introduce K12Seg, a new dataset for evaluating supervised segmentation, created from educational reading material for students from grade 1 to college level. We then benchmark a hierarchical text segmentation model (HITS), based on RoBERTa, in both in-domain and domain-transfer segmentation experiments. While HITS achieves state-of-the-art in-domain performance on three Wikipedia-based test sets, we show that, under standard full fine-tuning, it is susceptible to domain overfitting. We identify adapter-based fine-tuning as a remedy that substantially improves transfer performance.
