A Joint Model for Document Segmentation and Segment Labeling

Text segmentation aims to uncover latent structure by dividing the text of a document into coherent sections. Whereas previous work on text segmentation treats document segmentation and segment labeling as separate tasks, we show that the two tasks contain complementary information and are best addressed jointly. We introduce Segment Pooling LSTM (S-LSTM), a model that jointly segments a document and labels the resulting segments. In support of joint training, we develop a method for teaching the model to recover from errors by aligning the predicted and ground truth segments. We show that S-LSTM reduces segmentation error by 30% on average, while also improving segment labeling.
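To make the two ideas in the abstract concrete, the following is a minimal illustrative sketch, not the S-LSTM implementation. It assumes that sentences are already encoded as fixed-size vectors, that segment boundaries are given as a 0/1 flag per sentence, that pooling is a simple mean, and that predicted segments are aligned to ground truth segments by largest sentence overlap. The function names `segment_pool` and `align_by_overlap` are hypothetical, and the abstract does not specify the pooling operator or the alignment rule.

```python
from typing import List, Tuple


def segment_pool(sentence_vecs: List[List[float]],
                 boundaries: List[int]) -> List[List[float]]:
    """Mean-pool sentence vectors inside each segment (assumed pooling rule).

    boundaries[i] == 1 marks sentence i as the first sentence of a new
    segment; sentence 0 always starts a segment.
    """
    segments: List[List[List[float]]] = []
    for i, vec in enumerate(sentence_vecs):
        if i == 0 or boundaries[i] == 1:
            segments.append([])
        segments[-1].append(vec)

    pooled = []
    for seg in segments:
        dim = len(seg[0])
        pooled.append([sum(v[d] for v in seg) / len(seg) for d in range(dim)])
    return pooled


def align_by_overlap(pred_spans: List[Tuple[int, int]],
                     gold_spans: List[Tuple[int, int]]) -> List[int]:
    """Map each predicted span to the gold span it overlaps most.

    Spans are (start, end) sentence indices with end exclusive. Using the
    largest overlap is an assumption made for this sketch.
    """
    aligned = []
    for ps, pe in pred_spans:
        overlaps = [max(0, min(pe, ge) - max(ps, gs)) for gs, ge in gold_spans]
        aligned.append(max(range(len(gold_spans)), key=lambda j: overlaps[j]))
    return aligned


if __name__ == "__main__":
    vecs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    print(segment_pool(vecs, [1, 0, 1]))
    print(align_by_overlap([(0, 3), (3, 5)], [(0, 2), (2, 5)]))
```

In a joint setup of this kind, each pooled vector would feed a segment classifier, and the aligned ground truth segment would supply the label used as the training target when the predicted boundaries are imperfect.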
