A Neural CRF-based Hierarchical Approach for Linear Text Segmentation

We consider the problem of linearly segmenting unformatted text and transcripts according to their topical structure. While prior approaches are trained explicitly to predict segment boundaries, our approach solves this task by inferring the hierarchical segmentation structure of the input text. Given the lack of a large annotated dataset for this task, we propose a data curation strategy and create a corpus of over 700K Wikipedia articles with their hierarchical structures. We then propose the first supervised approach to generating hierarchical segmentation structures based on these annotations. Our method is based on a neural conditional random field (CRF) that explicitly models the statistical dependency between a node and its constituent child nodes. As part of our training strategy, we introduce a new data augmentation scheme that samples a variety of node aggregations, permutations, and removals, all of which help capture both fine-grained and coarse topical shifts in the data and improve model performance. Extensive experiments show that our model outperforms or is competitive with previous state-of-the-art algorithms in the following settings: rich-resource, cross-domain transferability, few-shot supervision, and segmentation when topic label annotations are provided.
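To make the relationship between a hierarchical structure and a linear segmentation concrete, the following is a minimal Python sketch. It is not the neural CRF described above; it only illustrates, under our own assumptions, how cutting a section tree at a chosen depth yields linear segment boundaries, and what aggregation, permutation, and removal of nodes could look like. The `Node` class, all function names, and the toy article are hypothetical and chosen for illustration only.

```python
import random
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical section node; in this sketch only leaves carry sentences."""
    title: str
    sentences: List[str] = field(default_factory=list)
    children: List["Node"] = field(default_factory=list)


def leaves(node: Node) -> List[Node]:
    """Return the leaf nodes under `node` in document order."""
    if not node.children:
        return [node]
    return [leaf for child in node.children for leaf in leaves(child)]


def segments_at_depth(node: Node, depth: int) -> List[List[str]]:
    """Cut the tree `depth` levels below `node`; each cut node is one segment."""
    if depth == 0 or not node.children:
        return [[s for leaf in leaves(node) for s in leaf.sentences]]
    return [seg for child in node.children
            for seg in segments_at_depth(child, depth - 1)]


def boundary_labels(segments: List[List[str]]) -> List[int]:
    """Label each sentence: 1 if it ends a segment, 0 otherwise."""
    labels: List[int] = []
    for seg in segments:
        if seg:  # skip empty segments
            labels.extend([0] * (len(seg) - 1) + [1])
    return labels


def aggregate(node: Node) -> Node:
    """Collapse a subtree into a single leaf (coarser topical unit)."""
    return Node(node.title, [s for leaf in leaves(node) for s in leaf.sentences])


def permute(node: Node, rng: random.Random) -> Node:
    """Return a copy of `node` with its children in shuffled order."""
    kids = node.children[:]
    rng.shuffle(kids)
    return Node(node.title, node.sentences[:], kids)


def remove(node: Node, rng: random.Random) -> Node:
    """Return a copy of `node` with one randomly chosen child dropped."""
    if len(node.children) < 2:
        return node
    drop = rng.randrange(len(node.children))
    kept = [c for i, c in enumerate(node.children) if i != drop]
    return Node(node.title, node.sentences[:], kept)


# Toy article (invented data): two top-level sections, one with subsections.
root = Node("Article", children=[
    Node("History", children=[Node("Early", ["a", "b"]), Node("Modern", ["c"])]),
    Node("Uses", ["d", "e"]),
])

print(boundary_labels(segments_at_depth(root, 1)))  # [0, 0, 1, 0, 1]
print(boundary_labels(segments_at_depth(root, 2)))  # [0, 1, 1, 0, 1]
```

Cutting at depth 1 gives the coarse segmentation (two segments), while depth 2 also exposes the finer shift between the two subsections. In this sketch the augmentations simply return modified copies of a node, so they can be composed before boundary labels are derived.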
