A Domain-Independent Text Segmentation Method for Educational Course Content

In this study, we propose a domain-independent text segmentation algorithm that is particularly useful for online educational courses. Text segmentation has been shown to improve the readability of large document corpora, which is essential in educational settings. Although existing domain-dependent text segmentation methods outperform domain-independent methods in most cases, only domain-independent methods are applicable when training content is sparse, as it is in educational settings. Unlike domain-dependent methods, our method does not require heavy training on prior documents; it needs to train only on the current corpus of documents, using topic distributions and word vector representations. The method identifies boundaries between small text units in three steps. First, we compute features of the input text from topic distributions (latent Dirichlet allocation) and word embeddings (GloVe). Second, we compute similarity values between these textual features and detect changes in the distribution of the similarities. Third, we cluster the similarities and detect sub-topic boundaries from the differences between clusters. We test our method on two datasets: one from an online education course and the widely used public Choi dataset. The results demonstrate that our method outperforms other state-of-the-art domain-independent text segmentation approaches while achieving performance comparable to a few domain-dependent algorithms.
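The three steps can be illustrated with a minimal Python sketch. It is not the authors' implementation: the hand-written `toy_features` vectors stand in for the LDA and GloVe features, cosine similarity is an assumed choice of similarity measure, and a tiny one-dimensional 2-means split stands in for the clustering step, whose details the abstract does not give. All function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def adjacent_similarities(features):
    """Step 2 (assumed form): similarity between each pair of neighbouring units."""
    return [cosine(features[i], features[i + 1]) for i in range(len(features) - 1)]

def low_cluster_mask(values, iters=20):
    """Step 3 stand-in: split 1-D values into a low and a high cluster (2-means).

    Returns a list of booleans, True where the value falls in the low cluster.
    If all values are (nearly) identical, nothing is marked as low.
    """
    lo, hi = min(values), max(values)
    if hi - lo < 1e-9:
        return [False] * len(values)
    for _ in range(iters):
        low = [v for v in values if abs(v - lo) <= abs(v - hi)]
        high = [v for v in values if abs(v - lo) > abs(v - hi)]
        if low:
            lo = sum(low) / len(low)
        if high:
            hi = sum(high) / len(high)
    return [abs(v - lo) <= abs(v - hi) for v in values]

def boundaries(features):
    """Return indices i such that a boundary is placed just before unit i."""
    sims = adjacent_similarities(features)
    return [i + 1 for i, is_low in enumerate(low_cluster_mask(sims)) if is_low]

# Step 1 stand-in: toy feature vectors for six text units (hypothetical values).
# Units 0-2 lean towards one topic, units 3-5 towards another.
toy_features = [
    [0.90, 0.10, 0.00],
    [0.80, 0.20, 0.00],
    [0.85, 0.10, 0.05],
    [0.10, 0.10, 0.80],
    [0.00, 0.20, 0.80],
    [0.05, 0.10, 0.85],
]

print(boundaries(toy_features))
```

On these toy vectors, the only low-similarity gap lies between units 2 and 3, so the sketch places a single boundary before unit 3. In practice, the feature vectors would come from a topic model and word embeddings trained on the current corpus, as the abstract describes.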
