MOOCon: A Framework for Semi-supervised Concept Extraction from MOOC Content

Recent years have witnessed the rapid development of Massive Open Online Courses (MOOCs). MOOC platforms not only offer a one-stop learning environment, but also aggregate a large number of courses with various kinds of textual content, e.g., video subtitles, quizzes, and forum posts. MOOCs can therefore be regarded as a large-scale ‘knowledge base’ covering many domains. However, all of the content generated by instructors and learners is unstructured. To give this data the structure needed for further knowledge management and mining, a natural first step is concept extraction. In this paper, we exploit human knowledge in the form of labeled data and propose a framework for concept extraction based on machine learning methods. The framework is flexible enough to support semi-supervised learning, which reduces the human effort of labeling training data. In addition, course-agnostic features are designed so that the model can handle cross-domain data. Experimental results demonstrate that labeling only 10% of the data can lead to acceptable performance, and that the semi-supervised learning method is comparable to the supervised version within the same framework. We find that different forms of textual content, i.e., subtitles, PPTs, and questions, should be processed separately because they differ in form. Finally, we evaluate a new task: identifying needs of concept comprehension. A model learned from subtitles performs well at this identification task on forum content.
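The abstract does not say which semi-supervised algorithm or classifier the framework uses. As an illustration only, the general idea of starting from a small labeled set and growing it with confident predictions on unlabeled data can be sketched as a self-training loop. Everything in this sketch is an assumption made for the example: the nearest-centroid classifier is a stand-in for whatever model the framework actually uses, and the names `train_centroids`, `predict_with_confidence`, `self_train`, the labels `"CONCEPT"`/`"O"`, and the confidence threshold are hypothetical.

```python
from collections import defaultdict


def train_centroids(X, y):
    """Fit a toy nearest-centroid classifier: one mean vector per label."""
    sums, counts = {}, defaultdict(int)
    for x, label in zip(X, y):
        if label not in sums:
            sums[label] = [0.0] * len(x)
        sums[label] = [s + v for s, v in zip(sums[label], x)]
        counts[label] += 1
    return {label: [s / counts[label] for s in vec] for label, vec in sums.items()}


def predict_with_confidence(centroids, x):
    """Return (label, confidence) for x.

    Confidence is the normalized margin between the two nearest centroids,
    so it is 0 when x is equidistant and approaches 1 when x sits on a centroid.
    Assumes at least two labels are present.
    """
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(x, c)) ** 0.5, label)
        for label, c in centroids.items()
    )
    (d1, label), (d2, _) = dists[0], dists[1]
    confidence = (d2 - d1) / (d2 + d1) if (d1 + d2) > 0 else 0.0
    return label, confidence


def self_train(X_lab, y_lab, X_unlab, threshold=0.5, max_rounds=10):
    """Self-training: repeatedly move confidently predicted examples
    from the unlabeled pool into the training set, then retrain."""
    X_lab, y_lab, pool = list(X_lab), list(y_lab), list(X_unlab)
    for _ in range(max_rounds):
        model = train_centroids(X_lab, y_lab)
        remaining, added = [], 0
        for x in pool:
            label, confidence = predict_with_confidence(model, x)
            if confidence >= threshold:
                X_lab.append(x)        # accept the pseudo-label
                y_lab.append(label)
                added += 1
            else:
                remaining.append(x)    # leave it for a later round
        pool = remaining
        if added == 0 or not pool:
            break
    return train_centroids(X_lab, y_lab), pool


# Toy usage: two labeled points, three unlabeled ones (hypothetical features).
model, leftover = self_train(
    X_lab=[[0.0, 0.0], [1.0, 1.0]],
    y_lab=["O", "CONCEPT"],
    X_unlab=[[0.1, 0.0], [0.9, 1.0], [0.5, 0.5]],
)
```

In this toy run the two unlabeled points near a labeled example receive pseudo-labels, while the ambiguous midpoint stays in the pool. A real system would replace the feature vectors and classifier with the framework's own features and model.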
