LiTeWi: A combined term extraction and entity linking method for eliciting educational ontologies from textbooks

Considerable effort has been devoted to ontology learning, that is, semiautomatic processes for constructing domain ontologies from diverse sources of information. In the past few years, a research trend has focused on the construction of educational ontologies, that is, ontologies to be used for educational purposes. Identifying the terminology is crucial to building ontologies, and term extraction techniques allow the domain-related terms to be identified from electronic resources. This paper presents LiTeWi, a novel method that combines current unsupervised term extraction approaches to create educational ontologies for technology-supported learning systems from electronic textbooks. LiTeWi uses Wikipedia as an additional information source. Wikipedia contains more than 30 million articles covering the terminology of nearly every domain in 288 languages, which makes it an appropriate generic corpus for term extraction. Furthermore, given that its content is available in several languages, it promotes both domain and language independence. LiTeWi is intended for teachers, who usually develop their didactic material from textbooks. To evaluate its performance, LiTeWi was tuned using a textbook on object-oriented programming and then tested with two textbooks from different domains: astronomy and molecular biology.
