Knowledge models from PDF textbooks

ABSTRACT Textbooks are educational documents created, structured and formatted by domain experts with the primary purpose of explaining domain knowledge to a novice. Authors draw on their understanding of the domain when structuring and formatting a textbook's content to facilitate this explanation. As a result, the formatting and structural elements of textbooks carry elements of domain knowledge implicitly encoded by their authors. This paper presents an extensible approach to the automated extraction of knowledge models from textbooks and the enrichment of their content with additional links (both internal and external). The textbooks themselves essentially become hypertext documents in which individual pages are annotated with important concepts of the domain. The evaluation experiments examine several aspects and stages of the approach: the accuracy of model extraction; the pragmatic quality of the extracted models, assessed through one of their possible applications, semantic linking of textbooks in the same domain; the accuracy of linking models to external knowledge sources; and the effect of integrating multiple textbooks from the same domain. The results indicate high accuracy of model extraction on the symbolic, syntactic and structural levels across textbooks and domains, and demonstrate the added value of the extracted models on the semantic level.
