Ontological Document Reading: An Experience Report

Ontological document reading is the automatic and appropriate population of a conceptual model that represents an ontological conceptualization of some fragment of the real world. Appropriately populating the conceptualization involves not only extracting information with respect to the declared object and relationship sets of the conceptual model but also checking the extracted information for real-world constraint violations, standardizing the data, and inferring the unwritten information that a document author intended to convey. Appropriate population may, in addition, require adjustments to the ontology itself. This report presents the approach in terms of an effort to build a system that extracts the genealogical information in family history books. It reports the status of the reading system and explains how the generated results can be imported into, and thus contribute to the construction of, a large repository of worldwide family interrelationships. It also foreshadows the reading system’s potential use for constructing similar knowledge repositories in other domains.
