Thesis: automatic ontology generation from web tabular structures

Turning the current Web into a Semantic Web requires automatic approaches to document annotation, since manual approaches will not scale in general. This thesis focuses on the automatic transformation of arbitrary table-like structures into knowledge models, i.e., ontologies. The presented work is based on Hurst's table model and consists of a methodology, an accompanying implementation named TARTAR, and a thorough evaluation. The evaluation showed a success rate of over 80% for the automatic transformation of tables into semantic representations and 100% accuracy in the task of query answering over the table contents.
