The HiLeX System for Semantic Information Extraction

The explosive growth and popularity of the Web have produced a huge number of digital information sources on the Internet. Unfortunately, such sources manage only data, rather than the knowledge that data carries. Recognizing, extracting, and structuring relevant information according to its semantics is therefore a crucial task. Several approaches in the field of Information Extraction (IE) have been proposed to support the translation of semi-structured and unstructured documents into structured data or knowledge. Most of them achieve high precision but, because they are mainly syntactic, they often have low recall, depend on the document format, and ignore the semantics of the information they extract. In this paper, we describe a new approach to semantic information extraction that could serve as the basis for automatically extracting highly structured data from unstructured web sources without an undesirable trade-off between precision and recall. In short, the approach (i) is ontology driven, (ii) is based on a unified representation of documents, (iii) integrates existing IE techniques, (iv) implements semantic regular expressions, (v) has been implemented through Answer Set Programming, (vi) is employed in real-world applications, and (vii) has received positive feedback from business customers.
