DIADEM: domain-centric, intelligent, automated data extraction methodology

Search engines are the sinews of the web. These sinews have become strained, however: where the web once functioned as a mix of library and yellow pages, it has become the central marketplace for information of almost any kind. We search more and more for objects with specific characteristics: a car with a certain mileage, an affordable apartment close to a good school, or the latest accessory for our phones. Search engines all too often fail to provide reasonable answers, making us sift through dozens of websites with thousands of offers, never sure that a better offer isn't just around the corner. What search engines lack is an understanding of the objects and their attributes published on websites. Automatically identifying and extracting these objects is akin to alchemy: transforming unstructured web information into highly structured data with near-perfect accuracy. With DIADEM we present a formula for this transformation, though at a price. DIADEM identifies and extracts data from a website with high accuracy, but to do so it must be provided with extensive knowledge about the ontology and phenomenology of the domain, i.e., about the entities (and relations) and about how these entities are represented in the textual, structural, and visual language of a website of this domain. In this demonstration, we show with a first prototype that, in contrast to the alchemists, DIADEM has developed a viable formula.
