ODE: Ontology-assisted data extraction

Online databases respond to a user query with result records encoded in HTML files. Data extraction, which many applications depend on, recovers those records from the HTML files automatically. We present ODE (Ontology-assisted Data Extraction), a novel method that automatically extracts query result records from HTML pages. ODE first constructs a domain ontology by matching information between the query interfaces and the query result pages of different Web sites within the same domain. It then uses the constructed ontology during data extraction to identify the query result section in a query result page and to align and label the data values in the extracted records. The method is fully automatic and overcomes many of the deficiencies of current automatic data extraction methods. Experimental results show that ODE is highly accurate at three tasks: identifying the query result section in an HTML page, segmenting that section into query result records, and aligning and labeling the data values in those records.
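
To make the role of the ontology concrete, the following is a minimal sketch of the general idea, not the ODE algorithm itself. It assumes a hand-written toy ontology for a hypothetical "book" domain (ODE builds its ontology automatically), and the names ONTOLOGY, label_value, score_section, find_result_section and label_record are illustrative inventions. The sketch picks the page section whose values best match the ontology and then labels each value in a record with the matching attribute.

```python
import re

# Hypothetical toy ontology for a "book" domain (an assumption for this sketch).
# Each attribute carries known instance values and, optionally, a value pattern.
ONTOLOGY = {
    "title": {"values": {"dune", "emma"}, "pattern": None},
    "price": {"values": set(), "pattern": re.compile(r"^\$\d+(\.\d{2})?$")},
    "format": {"values": {"hardcover", "paperback"}, "pattern": None},
}


def label_value(value):
    """Return the first ontology attribute that matches a data value, or None."""
    text = value.strip().lower()
    for attribute, spec in ONTOLOGY.items():
        if text in spec["values"]:
            return attribute
        if spec["pattern"] is not None and spec["pattern"].match(text):
            return attribute
    return None


def score_section(section):
    """Count the data values in a section (a list of records) that match the ontology."""
    return sum(1 for record in section for value in record if label_value(value))


def find_result_section(sections):
    """Pick the candidate section with the most ontology matches."""
    return max(sections, key=score_section)


def label_record(record):
    """Pair each data value in a record with its ontology label (None if unmatched)."""
    return [(label_value(value), value) for value in record]


# Two candidate page sections: a navigation bar and a list of result records.
sections = [
    [["Home", "Login"]],
    [["Dune", "$9.99", "Paperback"], ["Emma", "$7.50", "Hardcover"]],
]
result_section = find_result_section(sections)
labeled = [label_record(record) for record in result_section]
```

In this toy setting the navigation section scores zero matches, so the record list is selected and each of its values receives an attribute label. A real system would also have to segment the section into records and align values across records, which this sketch takes as given.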
