OntoWrap- extracting data records from search engine results pages using ontological technique

Current automatic wrappers using DOM tree and visual properties of data records to extract the required information from the search engine results pages generally have limitations such as the inability to check the similarity of tree structures accurately. Our study on the properties of data records shows that these data records located in search engine results pages are not only having similar visual properties and tree structures, but they are also related semantically in their contents. In this context, we propose an ontological technique using existing lexical database for English (WordNet) for the extraction of data records. We find that wrappers designed based on ontological technique are able to reduce the number of potential data regions to be extracted, thus they are able to improve the data extraction accuracy. We then use visual cue from the browser rendering engine to locate and extract the relevant data region from the web page by measuring the size of text and image of data records. Experimental results indicate that our technique is robust and performs better than the existing state of the art visual based wrappers.

[1]  Wei Liu,et al.  ViDE: A Vision-Based Approach for Deep Web Data Extraction , 2010, IEEE Transactions on Knowledge and Data Engineering.

[2]  Clement T. Yu,et al.  Bootstrapping Domain Ontology for Semantic Web Services from Source Web Sites , 2005, TES.

[3]  Roy Rada,et al.  Development and application of a metric on semantic nets , 1989, IEEE Trans. Syst. Man Cybern..

[4]  Christiane Fellbaum,et al.  Lexical Chains as Representations of Context for the Detection and Correction of Malapropisms , 1998 .

[5]  Graeme Hirst,et al.  Semantic distance in WordNet: An experimental, application-oriented evaluation of five measures , 2004 .

[6]  Carole A. Goble,et al.  Investigating Semantic Similarity Measures Across the Gene Ontology: The Relationship Between Sequence and Annotation , 2003, Bioinform..

[7]  Bing Liu,et al.  Web data extraction based on partial tree alignment , 2005, WWW '05.

[8]  David W. Embley,et al.  Conceptual-Model-Based Data Extraction from Multiple-Record Web Pages , 1999, Data Knowl. Eng..

[9]  Jian-Yun Nie Heterogeneous Web Data Extraction using Ontology , 2001 .

[10]  Max J. Egenhofer,et al.  Determining Semantic Similarity among Entity Classes from Different Ontologies , 2003, IEEE Trans. Knowl. Data Eng..

[11]  Vijay V. Raghavan,et al.  Fully automatic wrapper generation for search engines , 2005, WWW '05.

[12]  Christiane Fellbaum,et al.  Combining Local Context and Wordnet Similarity for Word Sense Identification , 1998 .

[13]  Louise E. Moser,et al.  Extracting data records from the web using tag path clustering , 2009, WWW '09.

[14]  Robert L. Grossman,et al.  Mining data records in Web pages , 2003, KDD '03.

[15]  Martha Palmer,et al.  Verb Semantics and Lexical Selection , 1994, ACL.

[16]  Yonghuai Liu,et al.  Visual Segmentation-Based Data Record Extraction from Web Documents , 2007, 2007 IEEE International Conference on Information Reuse and Integration.

[17]  David W. Conrath,et al.  Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy , 1997, ROCLING/IJCLCLP.

[18]  Christiane Fellbaum,et al.  Book Reviews: WordNet: An Electronic Lexical Database , 1999, CL.

[19]  Weifeng Su,et al.  ODE: Ontology-assisted data extraction , 2009, TODS.