Incorporating Optional Labeling And Dynamic Tagging With Combining Tag And Value Similarity

Query result pages are generated by web databases in response to a user's query. This work extracts data from these pages automatically, using a technique that combines tag and value similarity. The Query Result Records (QRRs) in each query result page are identified and segmented, and the segmented QRRs are then aligned into a table. Non-contiguous QRRs, which arise when auxiliary information separates the records, are also handled. We propose two new techniques, Optional Labeling and Dynamic Tag Structuring, which improve the efficiency of data extraction. All tags are first stored temporarily in a database, from which the relevant tags are extracted. The tag structure is handled dynamically so that extraction is more accurate.
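To make the alignment step concrete, the following is a minimal sketch of how tag similarity and value similarity might be combined to align segmented QRRs into table columns. It is not the method of this paper: the abstract does not specify the similarity measures, so the sketch assumes a generic sequence-matching ratio from Python's standard `difflib`, an equal weighting of the two scores, a threshold of 0.5, and a simple greedy column assignment. All function names, parameters, and sample records are hypothetical.

```python
from difflib import SequenceMatcher

# A field is a (tag_path, text_value) pair, e.g. (("td", "a"), "Some title").
# A record (QRR) is a list of such fields. Both shapes are assumptions.


def tag_similarity(tags_a, tags_b):
    """Similarity of two tag paths in [0, 1], using difflib's matching ratio."""
    return SequenceMatcher(None, tags_a, tags_b).ratio()


def value_similarity(value_a, value_b):
    """Case-insensitive character-level similarity of two text values in [0, 1]."""
    return SequenceMatcher(None, value_a.lower(), value_b.lower()).ratio()


def combined_similarity(field_a, field_b, tag_weight=0.5):
    """Weighted sum of tag and value similarity (the weight is an assumed parameter)."""
    tags_a, value_a = field_a
    tags_b, value_b = field_b
    return (tag_weight * tag_similarity(tags_a, tags_b)
            + (1.0 - tag_weight) * value_similarity(value_a, value_b))


def align_records(records, threshold=0.5):
    """Greedily align record fields into columns.

    Each column is represented by the first field assigned to it. A field
    joins the most similar column not yet used by its own record, provided
    the combined similarity reaches the threshold; otherwise it opens a new
    column. Returns a table (list of rows) with None for missing values.
    """
    columns = []  # representative field for each column
    rows = []
    for record in records:
        row = {}  # column index -> text value
        for field in record:
            best_index, best_score = None, threshold
            for index, representative in enumerate(columns):
                if index in row:
                    continue  # one value per column per record
                score = combined_similarity(field, representative)
                if score >= best_score:
                    best_index, best_score = index, score
            if best_index is None:
                columns.append(field)
                best_index = len(columns) - 1
            row[best_index] = field[1]
        rows.append(row)
    # Pad every row to the final number of columns.
    return [[row.get(i) for i in range(len(columns))] for row in rows]


if __name__ == "__main__":
    record_1 = [(("td", "a"), "Data Mining Concepts"),
                (("td", "span"), "$45.00")]
    record_2 = [(("td", "a"), "Data Mining Methods"),
                (("td", "i"), "In stock"),
                (("td", "span"), "$39.50")]
    for table_row in align_records([record_1, record_2]):
        print(table_row)
```

In this sketch the second record carries an extra field that has no counterpart in the first, so it opens a new column and the first record receives an empty cell there. Columns appear in the order they are first seen, not in page order; a fuller implementation would need to handle that, along with the labeling and tag-structuring steps the abstract describes.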
