Leveraging Webpage Classification for Data Object Recognition

Data-rich webpages are an increasingly important data source for web applications. Although data object recognition has been studied intensively, it is mostly treated as a process separate from the preceding task of identifying relevant webpages. In this paper, we propose a method that leverages the classification result of data-rich webpages for efficient and scalable data object recognition. We propose a novel form of context information, which can be inferred from the webpage classification and exploited in bottom-up data object recognition. Experimental results show that this context information brings a 19% improvement in the running efficiency of bottom-up data object recognition.
