Automatically Extracting Form Labels

We describe a machine-learning-based approach for extracting attribute labels from Web form interfaces. These labels are required by several techniques that retrieve and integrate data residing in online databases hidden behind form interfaces, including schema matching and clustering, and hidden-Web crawling. Whereas previous approaches to this problem have relied on heuristics and manually specified extraction rules, our technique uses learned classifiers to identify form labels. Preliminary experiments suggest that the approach is promising and achieves high accuracy.
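
To make the idea of classifier-based label extraction concrete, the following is a minimal sketch and not the method described here. It assumes a Naive Bayes classifier over two purely positional features of each (text, field) pair: whether the text comes before or after the field, and how many tokens apart they are. The classifier choice, the features, and all names (`FormTokenizer`, `candidate_pairs`, `train`, `extract_labels`) are hypothetical and chosen only for illustration; a practical system would likely also use layout and typographic cues.

```python
from collections import Counter
from html.parser import HTMLParser


class FormTokenizer(HTMLParser):
    """Flatten a form into an ordered list of ('text', str) / ('field', name) tokens."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        if tag in ("input", "select", "textarea"):
            self.tokens.append(("field", dict(attrs).get("name", "")))

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.tokens.append(("text", text))


def tokenize(html):
    parser = FormTokenizer()
    parser.feed(html)
    parser.close()
    return parser.tokens


def features(text_pos, field_pos):
    # Assumed features: side of the field the text is on, and capped token distance.
    side = "before" if text_pos < field_pos else "after"
    gap = min(abs(text_pos - field_pos), 3)
    return (("side", side), ("gap", gap))


def candidate_pairs(tokens):
    """Yield (text, field_name, features) for every text/field combination."""
    for j, (kind_j, name) in enumerate(tokens):
        if kind_j != "field":
            continue
        for i, (kind_i, text) in enumerate(tokens):
            if kind_i == "text":
                yield text, name, features(i, j)


# Number of distinct values per feature, used for Laplace smoothing.
N_VALUES = {"side": 2, "gap": 3}


def train(tokens, gold):
    """Count feature values for true and false pairs; gold maps field name -> label text."""
    class_counts, feat_counts = Counter(), Counter()
    for text, name, feats in candidate_pairs(tokens):
        y = gold.get(name) == text
        class_counts[y] += 1
        for f in feats:
            feat_counts[(y, f)] += 1
    return class_counts, feat_counts


def prob_label(model, feats):
    """Naive Bayes posterior that a (text, field) pair is a true label pair."""
    class_counts, feat_counts = model
    total = sum(class_counts.values())
    score = {}
    for y in (True, False):
        p = (class_counts[y] + 1) / (total + 2)
        for f in feats:
            p *= (feat_counts[(y, f)] + 1) / (class_counts[y] + N_VALUES[f[0]])
        score[y] = p
    return score[True] / (score[True] + score[False])


def extract_labels(model, tokens):
    """Assign each field the candidate text with the highest posterior."""
    best = {}
    for text, name, feats in candidate_pairs(tokens):
        p = prob_label(model, feats)
        if name not in best or p > best[name][1]:
            best[name] = (text, p)
    return {name: text for name, (text, _) in best.items()}


# Toy data, invented for illustration only.
TRAIN_HTML = '<form>Title: <input name="title"> Author: <input name="author"></form>'
GOLD = {"title": "Title:", "author": "Author:"}
TEST_HTML = ('<form>Make <input name="make"> Model <input name="model"> '
             'Price <input name="price"></form>')

model = train(tokenize(TRAIN_HTML), GOLD)
labels = extract_labels(model, tokenize(TEST_HTML))
print(labels)
```

The sketch learns from one hand-labeled form that labels tend to sit immediately before their fields, then applies that learned preference to an unseen form instead of hard-coding it as a rule.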
