Research on Automate Discovery of Deep Web Interfaces

The main way to obtain information from the Deep Web is to submit query conditions through the query interfaces that sites provide, so discovering these interfaces is the first problem a Deep Web data integration system must solve. At present, most researchers assume that a query interface is defined only within the HTML form tag. This paper first proposes the concept of an interface block, then designs an interface block location method based on page and visual information, and finally treats the judgment of whether an interface block is a query interface as a special multi-class classification problem, solved with a classification algorithm that combines the C4.5 decision tree and SVM. Experiments on the TEL-8 data set from UIUC indicate that the method achieves an accuracy of 97.30% and is feasible and practical.
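To make the idea of classifying a candidate block more concrete, the following is a minimal sketch of how control-level features might be extracted from a fragment of HTML before being passed to a classifier. It is not the paper's method: the feature names, the `ControlFeatureExtractor` class, and the `extract_features` helper are hypothetical, and the abstract does not state which features or visual cues the authors actually use. The sketch only illustrates that controls can be counted whether or not they sit inside a form tag.

```python
from html.parser import HTMLParser

# Input types treated as free-text fields and as submit-like buttons.
# These groupings are an assumption made for this illustration.
TEXT_TYPES = {"text", "search"}
SUBMIT_TYPES = {"submit", "button", "image"}


class ControlFeatureExtractor(HTMLParser):
    """Counts query-control features in an HTML fragment (hypothetical feature set)."""

    def __init__(self):
        super().__init__()
        self.features = {
            "text": 0,          # free-text inputs and textareas
            "password": 0,      # password inputs (typical of login forms)
            "select": 0,        # drop-down lists
            "submit": 0,        # submit-like buttons
            "outside_form": 0,  # counted controls not enclosed by a form tag
        }
        self._form_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self._form_depth += 1
            return
        kind = self._control_kind(tag, dict(attrs))
        if kind is None:
            return
        self.features[kind] += 1
        if self._form_depth == 0:
            self.features["outside_form"] += 1

    def handle_endtag(self, tag):
        if tag == "form" and self._form_depth > 0:
            self._form_depth -= 1

    @staticmethod
    def _control_kind(tag, attrs):
        """Maps a tag to one of the feature names, or None if it is not counted."""
        if tag == "select":
            return "select"
        if tag == "textarea":
            return "text"
        if tag == "button":
            return "submit"
        if tag == "input":
            # A missing or empty type attribute defaults to a text field.
            input_type = (attrs.get("type") or "text").lower()
            if input_type in TEXT_TYPES:
                return "text"
            if input_type == "password":
                return "password"
            if input_type in SUBMIT_TYPES:
                return "submit"
        return None  # hidden inputs, checkboxes, and non-control tags are ignored


def extract_features(html):
    """Returns the feature counts for one candidate block of HTML."""
    parser = ControlFeatureExtractor()
    parser.feed(html)
    parser.close()
    return parser.features


if __name__ == "__main__":
    block = (
        '<div><input type="text" name="q">'
        '<select name="cat"><option>Books</option></select>'
        "<button>Search</button></div>"
    )
    print(extract_features(block))
```

A feature vector of this kind could be given to a decision tree or SVM implementation; how the paper combines the two classifiers is not described in the abstract, so that step is left out here.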
