Data Extraction from Deep Web Pages

In this paper, we propose a novel model for extracting data from Deep Web pages. The model has four layers, of which the access schedule, the extraction layer, and the data cleaner are based on rules of structure, logic, and application. In the experiment section, we apply the new model to three intelligent systems: scientific paper retrieval, electronic ticket ordering, and resume searching. The results show that the proposed method is robust and feasible.
