Paper information - Automatic information extraction from web pages

Many web pages have implicit structure. In this paper, we show the feasibility of automatically extracting data from web pages by using approximate matching techniques. Applications include automatic wrapper generation, notification and display of differences between web pages, and web page change monitoring.
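
The abstract does not spell out the matching algorithm, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes two pages generated from the same template, tokenizes them crudely into tags and words, and uses Python's standard `difflib.SequenceMatcher` as a stand-in for approximate matching. The function names `tokenize` and `extract_varying_fields` and the sample pages are hypothetical.

```python
import re
from difflib import SequenceMatcher


def tokenize(html):
    """Split HTML into tags and whitespace-separated text words (a crude tokenizer)."""
    return re.findall(r"<[^>]+>|[^<\s]+", html)


def extract_varying_fields(page_a, page_b):
    """Align two pages and return the token runs where they differ.

    Tokens shared by both pages are treated as template (implicit structure);
    the runs that differ are treated as candidate data fields.
    """
    a, b = tokenize(page_a), tokenize(page_b)
    # autojunk=False keeps frequent tokens such as <td> from being discarded
    # as noise on long pages.
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    fields = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            fields.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return fields


# Hypothetical pages sharing one table-row template.
page_a = "<tr><td>Widget</td><td>$10</td></tr>"
page_b = "<tr><td>Gadget</td><td>$12</td></tr>"

print(extract_varying_fields(page_a, page_b))
# [('Widget', 'Gadget'), ('$10', '$12')]
```

The same alignment serves both uses named in the abstract: comparing two different pages from one site hints at where a wrapper should extract data, while comparing two snapshots of one page over time yields the changes to report.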

Roland H. C. Yap | Budi Rahardjo

[1] Fred Douglis, et al. Tracking and Viewing Changes on the Web, 1996, USENIX Annual Technical Conference.

[2] Craig A. Knoblock, et al. Semi-automatic wrapper generation for Internet information sources, 1997, Proceedings of CoopIS 97: 2nd IFCIS Conference on Cooperative Information Systems.

[3] Craig A. Knoblock, et al. A hierarchical approach to wrapper induction, 1999, AGENTS '99.

[4] Stephen Soderland, et al. Learning to Extract Text-Based Information from the World Wide Web, 1997, KDD.

[5] Jennifer Widom, et al. The TSIMMIS Project: Integration of Heterogeneous Information Sources, 1994, IPSJ.

[6] Arnaud Sahuguet, et al. WysiWyg Web Wrapper Factory (W4F), 1999.

[7] Fred Douglis, et al. TopBlend: An Efficient Implementation of HtmlDiff in Java, 2000, WebNet.

[8] Dan Smith, et al. Information extraction for semi-structured documents, 1997.