Paper Information - Extracting Attributes and their Values from Web pages

Extracting Attributes and their Values from Web pages

We propose a method for extracting attributes and their values from Web pages. Our method makes use of word distributions estimated from plain Web pages. The key idea is to estimate word distribution by consulting ontologies built from HTML tables. In a series of experiments, we show that estimated word distributions are useful for extracting attributes and their values in various kinds of HTML representations other than tables.

Kentaro Torisawa | Minoru Yoshida | Jun'ichi Tsujii

[1] Jun'ichi Tsujii, et al. Extracting ontologies from World Wide Web via HTML tables, 2001.

[2] Craig A. Knoblock, et al. A hierarchical approach to wrapper induction, 1999, AGENTS '99.

[3] Chun-Nan Hsu, et al. Generating Finite-State Transducers for Semi-Structured Data Extraction from the Web, 1998, Inf. Syst.

[4] I. Good. The Population Frequencies of Species and the Estimation of Population Parameters, 1953.

[5] Hsin-Hsi Chen, et al. Mining Tables from Large Scale HTML Texts, 2000, COLING.

[6] Dayne Freitag, et al. Information Extraction from HTML: Application of a General Machine Learning Approach, 1998, AAAI/IAAI.

[7] Nicholas Kushmerick, et al. Wrapper Induction for Information Extraction, 1997, IJCAI.