Extracting Attributes and Their Values from Web Pages

We propose a method for extracting attributes and their values from Web pages. The method relies on word distributions estimated from plain Web pages. The key idea is to estimate these distributions by consulting ontologies built from HTML tables. In a series of experiments, we show that the estimated word distributions are useful for extracting attributes and their values from various kinds of HTML representations other than tables.