Automatic Annotation for the Generation of Extraction Rules

Current Web information extraction systems are supervised: they require manually annotated training instances in order to learn extraction rules. The annotation is tedious and must be repeated whenever Web sites change. In this paper, we present a finite-state-transducer-based method of automatic annotation that can handle pages with missing attributes, multiple-valued attributes, and attributes that appear in varying orders. Moreover, we augment it with probability theory to reduce the uncertainty of the state machine. The experimental results show that our algorithm annotates Web pages efficiently and accurately and thus speeds up the learning of extraction rules in Web information extraction systems.