Paper Information - Automatic Annotation for the Generation of Extraction Rules

Automatic Annotation for the Generation of Extraction Rules

Current Web information extraction systems are supervised systems that require manual annotation of training instances in order to learn extraction rules. This annotation is tedious, and the annotations become outdated when Web sites are upgraded. In this paper, we present a finite-state-transducer-based method of automatic annotation that can deal with pages containing missing attributes, multiple-valued attributes, and multi-ordering attributes. Moreover, we augment it with probability theory to reduce the uncertainty of the state machine. The experimental results show that our algorithm can annotate Web pages efficiently and accurately, and thus speed up the learning of extraction rules in Web information extraction systems.

Rong Chen | Yufei Shi

[1] Hongjun Lu, et al. iASA: Learning to Annotate the Semantic Web, 2005, J. Data Semant.

[2] Khaled Shaalan, et al. Criteria for Evaluating Information Extraction Systems, 2006.

[3] I. V. Ramakrishnan, et al. Automatic Annotation of Content-Rich HTML Documents: Structural and Semantic Analysis, 2003, SEMWEB.

[4] Khaled Shaalan, et al. A Survey of Web Information Extraction Systems, 2006, IEEE Transactions on Knowledge and Data Engineering.

[5] Stephen Soderland, et al. Learning Information Extraction Rules for Semi-Structured and Free Text, 1999, Machine Learning.

[6] Chun-Nan Hsu, et al. Generating Finite-State Transducers for Semi-Structured Data Extraction from the Web, 1998, Inf. Syst.

[7] Arthur Stutt, et al. MnM: Ontology Driven Semi-automatic and Automatic Support for Semantic Markup, 2002, EKAW.

[8] Steffen Staab, et al. S-CREAM: Semiautomatic CREAtion of Metadata, 2002, SAAKM@ECAI.

[9] Alexiei Dingli, et al. Automatic Semantic Annotation Using Unsupervised Information Extraction and Integration, 2003.

[10] Dayne Freitag, et al. Information Extraction from HTML: Application of a General Machine Learning Approach, 1998, AAAI/IAAI.