Information extraction (IE) is an important research topic in data mining. In this paper we introduce a web information extraction approach based on similar patterns. Constructing the pattern library is the knowledge acquisition bottleneck of such an approach, so we use a similarity-based method to acquire patterns automatically from a large-scale corpus. Starting from a given set of seed patterns, relevant patterns are learned from unlabeled training web pages, and the generated patterns can be put to use after minor manual correction. Compared with other algorithms, our approach requires much less human intervention and avoids hand-tagging the training corpus. Experimental results show that the acquired patterns achieve an IE precision of 79.45% and a recall of 66.51% in an open test.
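The seed-driven acquisition step described above can be illustrated with a minimal sketch. The abstract does not specify the similarity measure or the pattern representation, so everything here is an assumption made for illustration: patterns are modeled as tuples of tokens, Jaccard overlap stands in for the paper's similarity computation, and the names `jaccard`, `acquire_patterns`, and the `threshold` value are hypothetical.

```python
from typing import List, Tuple

# Assumption: a pattern is a sequence of tokens or slot labels.
Pattern = Tuple[str, ...]


def jaccard(a: Pattern, b: Pattern) -> float:
    """Token-set overlap; a stand-in for the paper's similarity measure."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def acquire_patterns(
    seeds: List[Pattern],
    candidates: List[Pattern],
    threshold: float = 0.5,  # hypothetical cutoff, not from the paper
) -> List[Tuple[Pattern, float]]:
    """Keep candidates whose best similarity to any seed meets the threshold."""
    accepted = []
    for cand in candidates:
        if cand in seeds:
            continue  # already known, nothing to learn
        score = max((jaccard(cand, seed) for seed in seeds), default=0.0)
        if score >= threshold:
            accepted.append((cand, score))
    # Highest-scoring patterns first, so manual review starts with the best.
    accepted.sort(key=lambda item: item[1], reverse=True)
    return accepted


seeds = [("<PERSON>", "was", "appointed", "<POSITION>")]
candidates = [
    ("<PERSON>", "was", "named", "<POSITION>"),
    ("<COMPANY>", "acquired", "<COMPANY>"),
]
result = acquire_patterns(seeds, candidates)
for pattern, score in result:
    print(" ".join(pattern), round(score, 2))
```

In this toy run only the first candidate survives, since it shares three of five distinct tokens with the seed. In the setting the abstract describes, the accepted patterns would then go through the brief manual correction pass before being used for extraction.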