Data extraction from Internet sources has attracted the interest of the scientific community in recent years. However, there are still no well-established standards, because the information on the global network is heterogeneous. Nevertheless, the sources share one property: for compatibility reasons, all the data is available in HTML format. This article presents our methodology and the prototype system we have created to extract data from HTML pages. We use XPath as the data extraction language and have developed a methodology for visual wrapper generation. Our approach takes advantage of the implicit correlation between the data and the surrounding structure. We also report some evaluation tests to justify our methods.
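To make the idea of XPath as a data extraction language concrete, here is a minimal sketch in Python. It is not the prototype system described above. The page fragment, the `extract_column` helper, and the two XPath expressions are assumptions made for illustration. The sketch uses only the limited XPath subset in the standard library's `xml.etree.ElementTree`, so it requires well-formed markup; real HTML pages would usually need a tolerant parser first.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment (an assumption for this sketch).
PAGE = """
<html>
  <body>
    <table id="products">
      <tr><th>Name</th><th>Price</th></tr>
      <tr><td>Keyboard</td><td>25.00</td></tr>
      <tr><td>Mouse</td><td>12.50</td></tr>
    </table>
  </body>
</html>
"""

# The expressions locate data through the surrounding structure:
# the table is identified by its id, the field by its column position.
NAME_XPATH = ".//table[@id='products']/tr/td[1]"
PRICE_XPATH = ".//table[@id='products']/tr/td[2]"


def extract_column(page: str, xpath: str) -> list[str]:
    """Return the stripped text of every element matched by xpath."""
    root = ET.fromstring(page)
    return [(node.text or "").strip() for node in root.findall(xpath)]


names = extract_column(PAGE, NAME_XPATH)
prices = extract_column(PAGE, PRICE_XPATH)

# Pair the two columns into records, one per data row.
records = list(zip(names, prices))
print(records)  # [('Keyboard', '25.00'), ('Mouse', '12.50')]
```

The header row is skipped without special handling because its cells are `th` elements, which the `td` steps never match. This is one small instance of relying on the structure around the data rather than on the data itself.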