Paper information - A Platform for Cross-Lingual, Domain and User Adaptive Web Information Extraction


This paper describes an advanced platform for web information extraction (IE) that enables customisation to different domains, languages and users' interests. The platform resulted from the R&D project CROSSMARC, which involved both academic and industrial organisations. It is composed of a core system for web IE and a customisation infrastructure. The system implements a distributed, multi-agent, open and multilingual architecture that integrates components for (a) collecting domain-specific web pages using crawling and spidering technologies, (b) extracting information from the collected web pages using natural language processing and machine learning techniques, and (c) presenting the extracted information according to users' interests by employing user modelling techniques. The platform's customisation infrastructure provides an ontology management system and various customisation methods and tools for creating application-specific resources. The platform enables cross-lingual IE, supporting four languages in its current implementation, and has been tested in three different applications.
