A Platform for Cross-Lingual, Domain and User Adaptive Web Information Extraction

This paper describes a platform for web information extraction (IE) that can be customised to different domains, languages and users' interests. The platform is the result of the R&D project CROSSMARC, which involved both academic and industrial organisations. It consists of a core system for web IE and a customisation infrastructure. The core system implements a distributed, multi-agent, open and multilingual architecture that integrates components for (a) collecting domain-specific web pages using crawling and spidering technologies, (b) extracting information from the collected web pages using natural language processing and machine learning techniques, and (c) presenting the extracted information according to users' interests using user modelling techniques. The customisation infrastructure provides an ontology management system and various customisation methods and tools for creating application-specific resources. The platform enables cross-lingual IE, supporting four languages in its current implementation, and has been tested in three different applications.
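
The three stages (a) to (c) can be illustrated with a minimal sketch. It is not the CROSSMARC implementation: every name in it (`Page`, `Fact`, `UserProfile`, `collect`, `extract`, `present`, `run_pipeline`) is hypothetical, and simple keyword matching stands in for the crawling, natural language processing, machine learning and user modelling components. The only idea it tries to convey is that language-specific surface forms are mapped to shared, language-independent attributes, so that results from pages in different languages can be filtered against a single user profile.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    url: str
    language: str
    text: str


@dataclass
class Fact:
    url: str
    language: str
    attributes: dict


@dataclass
class UserProfile:
    # Hypothetical user model: accepted languages plus required attribute values.
    languages: set
    required: dict = field(default_factory=dict)


def collect(pages, domain_terms):
    """Stage (a): keep pages mentioning any domain term (stand-in for crawling)."""
    return [p for p in pages if any(t in p.text.lower() for t in domain_terms)]


def extract(page, lexicon):
    """Stage (b): map language-specific surface forms to shared attributes.

    `lexicon` is an assumed structure: language -> {surface form: (attribute, value)}.
    """
    text = page.text.lower()
    attributes = {}
    for surface, (attribute, value) in lexicon.get(page.language, {}).items():
        if surface in text:
            attributes[attribute] = value
    return Fact(page.url, page.language, attributes)


def present(facts, profile):
    """Stage (c): keep facts matching the user's languages and required values."""
    return [
        f for f in facts
        if f.language in profile.languages
        and all(f.attributes.get(k) == v for k, v in profile.required.items())
    ]


def run_pipeline(pages, domain_terms, lexicon, profile):
    """Chain the three stages: collect, extract, present."""
    collected = collect(pages, domain_terms)
    facts = [extract(p, lexicon) for p in collected]
    return present(facts, profile)
```

In this sketch, adapting to a new domain or language would amount to supplying different `domain_terms` and `lexicon` entries, which loosely mirrors the role the abstract assigns to the application-specific resources.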