Semi-Structured Complex List Extraction

The semi-structured information available in HTML and similar documents provides valuable cues that can be used in information extraction applications. This information, together with other technical information about how to retrieve pages, can be used to automatically extract individual pieces of information and various types of lists. The goal is to put as much intelligence as possible into the system so that as little knowledge and work as possible is required of the users, i.e. a user-driven extraction system. The advantage of a user-driven system is that the service it provides is available not only to experts but also to ordinary users, thereby making the service available to a wide audience. A problem with some lists in documents is that the structure differs among the elements of the list, which makes it more difficult to take advantage of the semi-structured information. The agent-oriented system described in this paper allows a user without expert skills to train an extraction system to extract singletons, lists, and complex lists. The complex list type is intended to handle lists whose elements vary in structure. The experiments conducted show that a user can train the system to extract pieces of information from different sites with very little knowledge and a small amount of work. However, additional work is still needed to handle more advanced extraction tasks.