Supporting Case Acquisition and Labelling in the Context of Web Mining

Case acquisition and labelling are important bottlenecks for predictive data mining. In the web context, a cascade of supporting techniques can be used, ranging from general ones such as user interfaces, through filtering based on keyword frequency, to web-specific techniques that exploit public search engines. We show how a synergistic application of multiple techniques can help in obtaining and pre-processing textual data, in particular for web mining based on inductive logic programming (ILP). The two-fold learning task itself consists of the construction and disambiguation of categorisation rules, which are then used to process the results returned by web search engines.