Discovering Test Set Regularities in Relational Domains

Machine learning typically involves discovering regularities in a training set, then applying these learned regularities to classify objects in a test set. In this paper we present an approach to discovering additional regularities in the test set, and show that in relational domains such test set regularities can be used to improve classification accuracy beyond that achieved using the training set alone. For example, we have previously shown how FOIL, a relational learner, can learn to classify Web pages by discovering training set regularities in the words occurring on target pages, and on other pages related by hyperlinks. Here we show how the classification accuracy of FOIL on this task can be improved by discovering additional regularities on the test set pages that must be classified. Our approach can be seen as an extension to Kleinberg’s Hubs and Authorities algorithm that analyzes hyperlink relations among Web pages. We present evidence that this new algorithm leads to better test set precision and recall on three binary Web classification tasks where the test set Web pages are taken from different Web sites than the training set.