Autocorrelation and Linkage Cause Bias in Evaluation of Relational Learners

Two common characteristics of relational data sets -- concentrated linkage and relational autocorrelation -- can cause traditional evaluation methods to greatly overestimate the accuracy of induced models on test sets. We identify these characteristics, define quantitative measures of their severity, and explain how they produce this bias. We show how linkage and autocorrelation affect estimates of model accuracy by applying FOIL to synthetic data and to data drawn from the Internet Movie Database. We also show how a modified sampling procedure can eliminate the bias.
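As a rough illustration of the sampling idea, the sketch below splits objects into training and test sets so that no linked neighbor is shared between the two. It is not the procedure from the paper, only a minimal group-aware split under assumed names: `split_by_linked_object`, the `links` mapping, and the movie and studio identifiers are all hypothetical.

```python
import random
from collections import defaultdict


def split_by_linked_object(links, test_fraction=0.3, seed=0):
    """Split objects so that train and test never share a linked neighbor.

    links: dict mapping each object id to the id of the neighbor it links to
           (hypothetical example: movie id -> studio id).
    """
    # Group objects by the neighbor they are linked to.
    groups = defaultdict(list)
    for obj, neighbor in links.items():
        groups[neighbor].append(obj)

    # Shuffle the neighbors (not the objects) so whole groups move together.
    neighbors = sorted(groups)
    rng = random.Random(seed)
    rng.shuffle(neighbors)

    # Fill the test set group by group until it reaches the target size.
    target = test_fraction * len(links)
    train, test = [], []
    for neighbor in neighbors:
        if len(test) < target:
            test.extend(groups[neighbor])
        else:
            train.extend(groups[neighbor])
    return train, test


# Hypothetical toy data: six movies linked to four studios.
links = {"m1": "s1", "m2": "s1", "m3": "s2", "m4": "s3", "m5": "s3", "m6": "s4"}
train, test = split_by_linked_object(links)
```

Because whole groups are assigned to one side, the test set can overshoot `test_fraction` when groups are large; that is the price of keeping linked objects together.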
