INFERRING AND REVISING THEORIES WITH CONFIDENCE: ANALYZING BILINGUALISM IN THE 1901 CANADIAN CENSUS

This paper shows how machine learning can help in analyzing and understanding historical change. Using data from the Canadian census of 1901, we discover the influences on bilingualism in Canada at the beginning of the twentieth century. The discovered theories partly agree with, and partly complement, the existing views of historians on this question. Our approach, based around a decision tree, not only infers theories directly from data, but also evaluates existing theories and revises them to improve their consistency with the data. One novel aspect of this work is the use of confidence intervals to determine which factors are both statistically and practically significant, and thus contribute appreciably to the overall accuracy of the theory. When inducing a decision tree directly from data, confidence intervals determine when new tests should be added. When an existing theory is being evaluated, confidence intervals also determine when old tests should be replaced or deleted to improve the theory. Our aim is to minimize the changes made to an existing theory to accommodate the new data. To this end, we propose a semantic measure of similarity between trees and demonstrate how it can be used to limit the changes made.
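The idea of using a confidence interval to decide whether a candidate test is worth adding can be illustrated with a minimal sketch. This is not the procedure used in the paper, whose details the abstract does not give. It assumes a normal-approximation interval on the per-example accuracy gain, and the names `gain_interval`, `should_add_test`, `min_gain`, and `z` are hypothetical. A test is accepted only when the lower bound of the interval exceeds a practical-significance threshold, so a gain must be both reliably nonzero and large enough to matter.

```python
import math


def gain_interval(before_correct, after_correct, z=1.96):
    """Normal-approximation confidence interval for the mean accuracy gain.

    before_correct / after_correct: booleans, one per example, saying whether
    the example is classified correctly without / with the candidate test.
    z: critical value (1.96 gives roughly a 95% interval); an assumption here.
    """
    if len(before_correct) != len(after_correct):
        raise ValueError("both sequences must cover the same examples")
    n = len(before_correct)
    if n < 2:
        raise ValueError("need at least two examples")
    # Per-example gain is +1 (fixed), 0 (unchanged), or -1 (broken).
    gains = [int(a) - int(b) for b, a in zip(before_correct, after_correct)]
    mean = sum(gains) / n
    variance = sum((g - mean) ** 2 for g in gains) / (n - 1)
    half_width = z * math.sqrt(variance / n)
    return mean - half_width, mean + half_width


def should_add_test(before_correct, after_correct, min_gain=0.01, z=1.96):
    """Accept the test only if the whole interval lies above min_gain.

    min_gain is a hypothetical practical-significance threshold.
    """
    lower, _ = gain_interval(before_correct, after_correct, z)
    return lower > min_gain


# Illustrative, made-up data: 100 examples, 30 of them fixed by the new test.
before = [True] * 60 + [False] * 40
after = [True] * 90 + [False] * 10
decision = should_add_test(before, after)
```

With a threshold such as `min_gain=0.01`, a test that fixes only 5 of 1000 examples is rejected even if its gain is consistently positive, since the gain is too small to contribute appreciably to overall accuracy.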
