Towards learning rules from natural texts

In this paper, we consider the problem of inductively learning rules from specific facts extracted from texts. This problem is challenging for two reasons. First, natural texts are radically incomplete, since there are always too many facts to mention. Second, natural texts are systematically biased towards novelty and surprise, so the learner sees an unrepresentative sample of what is true. Our solutions to both problems rest on a generative observation model of what is mentioned and what is extracted, given what is true. We first present a Multiple-predicate Bootstrapping approach, which iteratively learns if-then rules based on an implicit observation model and then imputes the new facts implied by the learned rules. Second, we present an iterative ensemble co-learning approach, in which multiple decision trees are learned from bootstrap samples of the incomplete training data and facts are imputed by a weighted majority vote.
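The learn-then-impute loop described above can be illustrated with a small sketch. It is not the authors' implementation. It assumes facts are Boolean attributes with `None` marking an unmentioned fact, restricts rules to a single condition, and scores rules only on records where both condition and conclusion are observed. The names `learn_rules`, `impute`, and `bootstrap`, the thresholds `min_conf` and `min_support`, and the toy predicates `a`, `b`, `c` are all hypothetical.

```python
from itertools import permutations


def learn_rules(records, min_conf=0.9, min_support=2):
    """Learn single-condition rules (a == va) -> (b == vb) from observed values.

    Assumption: missing values (None) are ignored when scoring a rule.
    """
    attrs = sorted({k for r in records for k in r})
    rules = []
    for a, b in permutations(attrs, 2):
        for va in (True, False):
            # Records where the condition holds and the conclusion is observed.
            covered = [r for r in records
                       if r.get(a) is va and r.get(b) is not None]
            if len(covered) < min_support:
                continue
            for vb in (True, False):
                conf = sum(r[b] is vb for r in covered) / len(covered)
                if conf >= min_conf:
                    rules.append((a, va, b, vb, conf))
    return rules


def impute(records, rules):
    """Fill in missing facts implied by the rules; return how many were filled."""
    changed = 0
    for r in records:
        for a, va, b, vb, _ in rules:
            if r.get(a) is va and r.get(b) is None:
                r[b] = vb
                changed += 1
    return changed


def bootstrap(records, max_iters=10, **kw):
    """Alternate rule learning and imputation until nothing new is imputed."""
    records = [dict(r) for r in records]  # work on a copy
    rules = []
    for _ in range(max_iters):
        rules = learn_rules(records, **kw)
        if impute(records, rules) == 0:
            break
    return records, rules


# Toy data with hypothetical predicates a, b, c; None means "not mentioned".
records = [
    {"a": True, "b": True, "c": None},
    {"a": True, "b": True, "c": True},
    {"a": True, "b": None, "c": True},
    {"a": False, "b": False, "c": None},
    {"a": False, "b": False, "c": False},
    {"a": None, "b": False, "c": False},
]
completed, rules = bootstrap(records)
```

On this toy data every missing value is filled in the first pass, and the second pass imputes nothing, so the loop stops. A fuller treatment would also model why a fact went unmentioned rather than ignoring missing values when scoring, which is the role the abstract assigns to the observation model.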
