Learning Comprehensible Theories from Structured Data

This thesis is concerned with the problem of learning comprehensible theories from structured data, covering primarily classification and regression learning. The basic knowledge representation language is a polymorphically typed, higher-order logic. The general setup is closely related to the learning-from-propositionalized-knowledge and learning-from-interpretations settings in Inductive Logic Programming. Individuals (also called instances) are represented as terms in the logic. A grammar-like construct called a predicate rewrite system is used to define features in the form of predicates that individuals may or may not satisfy. For learning, decision-tree algorithms of various kinds are adopted.

The scope of the thesis spans both theory and practice. On the theoretical side, I study:

1. the representational power of different function classes and the relationships between them;
2. the sample complexity of some commonly used predicate classes, particularly those involving sets and multisets;
3. the computational complexity of various optimization problems associated with learning, and algorithms for solving them; and
4. the (efficient) learnability of different function classes in the PAC and agnostic PAC models.

On the practical side, the usefulness of the learning system developed is demonstrated with applications in two important domains: bioinformatics and intelligent agents. Specifically, the thesis covers:

1. a solution to a benchmark multiple-instance learning problem and some useful lessons that can be drawn from it;
2. a successful attempt at a knowledge discovery problem in predictive toxicology, one that serves as further proof of concept that real chemical knowledge can be obtained by symbolic learning;
3. a reworking of an exercise in relational reinforcement learning, together with some new insights and techniques for this interesting problem; and
4. a general approach for personalizing user agents that takes full advantage of symbolic learning.
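As a rough illustration of the representation described above (this is a hypothetical sketch, not the thesis's actual predicate rewrite syntax or its learning system): individuals can be structured terms such as multisets, and features are boolean predicates over those terms that a decision-tree learner may test at its nodes. The names `has_at_least`, `mol_a`, and `mol_b` below are invented for this example.

```python
# Hypothetical sketch: individuals as structured terms (here, multisets of
# atom labels), features as boolean predicates that a tree learner can test.
from collections import Counter

def has_at_least(element, n):
    """Feature constructor: does the individual contain >= n atoms of `element`?"""
    return lambda molecule: molecule[element] >= n

# Two toy 'molecules' represented as multisets of atom symbols.
mol_a = Counter({"C": 6, "H": 6})   # benzene-like
mol_b = Counter({"C": 1, "O": 2})   # CO2-like

feature = has_at_least("C", 2)
print(feature(mol_a))  # True
print(feature(mol_b))  # False
```

A predicate rewrite system, in this spirit, would enumerate a space of such feature predicates from grammar-like rules, and the decision-tree learner would search that space for the predicate giving the best split at each node.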

[1]  Ashwin Srinivasan,et al.  Warmr: a data mining tool for chemical data , 2001, J. Comput. Aided Mol. Des..

[2]  Alberto Maria Segre,et al.  Programs for Machine Learning , 1994 .

[3]  Leslie Pack Kaelbling,et al.  Input Generalization in Delayed Reinforcement Learning: An Algorithm and Performance Comparisons , 1991, IJCAI.

[4]  J. Langford Tutorial on Practical Prediction Theory for Classification , 2005, J. Mach. Learn. Res..

[5]  Tomás Lozano-Pérez,et al.  A Framework for Multiple-Instance Learning , 1997, NIPS.

[6]  A M Richard,et al.  Structure-based methods for predicting mutagenicity and carcinogenicity: are we there yet? , 1998, Mutation research.

[7]  David Haussler,et al.  Occam's Razor , 1987, Inf. Process. Lett..

[8]  Qi Zhang,et al.  EM-DD: An Improved Multiple-Instance Learning Technique , 2001, NIPS.

[9]  Sergei O. Kuznetsov,et al.  Toxicology Analysis by Means of the JSM-method , 2003, Bioinform..

[10]  Paul E. Utgoff,et al.  Decision Tree Induction Based on Efficient Tree Restructuring , 1997, Machine Learning.

[11]  Saso Dzeroski,et al.  Integrating Guidance into Relational Reinforcement Learning , 2004, Machine Learning.

[12]  William W. Cohen Grammatically Biased Learning: Learning Logic Programs Using an Explicit Antecedent Description Language , 1994, Artif. Intell..

[13]  瀬々 潤,et al.  Traversing Itemset Lattices with Statistical Metric Pruning (小特集 「発見科学」及び一般演題) , 2000 .

[15]  Michael Allen,et al.  Parallel programming: techniques and applications using networked workstations and parallel computers , 1998 .

[16]  John W. Lloyd,et al.  Personalisation for user agents , 2005, AAMAS '05.

[17]  Gunnar Rätsch,et al.  An Introduction to Boosting and Leveraging , 2002, Machine Learning Summer School.

[18]  Avrim Blum Rank-r Decision Trees are a Subclass of r-Decision Lists , 1992, Inf. Process. Lett..

[19]  Peter A. Flach The Geometry of ROC Space: Understanding Machine Learning Metrics through ROC Isometrics , 2003, ICML.

[20]  Peter A. Flach,et al.  Strongly Typed Inductive Concept Learning , 1998, ILP.

[21]  Nello Cristianini,et al.  An Introduction to Support Vector Machines and Other Kernel-based Learning Methods , 2000 .

[22]  Stefan Kramer,et al.  Stochastic Propositionalization of Non-determinate Background Knowledge , 1998, ILP.

[23]  Heekuck Oh,et al.  Neural Networks for Pattern Recognition , 1993, Adv. Comput..

[24]  John Shawe-Taylor,et al.  Structural Risk Minimization Over Data-Dependent Hierarchies , 1998, IEEE Trans. Inf. Theory.

[25]  Stefan Kramer,et al.  A Survey of the Predictive Toxicology Challenge 2000-2001 , 2003, Bioinform..

[26]  Yishay Mansour,et al.  Generalization Bounds for Decision Trees , 2000, COLT.

[27]  John Shawe-Taylor,et al.  The Set Covering Machine , 2003, J. Mach. Learn. Res..

[28]  Eric McCreath,et al.  Improving the learning rate by inducing a transition model , 2004, Proceedings of the Third International Joint Conference on Autonomous Agents and Multiagent Systems, 2004. AAMAS 2004..

[29]  Hiroki Arimura,et al.  Learning Acyclic First-Order Horn Sentences from Entailment , 1997, ALT.

[30]  Wray L. Buntine,et al.  A further comparison of splitting rules for decision-tree induction , 2004, Machine Learning.

[31]  Elaine J. Weyuker,et al.  Computability, complexity, and languages - fundamentals of theoretical computer science , 2014, Computer science and applied mathematics.

[32]  Ryan M. Rifkin,et al.  In Defense of One-Vs-All Classification , 2004, J. Mach. Learn. Res..

[33]  Nick Littlestone,et al.  Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm , 2004, Machine Learning.

[34]  Luc De Raedt,et al.  Logical Settings for Concept-Learning , 1997, Artif. Intell..

[35]  Claude Sammut,et al.  The Origins of Inductive Logic Programming: A Prehistoric Tale , 1993 .

[36]  Johannes Fürnkranz,et al.  ROC ‘n’ Rule Learning—Towards a Better Understanding of Covering Algorithms , 2005, Machine Learning.

[37]  Andrew R. Barron,et al.  Approximation and estimation bounds for artificial neural networks , 2004, Machine Learning.

[38]  Thomas G. Dietterich,et al.  Improved Class Probability Estimates from Decision Tree Models , 2003 .

[39]  William W. Cohen Learning Trees and Rules with Set-Valued Features , 1996, AAAI/IAAI, Vol. 1.

[40]  José Hernández-Orallo,et al.  Learning functional logic classification concepts from databases , 2000, WFLP.

[41]  Yishay Mansour,et al.  On the Boosting Ability of Top-Down Decision Tree Learning Algorithms , 1999, J. Comput. Syst. Sci..

[42]  M J Sternberg,et al.  Structure-activity relationships derived by machine learning: the use of atoms and their bond connectivities to predict mutagenicity by inductive logic programming. , 1996, Proceedings of the National Academy of Sciences of the United States of America.

[43]  Michelangelo Ceci,et al.  Mining Model Trees: A Multi-relational Approach , 2003, ILP.

[44]  John W. Lloyd Logic for learning - learning comprehensible theories from structured data , 2003, Cognitive Technologies.

[45]  Fritz Wysotzki,et al.  A Logical Framework for Graph Theoretical Decision Tree Learning , 1997, ILP.

[46]  Peter Norvig,et al.  Artificial Intelligence: A Modern Approach , 1995 .

[47]  Roni Khardon,et al.  Learning closed horn expressions , 2002 .

[48]  Paul E. Utgoff,et al.  An Improved Algorithm for Incremental Induction of Decision Trees , 1994, ICML.

[49]  Shinichi Morishita,et al.  Transversing itemset lattices with statistical metric pruning , 2000, PODS '00.

[50]  Y T Woo,et al.  Development of structure-activity relationship rules for predicting carcinogenic potential of chemicals. , 1995, Toxicology letters.

[51]  HausslerDavid,et al.  A general lower bound on the number of examples needed for learning , 1989 .

[52]  Ronald L. Rivest,et al.  Learning decision lists , 2004, Machine Learning.

[53]  Stuart L. Crawford Extensions to the CART Algorithm , 1989, Int. J. Man Mach. Stud..

[54]  Hans Ulrich Simon,et al.  General bounds on the number of examples needed for learning probabilistic concepts , 1993, COLT '93.

[55]  Michel Manago,et al.  Knowledge Intensive Induction , 1989, ML.

[56]  Peter A. Flach,et al.  An extended transformation approach to inductive logic programming , 2001, ACM Trans. Comput. Log..

[57]  John Mingers,et al.  An empirical comparison of selection measures for decision-tree induction , 2004, Machine Learning.

[58]  Saso Dzeroski,et al.  Inductive Logic Programming: Techniques and Applications , 1993 .

[59]  Luc De Raedt,et al.  First-Order jk-Clausal Theories are PAC-Learnable , 1994, Artif. Intell..

[60]  Thomas G. Dietterich,et al.  Solving the Multiple Instance Problem with Axis-Parallel Rectangles , 1997, Artif. Intell..

[61]  Ran El-Yaniv,et al.  On Online Learning of Decision Lists , 2002, J. Mach. Learn. Res..

[62]  David Haussler,et al.  Learning decision trees from random examples , 1988, COLT '88.

[63]  J. Lloyd Foundations of Logic Programming , 1984, Symbolic Computation.

[64]  Yishay Mansour,et al.  On the boosting ability of top-down decision tree learning algorithms , 1996, STOC '96.

[65]  John W. Lloyd,et al.  Declarative Programming in Escher , 1995 .

[66]  Stephen Muggleton,et al.  To the international computing community: A new East-West challenge , 1994 .

[67]  Vijay Raghavan,et al.  Monotone term decision lists , 2001, Theor. Comput. Sci..

[68]  Stefan Kramer,et al.  Towards tight bounds for rule learning , 2004, ICML.

[69]  Claude Sammut,et al.  Learning Concepts by Performing Experiments , 1981 .

[70]  Stephen Muggleton,et al.  Machine Invention of First Order Predicates by Inverting Resolution , 1988, ML.

[71]  Kurt Driessens,et al.  Speeding Up Relational Reinforcement Learning through the Use of an Incremental First Order Decision Tree Learner , 2001, ECML.

[72]  Raymond J. Mooney,et al.  Induction of First-Order Decision Lists: Results on Learning the Past Tense of English Verbs , 1995, J. Artif. Intell. Res..

[73]  Reinhard Diestel,et al.  Graph Theory , 1997 .

[74]  Cynthia Rudin,et al.  The Dynamics of AdaBoost: Cyclic Behavior and Convergence of Margins , 2004, J. Mach. Learn. Res..

[75]  Thomas Gärtner,et al.  Multi-Instance Kernels , 2002, ICML.

[76]  Robert E. Schapire,et al.  The strength of weak learnability , 1990, Mach. Learn..

[77]  Peter L. Bartlett,et al.  The Sample Complexity of Pattern Classification with Neural Networks: The Size of the Weights is More Important than the Size of the Network , 1998, IEEE Trans. Inf. Theory.

[78]  Luc De Raedt,et al.  Inductive Logic Programming: Theory and Methods , 1994, J. Log. Program..

[79]  Avrim Blum,et al.  On-line Algorithms in Machine Learning , 1996, Online Algorithms.

[80]  John Langford,et al.  Estimating Class Membership Probabilities using Classifier Learners , 2005, AISTATS.

[81]  Ashwin Srinivasan,et al.  ILP: A Short Look Back and a Longer Look Forward , 2003, J. Mach. Learn. Res..

[82]  Eduardo Sontag VC dimension of neural networks , 1998 .

[83]  Luc De Raedt,et al.  Clausal Discovery , 1997, Machine Learning.

[84]  D. Sanderson,et al.  Computer Prediction of Possible Toxic Action from Chemical Structure; The DEREK System , 1991, Human & experimental toxicology.

[85]  Michael T. Goodrich,et al.  On the Complexity of Optimization Problems for 3-dimensional Convex Polyhedra and Decision Trees , 1997, Comput. Geom..

[86]  John K. Slaney,et al.  Blocks World revisited , 2001, Artif. Intell..

[87]  Yoav Freund,et al.  Boosting the margin: A new explanation for the effectiveness of voting methods , 1997, ICML.

[88]  Jan Ramon,et al.  Multi instance neural networks , 2000, ICML 2000.

[89]  Thomas G. Dietterich,et al.  Applying the Waek Learning Framework to Understand and Improve C4.5 , 1996, ICML.

[90]  Donald E. Knuth,et al.  Literate Programming , 1984, Comput. J..

[91]  John W. Lloyd,et al.  Programming in an Integrated Functional and Logic Language , 1999, J. Funct. Log. Program..

[92]  Gerhard Widmer,et al.  Learning in the presence of concept drift and hidden contexts , 2004, Machine Learning.

[93]  Michael J. Pazzani,et al.  Exploring the Decision Forest: An Empirical Investigation of Occam's Razor in Decision Tree Induction , 1993, J. Artif. Intell. Res..

[94]  Ashwin Srinivasan,et al.  Multi-instance tree learning , 2005, ICML.

[95]  De,et al.  Relational Reinforcement Learning , 2001, Encyclopedia of Machine Learning and Data Mining.

[96]  Vladimir Vapnik,et al.  Chervonenkis: On the uniform convergence of relative frequencies of events to their probabilities , 1971 .

[97]  Ehud Shapiro,et al.  Algorithmic Program Debugging , 1983 .

[98]  John N. Tsitsiklis,et al.  Neuro-Dynamic Programming , 1996, Encyclopedia of Machine Learning.

[99]  Yin-tak Woo,et al.  Chemical Induction of Cancer , 1995, Birkhäuser Boston.

[100]  Kenneth Steiglitz,et al.  Combinatorial Optimization: Algorithms and Complexity , 1981 .

[101]  Hendrik Blockeel,et al.  Top-Down Induction of First Order Logical Decision Trees , 1998, AI Commun..

[102]  Michelangelo Ceci,et al.  Top-down induction of model trees with regression and splitting nodes , 2004, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[103]  Saso Dzeroski,et al.  Inductive logic programming and learnability , 1994, SGAR.

[104]  Martijn van Otterlo Thesis review: Relational Reinforcement Learning / by Kurt Driessens. - Thesis Katholieke Universiteit Leuven , 2004 .

[105]  Luc De Raedt,et al.  How to Upgrade Propositional Learners to First Order Logic: A Case Study , 2001, Machine Learning and Its Applications.

[106]  Larry A. Rendell,et al.  Learning Structural Decision Trees from Examples , 1991, IJCAI.

[107]  Allen Newell,et al.  Computer science as empirical inquiry: symbols and search (1976) , 1989 .

[108]  Peter L. Bartlett,et al.  Model Selection and Error Estimation , 2000, Machine Learning.

[109]  Joshua Goodman An Incremental Decision List Learner , 2002, EMNLP.

[110]  Ronald L. Rivest,et al.  Constructing Optimal Binary Decision Trees is NP-Complete , 1976, Inf. Process. Lett..

[111]  Thomas Gärtner,et al.  Kernels and Distances for Structured Data , 2004, Machine Learning.

[112]  Yasuhiko Morimoto,et al.  Constructing Efficient Decision Trees by Using Optimized Numeric Association Rules , 1996, VLDB.

[113]  Roni Khardon,et al.  Complexity parameters for first order classes , 2006, Machine Learning.

[114]  Umesh V. Vazirani,et al.  An Introduction to Computational Learning Theory , 1994 .

[115]  Paul E. Utgoff,et al.  ID5: An Incremental ID3 , 1987, ML.

[116]  Leslie G. Valiant,et al.  A general lower bound on the number of examples needed for learning , 1988, COLT '88.

[117]  Michael Winikoff,et al.  Learning Within the BDI Framework: An Empirical Analysis , 2005, KES.

[118]  Michael Kearns,et al.  Boosting Theory Towards Practice: Recent Developments in Decision Tree Induction and the Weak Learning Framework , 1996, AAAI/IAAI, Vol. 2.

[119]  Tuomas Sandholm,et al.  What should be minimized in a decision tree: A re-examination , 1995 .

[120]  Xiaobing Wu Knowledge Representation and Inductive Learning with XML , 2004, IEEE/WIC/ACM International Conference on Web Intelligence (WI'04).

[121]  Jorg-uwe Kietz,et al.  Controlling the Complexity of Learning in Logic through Syntactic and Task-Oriented Models , 1992 .

[122]  L. D. Raedt Interactive theory revision: an inductive logic programming approach , 1992 .

[123]  Peter A. Flach,et al.  Propositionalization approaches to relational data mining , 2001 .

[124]  Peter L. Bartlett,et al.  Neural Network Learning - Theoretical Foundations , 1999 .

[125]  Judy Kay,et al.  IEMS - The Intelligent Email Sorter , 2002, ICML.

[126]  William W. Cohen Pac-Learning Recursive Logic Programs: Efficient Algorithms , 1994, J. Artif. Intell. Res..

[127]  Tamás Horváth,et al.  Learning logic programs with structured background knowledge , 2001, Artif. Intell..

[128]  Stefan Kramer,et al.  Structural Regression Trees , 1996, AAAI/IAAI, Vol. 1.

[129]  M. Anthony Decision lists and threshold decision lists , 2002 .

[130]  Ashwin Srinivasan,et al.  Theories for Mutagenicity: A Study in First-Order and Feature-Based Induction , 1996, Artif. Intell..

[131]  Nader H. Bshouty,et al.  On the proper learning of axis-parallel concepts , 2003 .

[132]  Stephen Muggleton,et al.  Efficient Induction of Logic Programs , 1990, ALT.

[133]  Kerry Lea Taylor Autonomous learning by incremental induction and revision , 1996 .

[134]  John Shawe-Taylor,et al.  The Decision List Machine , 2002, NIPS.

[135]  Wray L. Buntine,et al.  A theory of learning classification rules , 1990 .

[136]  Yoav Freund,et al.  A decision-theoretic generalization of on-line learning and an application to boosting , 1995, EuroCOLT.

[137]  David S. Johnson,et al.  Computers and Intractability: A Guide to the Theory of NP-Completeness , 1978 .

[138]  Avrim Blum Learning boolean functions in an infinite attribute space , 1990, STOC '90.

[139]  John W. Lloyd,et al.  Classification of Individuals with Complex Structure , 2000, ICML.

[140]  Fritz Wysotzki,et al.  Learning Relational Concepts with Decision Trees , 1996, ICML.

[141]  Geoffrey J McLachlan,et al.  Selection bias in gene extraction on the basis of microarray gene-expression data , 2002, Proceedings of the National Academy of Sciences of the United States of America.

[142]  Ivan Bratko,et al.  First Order Regression , 1997, Machine Learning.

[143]  Peter L. Bartlett,et al.  Rademacher and Gaussian Complexities: Risk Bounds and Structural Results , 2003, J. Mach. Learn. Res..

[144]  Takashi Okada,et al.  Characteristic Substructures and Properties in Chemical Carcinogens Studied by the Cascade Model , 2003, Bioinform..

[145]  Allen Newell,et al.  Computer science as empirical inquiry: symbols and search , 1976, CACM.

[146]  Roni Khardon,et al.  Learning Function-Free Horn Expressions , 1999, Machine Learning.

[147]  Eric R. Ziegel,et al.  The Elements of Statistical Learning , 2003, Technometrics.

[148]  Leslie G. Valiant,et al.  A theory of the learnable , 1984, STOC '84.

[149]  J. R. Quinlan Learning With Continuous Classes , 1992 .

[150]  Rajkumar Buyya,et al.  High Performance Cluster Computing: Architectures and Systems , 1999 .

[151]  Usama M. Fayyad,et al.  What Should Be Minimized in a Decision Tree? , 1990, AAAI.

[153]  Peter Dayan,et al.  Q-learning , 1992, Machine Learning.

[154]  Dana Ron,et al.  An experimental and theoretical comparison of model selection methods , 1995, COLT '95.

[155]  Peter Auer,et al.  On Learning From Multi-Instance Examples: Empirical Evaluation of a Theoretical Approach , 1997, ICML.

[156]  Claude Sammut,et al.  LEARNING CONCEPTS BY ASKING QUESTIONS , 1998 .

[157]  Manfred K. Warmuth,et al.  Relating Data Compression and Learnability , 2003 .

[158]  Peter Auer,et al.  Theory and Applications of Agnostic PAC-Learning with Small Decision Trees , 1995, ICML.

[159]  Alessandro Giuliani,et al.  Putting the Predictive Toxicology Challenge Into Perspective: Reflections on the Results , 2003, Bioinform..

[160]  Stefan Kramer Relational learning vs. propositionalization: Investigations in inductive logic programming and propositional machine learning , 2000 .

[161]  Donald E. Knuth The Dangers of Computer-Science Theory , 1973 .

[162]  Michael Schmitt,et al.  Exact VC-Dimension of Boolean Monomials , 1996, Inf. Process. Lett..

[163]  Thomas G. Dietterich,et al.  Solving Multiclass Learning Problems via Error-Correcting Output Codes , 1994, J. Artif. Intell. Res..

[164]  Yoav Freund,et al.  The Alternating Decision Tree Learning Algorithm , 1999, ICML.

[165]  Stephen Muggleton,et al.  A Learnability Model for Universal Representations and Its Application to Top-down Induction of Decision Trees , 1995, Machine Intelligence 15.

[166]  Raymond J. Mooney,et al.  First-Order Theory Revision , 1991, ML.

[167]  Shan-Hwei Nienhuys-Cheng,et al.  Foundations of Inductive Logic Programming , 1997, Lecture Notes in Computer Science.

[168]  John Langford,et al.  Quantitatively tight sample complexity bounds , 2002 .

[169]  Paul E. Utgoff,et al.  Incremental Induction of Decision Trees , 1989, Machine Learning.

[170]  Llew Mason,et al.  Margins and combined classifiers , 1999 .

[171]  John W. Lloyd,et al.  Symbolic Learning for Adaptive Agents , 2003 .

[172]  Geoffrey I. Webb Further Experimental Evidence against the Utility of Occam's Razor , 1996, J. Artif. Intell. Res..

[173]  Peter L. Bartlett,et al.  Generalization in Decision Trees and DNF: Does Size Matter? , 1997, NIPS.

[174]  Saso Dzeroski,et al.  PAC-learnability of determinate logic programs , 1992, COLT '92.

[175]  J. Ross Quinlan,et al.  Bagging, Boosting, and C4.5 , 1996, AAAI/IAAI, Vol. 1.

[176]  Stefan Kramer,et al.  Inducing classification and regression trees in first order logic , 2001 .

[177]  M. Cameron-Jones,et al.  First Order Learning, Zeroth Order Data , 1993 .

[178]  Robert E. Schapire,et al.  Efficient distribution-free learning of probabilistic concepts , 1990, Proceedings [1990] 31st Annual Symposium on Foundations of Computer Science.

[179]  John W. Lloyd,et al.  Modal Higher-order Logic for Agents , 2004 .

[180]  Thomas P. Hayes,et al.  Reductions Between Classification Tasks , 2004, Electron. Colloquium Comput. Complex..

[181]  Norman Ramsey,et al.  Literate programming simplified , 1994, IEEE Software.

[182]  Richard S. Sutton,et al.  Reinforcement Learning: An Introduction , 1998, IEEE Trans. Neural Networks.

[183]  Corinna Cortes,et al.  Boosting Decision Trees , 1995, NIPS.

[184]  Sylvie Thiébaux,et al.  Exploiting First-Order Regression in Inductive Policy Selection , 2004, UAI.

[185]  Alonzo Church,et al.  A formulation of the simple theory of types , 1940, Journal of Symbolic Logic.

[186]  Dana Ron,et al.  An Experimental and Theoretical Comparison of Model Selection Methods , 1995, COLT '95.

[187]  Alexander J. Smola,et al.  Learning with kernels , 1998 .

[188]  William W. Cohen Pac-learning Recursive Logic Programs: Negative Results , 1994, J. Artif. Intell. Res..

[189]  David Haussler,et al.  Decision Theoretic Generalizations of the PAC Model for Neural Net and Other Learning Applications , 1992, Inf. Comput..