An Empirical Investigation Into Deep and Shallow Rule Learning

Inductive rule learning is arguably among the most traditional paradigms in machine learning. Although we have seen considerable progress over the years in learning rule-based theories, all state-of-the-art learners still learn descriptions that directly relate the input features to the target concept. In the simplest case, concept learning, this is a disjunctive normal form (DNF) description of the positive class. From a logical point of view this is clearly sufficient, because every logical expression can be reduced to an equivalent DNF expression. Nevertheless, more structured representations, which form deep theories by introducing intermediate concepts, could be easier to learn, in much the same way as deep neural networks are able to outperform shallow networks even though the latter are also universal function approximators. However, several non-trivial obstacles need to be overcome before a sufficiently powerful deep rule learning algorithm can be developed and compared to the state of the art in inductive rule learning. In this paper, we therefore take a different approach: we empirically compare deep and shallow rule sets that have been optimized with a uniform, general mini-batch-based optimization algorithm. In our experiments on both artificial and real-world benchmark data, deep rule networks outperformed their shallow counterparts, which we take as an indication that it is worthwhile to devote more effort to learning deep rule structures from data.
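To make the contrast between shallow and deep theories concrete, the following minimal sketch compares two equivalent representations of one Boolean concept. The choice of n-bit parity as the target concept and all function names (`parity_dnf_terms`, `eval_dnf`, `eval_deep`) are illustrative assumptions, not taken from the abstract, and the code is not the optimization algorithm used in the paper. It only shows that a flat DNF may need many long conjunctions where a theory with intermediate concepts needs a few short rules.

```python
from itertools import product


def parity_dnf_terms(n):
    """Shallow theory: one full-length conjunction per positive example of n-bit parity."""
    return [bits for bits in product((False, True), repeat=n) if sum(bits) % 2 == 1]


def eval_dnf(terms, x):
    """A DNF fires if at least one conjunction matches the input on every feature."""
    return any(all(xi == ti for xi, ti in zip(x, term)) for term in terms)


def eval_deep(x):
    """Deep theory: a chain of intermediate concepts (hypothetical structure).

    Each intermediate concept is defined by two short rules over the
    previous concept and one further input feature.
    """
    concept = x[0]
    for xi in x[1:]:
        # Rule 1: previous concept holds and feature is false.
        # Rule 2: previous concept fails and feature is true.
        concept = (concept and not xi) or (not concept and xi)
    return concept


n = 5
terms = parity_dnf_terms(n)
deep_rule_count = 2 * (n - 1)  # two rules per intermediate concept
agree = all(
    eval_dnf(terms, x) == eval_deep(x)
    for x in product((False, True), repeat=n)
)
print(len(terms), deep_rule_count, agree)
```

Under these assumptions the flat description grows exponentially with the number of features, while the layered one grows linearly, even though both describe exactly the same concept.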
