Using Meta-mining to Support Data Mining Workflow Planning and Optimization

Knowledge Discovery in Databases (KDD) is a complex process that involves many different data processing and learning operators. Today's Knowledge Discovery Support Systems can contain several hundred operators. A major challenge is to assist the user in designing workflows that are not only valid but also, ideally, optimize some performance measure associated with the user's goal. In this paper we present a system that addresses this challenge. It relies on a meta-mining module that analyses past data mining experiments and extracts meta-mining models, which associate dataset characteristics with workflow descriptors with a view to optimizing workflow performance. The meta-mining model is used within a data mining workflow planner to guide planning. We learn the meta-mining models using a similarity learning approach, and we extract the workflow descriptors by mining the workflows for generalized relational patterns that also account for domain knowledge provided by a data mining ontology. We evaluate the quality of the workflows that the system produces on a collection of real-world biological datasets and show that they are significantly better than those of alternative methods that can only select workflows, not plan them.
