A Knowledge Discovery System with Support for Model Selection and Visualization

The process of knowledge discovery in databases consists of several steps that are iterative and interactive. In each application, the user must work through this process by trying different algorithms and parameter settings, which usually yield multiple competing models. Model selection, that is, the selection of appropriate models or of the algorithms that produce them, requires meta-knowledge about algorithms, models, and model performance metrics. It is therefore usually a difficult task for the user. We believe that simplifying model selection for the user is crucial to the success of real-life knowledge discovery activities. In contrast to most related work, which aims to automate model selection, we view model selection as a semiautomatic process that requires effective collaboration between the user and the discovery system. To support this collaboration, our solution gives the user the ability to try various alternatives and to compare competing models quantitatively, by performance metrics, and qualitatively, by effective visualization. This paper presents our research on model selection and visualization in the development of a knowledge discovery system called D2MS. The paper addresses the motivation for model selection in knowledge discovery and related work, gives an overview of D2MS, and describes its solution to model selection and visualization. It then demonstrates the usefulness of D2MS model selection in two case studies of discovering medical knowledge in hospital data, on meningitis and stomach cancer, using three data mining methods: decision trees, conceptual clustering, and rule induction.
