Using Instance-Level Meta-Information to Facilitate a More Principled Approach to Machine Learning

Michael Reed Smith
Department of Computer Science, BYU
Doctor of Philosophy

As the capability for capturing and storing data increases and becomes more ubiquitous, an increasing number of organizations are looking to use machine learning techniques as a means of understanding and leveraging their data. However, the success of applying machine learning techniques depends on which learning algorithm is selected, the hyperparameters that are provided to the selected learning algorithm, and the data that is supplied to the learning algorithm. Even among machine learning experts, selecting an appropriate learning algorithm, setting its hyperparameters, and preprocessing the data are challenging tasks that are generally left to the expertise of an experienced practitioner, intuition, trial and error, or another heuristic approach.

This dissertation proposes a more principled, data-driven approach for applying machine learning techniques by examining how the learning algorithm, the hyperparameters, and the data interact with each other. Specifically, this dissertation examines the properties of the training data and proposes techniques for integrating this information into the learning process and for preprocessing the training set. It also proposes techniques and tools for selecting a learning algorithm and setting its hyperparameters. The dissertation comprises a collection of papers that address understanding the data used in machine learning and the relationship between the data, the performance of a learning algorithm, and the learning algorithm's associated hyperparameter settings.

Contributions of this dissertation include:

• Instance hardness, which measures how difficult an instance is to classify correctly (a minimal sketch follows this list).
• Hardness measures that characterize why an instance may be misclassified.
• Several techniques for integrating instance hardness into the learning process. These techniques demonstrate the importance of considering each instance individually rather than performing a global optimization that treats all instances equally (see the second sketch below).
• Large-scale evaluations of the investigated techniques across a large number of data sets and learning algorithms, providing more robust results that are less likely to be affected by noise.
• The Machine Learning Results Repository, a repository for storing the results of machine learning experiments at the instance level (the prediction for each instance is stored), from which many data set-level measures, such as accuracy, precision, and recall, can be calculated (see the final sketch below). These results can be used to better understand the interaction between the data, the learning algorithms, and the associated hyperparameters. Further, the repository is designed as a tool for the community, where data can be downloaded and uploaded to follow the development of machine learning algorithms and applications.
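The following is a minimal sketch of the instance-hardness idea, assuming scikit-learn is available; the particular learners, data set, and cross-validation settings are illustrative choices, not the dissertation's exact experimental setup. It estimates the hardness of each instance as the fraction of a diverse set of learning algorithms that misclassify it out of sample, and also computes one example hardness measure, k-Disagreeing Neighbors (kDN): the fraction of an instance's k nearest neighbors that belong to a different class.

```python
# Sketch: estimate instance hardness as the fraction of a diverse set of
# learning algorithms that misclassify each instance (out-of-sample).
# Assumes scikit-learn; the learners and data set are illustrative only.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier, NearestNeighbors
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

learners = [
    DecisionTreeClassifier(random_state=0),
    KNeighborsClassifier(n_neighbors=5),
    GaussianNB(),
    LogisticRegression(max_iter=1000),
]

# Out-of-sample prediction for every instance from every learner.
predictions = np.array(
    [cross_val_predict(clf, X, y, cv=10) for clf in learners]
)

# Instance hardness: fraction of learners that misclassify the instance.
hardness = (predictions != y).mean(axis=0)

# One example hardness measure, k-Disagreeing Neighbors (kDN): the
# fraction of an instance's k nearest neighbors with a different class.
nn = NearestNeighbors(n_neighbors=6).fit(X)  # 6 = the instance itself + 5
_, idx = nn.kneighbors(X)
kdn = (y[idx[:, 1:]] != y[:, None]).mean(axis=1)

print("hardest instances:", np.argsort(hardness)[::-1][:5])
print("highest kDN      :", np.argsort(kdn)[::-1][:5])
```

Ensemble-based hardness indicates *that* an instance is hard to classify; measures such as kDN help characterize *why*, for example because the instance lies in a region dominated by another class.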

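One way instance-level information can enter the learning process, sketched below under the same assumptions, is to weight or filter training instances by their estimated hardness rather than fitting all instances with equal weight. The 0.5 filtering threshold is a hypothetical choice for illustration.

```python
# Sketch: integrate instance hardness into training by down-weighting
# hard (likely noisy or borderline) instances rather than fitting all
# instances equally. Assumes X, y, and hardness from the previous sketch.
from sklearn.tree import DecisionTreeClassifier

# Soft integration: easy instances receive full weight, hard ones less.
weights = 1.0 - hardness
weighted_tree = DecisionTreeClassifier(random_state=0)
weighted_tree.fit(X, y, sample_weight=weights)

# Hard integration: filter out instances misclassified by most learners.
keep = hardness < 0.5  # illustrative threshold
filtered_tree = DecisionTreeClassifier(random_state=0)
filtered_tree.fit(X[keep], y[keep])
```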

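Because the Machine Learning Results Repository stores results at the instance level, data set-level measures can be derived after the fact without rerunning experiments. The sketch below illustrates the principle only; the record layout is hypothetical, not the repository's actual schema.

```python
# Sketch: derive data set-level measures from instance-level results.
# The record layout (data set, algorithm, instance id, true label,
# predicted label) is hypothetical, not the repository's actual schema.
from sklearn.metrics import accuracy_score, precision_score, recall_score

records = [
    ("iris", "C4.5", 0, 0, 0),
    ("iris", "C4.5", 1, 0, 1),
    ("iris", "C4.5", 2, 1, 1),
    ("iris", "C4.5", 3, 1, 1),
]

y_true = [r[3] for r in records]
y_pred = [r[4] for r in records]
print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred, average="macro"))
print("recall   :", recall_score(y_true, y_pred, average="macro"))
```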