Cluster-based instance selection for machine classification

Instance selection in supervised machine learning, often referred to as data reduction, aims at deciding which instances from the training set should be retained for use during the learning process. Instance selection can improve the generalization of the learning model, shorten the learning process, and help in scaling up to large data sources. This paper proposes a cluster-based instance selection approach, with the learning process executed by a team of agents, and discusses its four variants. The basic assumption is that instance selection is carried out after the training data have been grouped into clusters. To validate the proposed approach and to investigate how the choice of clustering method influences classification quality, a computational experiment has been carried out.
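
As a rough sketch of the general idea, the fragment below clusters the training data with k-means and retains, for each cluster, the instance closest to the centroid. This is an illustration only, assuming scikit-learn and NumPy; the function name select_by_clusters and its parameters are hypothetical, and the paper's agent-based variants are more elaborate than this single-representative rule.

import numpy as np
from sklearn.cluster import KMeans

def select_by_clusters(X, y, n_clusters=10, random_state=0):
    """Cluster the training set, then keep one representative per
    cluster: the instance nearest the cluster centroid.
    Illustrative sketch, not the paper's agent-based procedure."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state)
    labels = km.fit_predict(X)
    keep = []
    for c in range(n_clusters):
        members = np.where(labels == c)[0]
        if members.size == 0:
            continue  # k-means can, in principle, leave a cluster empty
        # Distance from each member instance to its cluster centroid.
        dists = np.linalg.norm(X[members] - km.cluster_centers_[c], axis=1)
        keep.append(members[np.argmin(dists)])
    keep = np.asarray(keep)
    return X[keep], y[keep]

The reduced set returned by such a routine can be passed to any classifier; the variants studied in the paper differ, among other things, in the clustering method applied before selection.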
