An A-Team Approach to Learning Classifiers from Distributed Data Sources

Distributed data mining is an important research area whose task is to analyse data held in different sources. Solving such tasks requires special approaches and tools, different from those dedicated to analysing data located in a single database. This paper presents an approach to learning classifiers from distributed data that is based on data reduction (prototype selection) at the local level. In this case, the aim of data reduction is to obtain a compact representation of the distributed data repositories that contains only non-redundant information, in the form of so-called prototypes. The approach has been implemented using the JABAT environment, which is in turn an implementation of the A-Team concept. The paper includes a general overview of JABAT, the problem formulation, and a description of the proposed solution, in which the global classifier is induced from prototypes selected from the distributed datasets during data reduction at the local level. Finally, the results of a computational experiment validating the approach are shown. The results indicate that the proposed classifier can produce very good classification results.
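
The two-level scheme described above (reduce each local dataset to prototypes, then induce one global classifier from the pooled prototypes) can be illustrated with a minimal sketch. It does not reproduce the paper's agent-based reduction or the JABAT environment: Hart's condensed nearest neighbour rule is swapped in as a simple stand-in for prototype selection, and a 1-NN rule stands in for the global classifier. All function names, site names, and the toy data are hypothetical.

```python
from math import dist  # Euclidean distance, Python 3.8+


def nearest_label(prototypes, x):
    """Return the label of the prototype closest to x (1-NN rule)."""
    return min(prototypes, key=lambda p: dist(p[0], x))[1]


def select_prototypes(local_data):
    """Local-level data reduction (stand-in, NOT the paper's algorithm).

    Condensed nearest neighbour: an instance is kept only if the
    prototypes kept so far would misclassify it.
    """
    prototypes = [local_data[0]]
    changed = True
    while changed:
        changed = False
        for x, y in local_data:
            # The membership check guarantees termination even when the
            # same point appears with conflicting labels.
            if (x, y) not in prototypes and nearest_label(prototypes, x) != y:
                prototypes.append((x, y))
                changed = True
    return prototypes


def learn_global_classifier(sites):
    """Pool the prototypes of all sites and build a 1-NN classifier."""
    pooled = [p for site in sites for p in select_prototypes(site)]
    return pooled, (lambda x: nearest_label(pooled, x))


# Hypothetical toy data: two sites, each holding (features, label) pairs.
site_a = [((0.0, 0.0), "neg"), ((0.1, 0.2), "neg"),
          ((1.0, 1.0), "pos"), ((0.9, 1.1), "pos")]
site_b = [((0.2, 0.1), "neg"), ((1.1, 0.9), "pos"), ((1.0, 1.2), "pos")]

pooled, classify = learn_global_classifier([site_a, site_b])
```

Only the selected prototypes leave each site, so the global learner sees a reduced set (here 4 of the 7 instances) rather than the raw local data.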

[27]  Piotr Jedrzejowicz, et al. An A-Team Approach to Learning Classifiers from Distributed Data Sources, 2008, KES-AMSTA.
