Distributed, Collaborative Data Analysis from Heterogeneous Sites Using a Scalable Evolutionary Technique

This paper documents an early effort to develop an experimental, collaborative data analysis technique for learning classifiers from a collection of heterogeneous datasets distributed over a network. The proposed technique uses a scalable evolutionary algorithm, the Gene Expression Messy Genetic Algorithm (GEMGA), to classify datasets. The paper describes the technique and reports results from applying it in several domains, including distributed fault detection in an electrical power distribution network.
