Integrating the outputs of multiple classifiers via combiners or meta-learners has led to substantial improvements in several difficult pattern recognition problems. In the typical setting investigated to date, each classifier is trained on data taken or resampled from a common data set, or from randomly selected partitions thereof, and thus experiences a similar quality of training data. In distributed data mining involving heterogeneous databases, however, the nature, quality and quantity of data available to each site/classifier may vary substantially, leading to large discrepancies in their performance. In this chapter we introduce and investigate a family of meta-classifiers based on order statistics for robust handling of such cases. Based on a mathematical model of how order statistic combiners affect the decision boundaries, we derive expressions for the error reductions expected when such combiners are used. We show analytically that selecting the median, the maximum and, in general, the i-th order statistic improves classification performance. Furthermore, we introduce the trim and spread combiners, both based on linear combinations of the ordered classifier outputs, and show empirically that they are significantly superior in the presence of outliers or uneven classifier performance. They can therefore be fruitfully applied to many heterogeneous distributed data mining situations, especially when it is not practical or feasible to pool all the data in a common data warehouse before attempting to analyze it.

1 Mining of Distributed Data Sources

An implicit assumption in traditional statistical pattern recognition and machine learning algorithms is that the data to be used for model development is available as a single flat file. This assumption is valid for virtually all popular benchmark datasets, such as those available from ELENA, Statlog or the UCI machine learning repository. Such datasets are small or medium sized, requiring a few megabytes at most. Thus the algorithms typically also assume that the entire data set fits in main memory, and do not address computational issues regarding scalability and “out-of-core” operations.

The tremendous growth in data gathering and warehousing over the past few years has generated very large and complex databases. Any effort to mine information from such databases has to address the facts that (i) data may be kept in several files, as in interlinked relational databases, and the information needed for decision making may be spread over more than one file (for example, the concept of “collective data mining” [Kargupta and Park, 2000] explicitly addresses “vertical partitioning” situations where the features or variables relevant to a classification decision are spread over multiple files, each accessible to only one classifier); (ii) the files may be spread across several disks or even across different geographical locations; and (iii) the statistical quality of the data may vary widely (for example, the percentage of cases involving financial or health-care fraud varies across regions, and so does the amount of missing information). One can argue that by transferring all the data to a single warehouse and performing a series of merges and joins, one can obtain a single (albeit very large) flat file, to which a traditional algorithm can be applied after randomization and subsampling. In real applications, however, this approach may not be feasible because of the computational, bandwidth and storage costs.
In certain cases, pooling the data may not even be possible, for a variety of practical reasons including security, privacy, the proprietary nature of the data, the need for fault-tolerant distribution of data and services, real-time processing requirements, and statutory constraints imposed by law [Prodromidis et al., 2000]. There are then two options. If the owners of the individual databases are willing to provide high-level or summary information/decisions, such as local classification estimates, and transmit this information to a central location, then a meta-learner can be applied to the component decisions to produce a final, composite decision. Note that such high-level information not only has reduced storage and bandwidth requirements, but also maintains the privacy of individual records [DuMouchel et al., 1999]. Otherwise one has to resort to a distributed computing framework such as the emerging field of COllective INtelligence (COIN), wherein techniques are developed so that local and independent computations can still increase a desired global utility function [Wolpert and Tumer, 1999]. The first option leads to several issues reminiscent of studies in decision fusion [Dasarathy, 1994], applied largely to multi-sensor fusion and distributed control problems. It is also related to the theory of
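To make the first option concrete, the sketch below shows how a central site might combine the class-posterior estimates transmitted by the individual classifiers using order statistics. It is a minimal illustration rather than the chapter's exact formulation: the function names are hypothetical, and the particular weightings chosen for the trim and spread combiners (a symmetrically trimmed mean and the midrange of the ordered outputs, respectively) are assumptions made here for concreteness.

```python
import numpy as np

def order_statistic_combiner(outputs, k):
    """Select the k-th order statistic (0-indexed, ascending) of the
    classifier outputs for each class: k = N // 2 gives the median,
    k = N - 1 the maximum."""
    ordered = np.sort(np.asarray(outputs), axis=0)  # sort each class column
    return ordered[k]

def trim_combiner(outputs, trim=1):
    """Illustrative 'trim' combiner: average the ordered outputs after
    discarding the `trim` lowest and `trim` highest values per class,
    which suppresses outlying classifiers."""
    ordered = np.sort(np.asarray(outputs), axis=0)
    return ordered[trim:len(outputs) - trim].mean(axis=0)

def spread_combiner(outputs):
    """Illustrative 'spread' combiner: an equally weighted linear
    combination of the two extreme order statistics (the midrange)."""
    ordered = np.sort(np.asarray(outputs), axis=0)
    return 0.5 * (ordered[0] + ordered[-1])

# Five sites transmit posterior estimates for three classes; the last
# site deviates, e.g. because its local data are of poorer quality.
local_estimates = [
    [0.70, 0.20, 0.10],
    [0.65, 0.25, 0.10],
    [0.60, 0.30, 0.10],
    [0.75, 0.15, 0.10],
    [0.20, 0.65, 0.15],   # deviant site
]

for name, combined in [
    ("median", order_statistic_combiner(local_estimates, k=2)),
    ("max",    order_statistic_combiner(local_estimates, k=4)),
    ("trim",   trim_combiner(local_estimates, trim=1)),
    ("spread", spread_combiner(local_estimates)),
]:
    print(f"{name:6s} {np.round(combined, 3)} -> class {int(np.argmax(combined))}")
```

Here all four combiners side with the majority of the sites despite the deviant estimate; the chapter's analysis quantifies when such selection and trimming of the ordered outputs outperform plain averaging, particularly in the presence of outliers or uneven classifier performance.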
References

[1] A. E. Sarhan, et al. Estimation of Location and Scale Parameters by Order Statistics from Singly and Doubly Censored Samples, 1956.
[2] N. S. Barnett, et al. Private communication, 1969.
[3] G. Shepherd. The Synaptic Organization of the Brain, 1979.
[4] Jeffrey A. Barnett, et al. Computational Methods for a Mathematical Theory of Evidence, 1981, IJCAI.
[5] O. G. Selfridge, et al. Pandemonium: a paradigm for learning, 1988.
[6] Jude W. Shavlik, et al. Training Knowledge-Based Neural Networks to Recognize Genes, 1990, NIPS.
[7] Bruce W. Suter, et al. The multilayer perceptron as an approximation to a Bayes optimal discriminant function, 1990, IEEE Trans. Neural Networks.
[8] Marvin Minsky, et al. Logical Versus Analogical or Symbolic Versus Connectionist or Neat Versus Scruffy, 1991, AI Mag.
[9] Richard Lippmann, et al. Neural Network Classifiers Estimate Bayesian a posteriori Probabilities, 1991, Neural Computation.
[10] L. Cooper, et al. When Networks Disagree: Ensemble Methods for Hybrid Neural Networks, 1992.
[11] David H. Wolpert, et al. Stacked generalization, 1992, Neural Networks.
[12] Elie Bienenstock, et al. Neural Networks and the Bias/Variance Dilemma, 1992, Neural Computation.
[13] Joydeep Ghosh, et al. A neural network based hybrid system for detection, characterization, and classification of short-duration oceanic signals, 1992.
[14] Sherif Hashem, Bruce Schmeiser. Approximating a Function and its Derivatives Using MSE-Optimal Linear Combinations of Trained Feedforward Neural Networks, 1993.
[15] D. Farnsworth. A First Course in Order Statistics, 1993.
[16] M. Perrone. Improving regression estimation: Averaging methods for variance reduction with extensions to general convex measure optimization, 1993.
[17] Harris Drucker, et al. Boosting and Other Ensemble Methods, 1994, Neural Computation.
[18] Belur V. Dasarathy, et al. Decision fusion, 1994.
[19] Anders Krogh, et al. Neural Network Ensembles, Cross Validation, and Active Learning, 1994, NIPS.
[20] Lutz Prechelt, et al. PROBEN 1 - a set of benchmarks and benchmarking rules for neural network training algorithms, 1994.
[21] Kumpati S. Narendra, et al. Adaptation and learning using multiple models, switching, and tuning, 1995.
[22] J. Aggarwal, et al. A Comparative Study of Three Paradigms for Object Recognition - Bayesian Statistics, Neural Networks and Expert Systems, 1996.
[23] Kagan Tumer, et al. Analysis of decision boundaries in linearly combined neural classifiers, 1996, Pattern Recognit.
[24] Narendra Ahuja, et al. Advances in Image Understanding: A Festschrift for Azriel Rosenfeld, 1996.
[25] Thomas G. Dietterich. What is machine learning?, 2020, Archives of Disease in Childhood.
[26] Michael J. Pazzani, et al. Combining Neural Network Regression Estimates with Regularized Linear Weights, 1996, NIPS.
[27] Kagan Tumer, et al. Error Correlation and Error Reduction in Ensemble Classifiers, 1996, Connect. Sci.
[28] F. Provost. A Survey of Methods for Scaling Up Inductive Learning Algorithms, 1997.
[29] Joydeep Ghosh, et al. Hybrid intelligent architecture and its application to water reservoir control, 1997.
[30] Yoav Freund, et al. Boosting the margin: A new explanation for the effectiveness of voting methods, 1997, ICML.
[31] Erkki Oja, et al. Neural and statistical classifiers - taxonomy and two case studies, 1997, IEEE Trans. Neural Networks.
[32] Paul S. Bradley, et al. Refining Initial Points for K-Means Clustering, 1998, ICML.
[33] John Shawe-Taylor, et al. Generalization Performance of Support Vector Machines and Other Pattern Classifiers, 1999.
[34] Tim Oates, et al. Large Datasets Lead to Overly Complex Models: An Explanation and a Solution, 1998, KDD.
[35] D. Obradovic, et al. Combining Artificial Neural Nets, 1999, Perspectives in Neural Computing.
[36] Kagan Tumer, et al. Linear and Order Statistics Combiners for Pattern Classification, 1999, ArXiv.
[37] Joydeep Ghosh, et al. Structurally adaptive modular networks for nonstationary environments, 1999, IEEE Trans. Neural Networks.
[38] J. Ross Quinlan, et al. Simplifying decision trees, 1987, Int. J. Hum. Comput. Stud.
[39] Theodore Johnson, et al. Squashing flat files flatter, 1999, KDD '99.
[40] Daryl E. Hershberger, et al. Collective Data Mining: a New Perspective toward Distributed Data Mining, in Advances in Distributed Data Mining, 1999.
[41] Foster Provost, et al. Distributed Data Mining: Scaling up and beyond, 2000.
[42] Leo Breiman, et al. Bagging Predictors, 1996, Machine Learning.