Integrating the outputs of multiple classifiers via combiners or meta-learners has led to substantial improvements in several difficult pattern recognition problems. In the typical setting investigated to date, each classifier is trained on data taken or resampled from a common data set, or from randomly selected partitions thereof, and thus experiences a similar quality of training data. In distributed data mining involving heterogeneous databases, however, the nature, quality and quantity of data available to each site/classifier may vary substantially, leading to large discrepancies in their performance. In this chapter we introduce and investigate a family of meta-classifiers based on order statistics for robust handling of such cases. Based on a mathematical model of how order statistic combiners affect the decision boundaries, we derive expressions for the error reductions expected when such combiners are used. We show analytically that selecting the median, the maximum and, in general, the i-th order statistic improves classification performance. Furthermore, we introduce the trim and spread combiners, both based on linear combinations of the ordered classifier outputs, and show empirically that they are significantly superior in the presence of outliers or uneven classifier performance. They can therefore be fruitfully applied to many heterogeneous distributed data mining situations, especially when it is not practical or feasible to pool all the data in a common data warehouse before attempting to analyze it.

1 Mining of Distributed Data Sources

An implicit assumption in traditional statistical pattern recognition and machine learning algorithms is that the data to be used for model development is available as a single flat file. This assumption is valid for virtually all popular benchmark datasets, such as those available from ELENA, Statlog or the UCI machine learning repository. Such datasets are small or medium sized, requiring a few megabytes at most. Thus the algorithms typically also assume that the entire data set fits in main memory, and do not address computational issues regarding scalability and “out-of-core” operations.

The tremendous growth in data gathering and warehousing over the past few years has generated very large and complex databases. Any effort to mine information from such databases has to address the facts that (i) data may be kept in several files, as in interlinked relational databases, and the information needed for decision making may be spread over more than one file (for example, the concept of “collective data mining” [Kargupta and Park, 2000] explicitly addresses “vertical partitioning” situations where the features or variables relevant to a classification decision are spread over multiple files, each accessible to only one classifier); (ii) the files may be spread across several disks or even across different geographical locations; and (iii) the statistical quality of the data may vary widely (for example, the percentage of cases involving financial or health-care fraud varies across regions, and so does the amount of missing information). One can argue that by transferring all the data to a single warehouse and performing a series of merges and joins, one can obtain a single (albeit very large) flat file, to which a traditional algorithm can be applied after randomization and subsampling. In real applications, however, this approach may not be feasible because of the computational, bandwidth and storage costs.
In certain cases, pooling the data may not even be possible, for a variety of practical reasons including security, privacy, the proprietary nature of the data, the need for fault-tolerant distribution of data and services, real-time processing requirements, and statutory constraints imposed by law [Prodromidis et al., 2000]. There are then two options. If the owners of the individual databases are willing to provide high-level or summary information/decisions, such as local classification estimates, and transmit this information to a central location, then a meta-learner can be applied to the component decisions to produce a final, composite decision. Note that such high-level information not only has reduced storage and bandwidth requirements, but also maintains the privacy of individual records [DuMouchel et al., 1999]. Otherwise one has to resort to a distributed computing framework such as the emerging field of COllective INtelligence (COIN), wherein techniques are developed so that local and independent computations can still increase a desired global utility function [Wolpert and Tumer, 1999]. The first option leads to several issues reminiscent of studies in decision fusion [Dasarathy, 1994], applied largely to multi-sensor fusion and distributed control problems. It is also related to the theory of
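To make the first option concrete, the sketch below shows how a central site might combine the class-posterior estimates transmitted by the individual classifiers using order statistics. It is a minimal illustration rather than the chapter's exact formulation: the function names are hypothetical, and the particular weightings chosen for the trim and spread combiners (a symmetrically trimmed mean and the midrange of the ordered outputs, respectively) are assumptions made here for concreteness.

```python
import numpy as np

def order_statistic_combiner(outputs, k):
    """Select the k-th order statistic (0-indexed, ascending) of the
    classifier outputs for each class: k = N // 2 gives the median,
    k = N - 1 the maximum."""
    ordered = np.sort(np.asarray(outputs), axis=0)  # sort each class column
    return ordered[k]

def trim_combiner(outputs, trim=1):
    """Illustrative 'trim' combiner: average the ordered outputs after
    discarding the `trim` lowest and `trim` highest values per class,
    which suppresses outlying classifiers."""
    ordered = np.sort(np.asarray(outputs), axis=0)
    return ordered[trim:len(outputs) - trim].mean(axis=0)

def spread_combiner(outputs):
    """Illustrative 'spread' combiner: an equally weighted linear
    combination of the two extreme order statistics (the midrange)."""
    ordered = np.sort(np.asarray(outputs), axis=0)
    return 0.5 * (ordered[0] + ordered[-1])

# Five sites transmit posterior estimates for three classes; the last
# site deviates, e.g. because its local data are of poorer quality.
local_estimates = [
    [0.70, 0.20, 0.10],
    [0.65, 0.25, 0.10],
    [0.60, 0.30, 0.10],
    [0.75, 0.15, 0.10],
    [0.20, 0.65, 0.15],   # deviant site
]

for name, combined in [
    ("median", order_statistic_combiner(local_estimates, k=2)),
    ("max",    order_statistic_combiner(local_estimates, k=4)),
    ("trim",   trim_combiner(local_estimates, trim=1)),
    ("spread", spread_combiner(local_estimates)),
]:
    print(f"{name:6s} {np.round(combined, 3)} -> class {int(np.argmax(combined))}")
```

Here all four combiners side with the majority of the sites despite the deviant estimate; the chapter's analysis quantifies when such selection and trimming of the ordered outputs outperform plain averaging, particularly in the presence of outliers or uneven classifier performance.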
References

[1] A. E. Sarhan, et al. Estimation of Location and Scale Parameters by Order Statistics from Singly and Doubly Censored Samples, 1956.
[2] N. S. Barnett, et al. Private communication, 1969.
[3] G. Shepherd. The Synaptic Organization of the Brain, 1979.
[4] Jeffrey A. Barnett, et al. Computational Methods for a Mathematical Theory of Evidence, 1981, IJCAI.
[5] O. G. Selfridge, et al. Pandemonium: a paradigm for learning, 1988.
[6] Jude W. Shavlik, et al. Training Knowledge-Based Neural Networks to Recognize Genes, 1990, NIPS.
[7] Bruce W. Suter, et al. The multilayer perceptron as an approximation to a Bayes optimal discriminant function, 1990, IEEE Trans. Neural Networks.
[8] Marvin Minsky, et al. Logical Versus Analogical or Symbolic Versus Connectionist or Neat Versus Scruffy, 1991, AI Mag.
[9] Richard Lippmann, et al. Neural Network Classifiers Estimate Bayesian a posteriori Probabilities, 1991, Neural Computation.
[10] L. Cooper, et al. When Networks Disagree: Ensemble Methods for Hybrid Neural Networks, 1992.
[11] David H. Wolpert, et al. Stacked generalization, 1992, Neural Networks.
[12] Elie Bienenstock, et al. Neural Networks and the Bias/Variance Dilemma, 1992, Neural Computation.
[13] Joydeep Ghosh, et al. A neural network based hybrid system for detection, characterization, and classification of short-duration oceanic signals, 1992.
[14] Sherif Hashem, Bruce Schmeiser. Approximating a Function and its Derivatives Using MSE-Optimal Linear Combinations of Trained Feedforward Neural Networks, 1993.
[15] D. Farnsworth. A First Course in Order Statistics, 1993.
[16] M. Perrone. Improving regression estimation: Averaging methods for variance reduction with extensions to general convex measure optimization, 1993.
[17] Harris Drucker, et al. Boosting and Other Ensemble Methods, 1994, Neural Computation.
[18] Belur V. Dasarathy, et al. Decision fusion, 1994.
[19] Anders Krogh, et al. Neural Network Ensembles, Cross Validation, and Active Learning, 1994, NIPS.
[20] Lutz Prechelt, et al. PROBEN 1 - a set of benchmarks and benchmarking rules for neural network training algorithms, 1994.
[21] Kumpati S. Narendra, et al. Adaptation and learning using multiple models, switching, and tuning, 1995.
[22] J. Aggarwal, et al. A Comparative Study of Three Paradigms for Object Recognition - Bayesian Statistics, Neural Networks and Expert Systems, 1996.
[23] Kagan Tumer, et al. Analysis of decision boundaries in linearly combined neural classifiers, 1996, Pattern Recognit.
[24] Narendra Ahuja, et al. Advances in Image Understanding: A Festschrift for Azriel Rosenfeld, 1996.
[25] Thomas G. Dietterich. What is machine learning?, 2020, Archives of Disease in Childhood.
[26] Michael J. Pazzani, et al. Combining Neural Network Regression Estimates with Regularized Linear Weights, 1996, NIPS.
[27] Kagan Tumer, et al. Error Correlation and Error Reduction in Ensemble Classifiers, 1996, Connect. Sci.
[28] F. Provost. A Survey of Methods for Scaling Up Inductive Learning Algorithms, 1997.
[29] Joydeep Ghosh, et al. Hybrid intelligent architecture and its application to water reservoir control, 1997.
[30] Yoav Freund, et al. Boosting the margin: A new explanation for the effectiveness of voting methods, 1997, ICML.
[31] Erkki Oja, et al. Neural and statistical classifiers - taxonomy and two case studies, 1997, IEEE Trans. Neural Networks.
[32] Paul S. Bradley, et al. Refining Initial Points for K-Means Clustering, 1998, ICML.
[33] John Shawe-Taylor, et al. Generalization Performance of Support Vector Machines and Other Pattern Classifiers, 1999.
[34] Tim Oates, et al. Large Datasets Lead to Overly Complex Models: An Explanation and a Solution, 1998, KDD.
[35] D. Obradovic, et al. Combining Artificial Neural Nets, 1999, Perspectives in Neural Computing.
[36] Kagan Tumer, et al. Linear and Order Statistics Combiners for Pattern Classification, 1999, ArXiv.
[37] Joydeep Ghosh, et al. Structurally adaptive modular networks for nonstationary environments, 1999, IEEE Trans. Neural Networks.
[38] J. Ross Quinlan, et al. Simplifying decision trees, 1987, Int. J. Hum. Comput. Stud.
[39] Theodore Johnson, et al. Squashing flat files flatter, 1999, KDD '99.
[40] Daryl E. Hershberger, et al. Collective Data Mining: a New Perspective toward Distributed Data Mining, in Advances in Distributed Data Mining, 1999.
[41] Foster Provost, et al. Distributed Data Mining: Scaling up and beyond, 2000.
[42] Leo Breiman, et al. Bagging Predictors, 1996, Machine Learning.