Distributed Multivariate Regression Using Wavelet-Based Collective Data Mining

This paper presents a method for distributed multivariate regression using wavelet-based collective data mining (CDM). The method combines machine learning and communication theory with the statistical methods of parametric multivariate regression to provide an effective data mining technique for distributed data and computation environments. The technique is applied to two benchmark data sets and produces results consistent with those obtained by applying standard parametric regression techniques to centralized data sets. The method is evaluated in terms of model accuracy as a function of the appropriateness of the selected wavelet function, the relative number of nonlinear cross-terms, and the sample size. The evaluation demonstrates that accurate parametric multivariate regression models can be generated from distributed, heterogeneous data sets with minimal data communication overhead compared to that required to aggregate a distributed data set. Applying the method to linear discriminant analysis, which is related to parametric multivariate regression, produced classification results on the Iris data set that are comparable to those obtained with centralized data analysis.
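
The general idea can be illustrated with a minimal sketch. It is not the paper's implementation. It assumes a Haar wavelet, a sample count that is a power of two, features split column-wise across two sites with rows aligned, and a purely linear model with no cross-terms. The names `haar_transform`, `site_summary`, and `central_fit` are hypothetical. Each site applies an orthonormal Haar transform along the sample axis and sends only the `k` coarsest coefficients per column; the central site then fits least squares in the wavelet domain. Because the transform is orthonormal, using all coefficients reproduces the centralized least-squares fit, and smaller `k` trades accuracy for less communication.

```python
import numpy as np

def haar_transform(x):
    """Orthonormal Haar transform along axis 0 (length must be a power of two)."""
    out = np.asarray(x, dtype=float).copy()
    n = out.shape[0]
    if n < 1 or n & (n - 1):
        raise ValueError("number of samples must be a power of two")
    while n > 1:
        half = n // 2
        # Scaled pairwise sums (approximation) and differences (detail).
        approx = (out[0:n:2] + out[1:n:2]) / np.sqrt(2.0)
        detail = (out[0:n:2] - out[1:n:2]) / np.sqrt(2.0)
        out[:half] = approx
        out[half:n] = detail
        n = half
    return out

def site_summary(columns, k):
    """Hypothetical site-side step: keep the k coarsest coefficients per column."""
    return haar_transform(columns)[:k]

def central_fit(feature_summaries, y_summary, n_samples):
    """Hypothetical central step: least squares on the transmitted coefficients."""
    k = y_summary.shape[0]
    # The Haar transform of a column of ones is (sqrt(n), 0, ..., 0).
    intercept = np.zeros((k, 1))
    intercept[0, 0] = np.sqrt(n_samples)
    design = np.hstack([intercept] + list(feature_summaries))
    beta, *_ = np.linalg.lstsq(design, y_summary, rcond=None)
    return beta

# Synthetic, noise-free illustration (assumed data, not from the paper).
rng = np.random.default_rng(0)
n, k = 64, 16
site_a = rng.normal(size=(n, 2))   # features held at site A
site_b = rng.normal(size=(n, 1))   # feature held at site B
y = 1.0 + 2.0 * site_a[:, 0] - 3.0 * site_a[:, 1] + 0.5 * site_b[:, 0]

beta = central_fit(
    [site_summary(site_a, k), site_summary(site_b, k)],
    site_summary(y, k),
    n,
)
print(beta)  # intercept followed by the three slope estimates
```

In this sketch each site transmits `k` values per column instead of `n`, which is where the communication saving would come from. Nonlinear cross-terms between features held at different sites are not handled here.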
