A distributed learning framework for heterogeneous data sources

We present a probabilistic model-based framework for distributed learning that accounts for privacy restrictions and applies to scenarios where different sites hold diverse, possibly overlapping subsets of features. Our framework decouples data privacy issues from knowledge integration issues by requiring the individual sites to share only privacy-safe probabilistic models of their local data, which are then integrated into a global probabilistic model over the union of the features available at all the sites. We provide a mathematical formulation of the model integration problem using the maximum likelihood and maximum entropy principles and describe iterative algorithms that are guaranteed to converge to the optimal solution. For certain commonly occurring special cases involving hierarchically ordered feature sets or conditional independence, we obtain closed-form solutions and use these to propose an efficient alternative scheme based on recursive decomposition of the model integration problem. To address interpretability concerns, we also present a modified formulation in which the global model is assumed to belong to a specified parametric family. Finally, to highlight the generality of our framework, we provide empirical results for learning tasks such as clustering and classification on datasets consisting of continuous vector, categorical, and directional attributes. The results show that high-quality global models can be obtained without much loss of privacy.
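To make the integration step concrete, here is a minimal sketch of one classical instance of the maximum-entropy formulation described above: iterative proportional fitting (IPF) over discrete features. This is an illustrative stand-in, not the paper's own algorithm; the feature layout (two hypothetical sites sharing marginals over overlapping feature subsets `(A, B)` and `(B, C)`) and all names are assumptions for the example. Each iteration rescales the current joint so that its marginal on one site's feature subset matches that site's shared model, which is exactly the kind of iterative scheme that converges to the maximum-entropy joint consistent with all local models.

```python
import numpy as np

def integrate_models(marginals, shape, iters=50):
    """Iterative proportional fitting: find the maximum-entropy joint
    distribution over all features whose marginals match each site's
    shared local model.

    marginals: list of (axes, table) pairs, where `axes` are the
    (sorted) feature indices a site observes and `table` is its local
    marginal distribution over those features.
    shape: cardinality of each feature in the global model.
    """
    joint = np.full(shape, 1.0 / np.prod(shape))  # start from uniform
    for _ in range(iters):
        for axes, target in marginals:
            other = tuple(i for i in range(len(shape)) if i not in axes)
            current = joint.sum(axis=other)  # current marginal on `axes`
            # Rescale so the marginal on `axes` matches the site's model.
            ratio = np.where(current > 0,
                             target / np.maximum(current, 1e-300), 0.0)
            idx = tuple(slice(None) if i in axes else np.newaxis
                        for i in range(len(shape)))
            joint = joint * ratio[idx]
    return joint

# Hypothetical privacy-safe local models from two sites with the
# overlapping feature B (their B-marginals agree, so a joint exists).
m_ab = np.array([[0.30, 0.20],
                 [0.10, 0.40]])   # site 1: P(A, B)
m_bc = np.array([[0.25, 0.15],
                 [0.35, 0.25]])   # site 2: P(B, C)

joint = integrate_models([((0, 1), m_ab), ((1, 2), m_bc)], shape=(2, 2, 2))
```

Because the two feature subsets here form a decomposable structure (they overlap only in B), IPF actually reaches the closed form P(A,B)·P(B,C)/P(B) after a single sweep, mirroring the kind of special-case closed-form solution the abstract mentions for conditionally independent feature sets.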
