Distributed learning using generative models
Distributed learning methods that involve extracting useful information from large, distributed repositories are becoming important for an increasing number of applications such as inter-enterprise data mining, sensor learning, etc. Effective deployment of these methods, however, requires addressing a number of practical challenges arising from privacy, proprietary and bandwidth restrictions while ensuring applicability to a broad range of data types and data partitioning scenarios. In this thesis, we present a probabilistic model-based framework for distributed learning that addresses most of these real-world issues.
Our distributed learning framework decouples the data privacy issues from the knowledge integration issues by following a "divide and conquer" strategy wherein the overall learning process is divided into two sub-tasks: (i) local learning, which involves training privacy-safe probabilistic models based on the local data, and (ii) model integration, where the local models are transmitted to a central combiner, which integrates them to obtain a global model based on the union of the features available at all the sites. To formalize and address these two tasks, we adopt a solution strategy that comprises three critical components.
First, we quantify the privacy of a data entity with respect to a probabilistic model using information-theoretic ideas so as to permit formal specification of the privacy restrictions in the local learning task. This definition is quite general and applicable to any privacy-preserving setting where privacy is perceived in terms of the uncertainty in predicting a data entity.
Then, we consider the problem of local learning based on generative models for two important tasks, namely clustering and co-clustering. For the first task, we focus on the recently proposed Bregman clustering framework [Ban05, BMDG05], which is applicable to a large class of distance measures called Bregman divergences, and propose a dual formulation and two clustering algorithms that are more suitable for privacy-sensitive learning. The second task, co-clustering, refers to the simultaneous clustering of rows and columns of a data matrix. For this task, we propose a fairly general framework that extends existing work on co-clustering by (i) allowing loss functions corresponding to all Bregman divergences, and (ii) permitting various conditional expectation-based constraints depending on the statistics of the data that need to be preserved. To accomplish the above generalizations, we introduce a new minimum Bregman information (MBI) principle that simultaneously generalizes the well-known maximum entropy and standard least squares principles. The proposed framework expands the applicability of generative learning to a number of interesting real-world domains involving large, sparse matrices such as text and micro-array analysis.
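As an illustration of the Bregman clustering setting referred to above, the following is a minimal Python sketch of Bregman hard clustering in its standard primal form. It is not the dual formulation or either of the privacy-oriented algorithms proposed in this thesis. The sketch relies on the known property that, for any Bregman divergence, the best cluster representative is the arithmetic mean of the cluster members, so only the assignment step depends on the chosen divergence. All function and parameter names (bregman_hard_clustering, init_centroids, and so on) are hypothetical and chosen for illustration only.

```python
import math


def squared_euclidean(x, mu):
    """Bregman divergence generated by phi(x) = ||x||^2."""
    return sum((a - b) ** 2 for a, b in zip(x, mu))


def generalized_i_divergence(x, mu):
    """Bregman divergence generated by phi(x) = sum x_i log x_i.

    Assumes all entries of x and mu are strictly positive.
    """
    return sum(a * math.log(a / b) - a + b for a, b in zip(x, mu))


def mean(points):
    """Coordinate-wise arithmetic mean of a non-empty list of points."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def bregman_hard_clustering(points, init_centroids, divergence, max_iters=100):
    """Alternate assignment and mean-update steps until labels stop changing.

    points         : list of equal-length sequences of floats
    init_centroids : list of k starting centroids (illustrative assumption:
                     the caller supplies them explicitly)
    divergence     : function d(x, mu) returning a non-negative float
    """
    centroids = [list(c) for c in init_centroids]
    k = len(centroids)
    labels = None
    for _ in range(max_iters):
        # Assignment step: nearest centroid under the chosen divergence.
        new_labels = [
            min(range(k), key=lambda j: divergence(x, centroids[j]))
            for x in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: the arithmetic mean minimizes the expected Bregman
        # divergence to the cluster members, whatever the divergence is.
        for j in range(k):
            members = [x for x, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = mean(members)
    return labels, centroids


# Usage on a small, made-up data set with two well-separated groups.
points = [
    (1.0, 1.0), (1.2, 0.9), (0.9, 1.1),
    (10.0, 10.0), (10.5, 9.5), (9.8, 10.2),
]
labels_sq, centroids_sq = bregman_hard_clustering(
    points, [points[0], points[3]], squared_euclidean
)
labels_kl, centroids_kl = bregman_hard_clustering(
    points, [points[0], points[3]], generalized_i_divergence
)
```

With squared_euclidean the procedure reduces to ordinary k-means. Swapping in generalized_i_divergence changes only the assignment rule, which is the sense in which a single algorithm covers the whole class of Bregman divergences.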
Finally, we address the model integration problem for both homogeneous and heterogeneous data sources. Using the maximum likelihood and maximum entropy principles, we provide a precise formulation of the model integration problem and provide efficient solutions for both discrete and continuous domains. We also present specialized solutions for certain common configurations involving hierarchical ordering and conditional independence constraints.
To highlight the generality of our framework, we provide empirical results for a variety of learning tasks such as clustering, classification and semi-supervised learning on different types of datasets consisting of interval, categorical and directional attributes. These results demonstrate that the proposed algorithms can provide high-quality results without much loss of privacy.