Learning Classifiers from Distributed Data Sources

Many application domains have witnessed, in recent years, an exponential increase in the number, size, and diversity of autonomous data sources. The large size and distributed nature of the data, storage and bandwidth considerations, and, in some instances, privacy considerations prevent the centralized access to data that is typically assumed in standard machine learning approaches to knowledge acquisition. Hence, transforming this explosive increase in data into a commensurate increase in knowledge calls for scalable and cost-effective approaches to building predictive models from distributed data. This chapter summarizes a sufficient-statistics-based approach to learning a broad class of classifiers (including naïve Bayes and decision tree classifiers) from distributed data. This approach has been shown to yield results that are identical to those obtainable in settings where the learning algorithm has centralized access to all of the data. It can also be extended to settings where the distributed data sources differ from each other with respect to their structure (schema) and semantics (e.g., choice of attribute names, values, and granularity of data descriptions).

INTRODUCTION

The recent development of high-throughput data acquisition technologies in a number of domains (e.g., biological sciences, atmospheric sciences, space sciences, commerce), together with advances in digital storage, computing, and communications technologies, has resulted in the proliferation of a multitude of physically distributed data repositories created and maintained by autonomous entities (e.g., scientists, organizations). The resulting increasingly data-rich domains offer unprecedented opportunities for computer-assisted, data-driven knowledge acquisition in a number of applications including, in particular, data-driven scientific discovery (e.g., characterization of protein sequence-structure-function relationships in computational molecular biology), data-driven decision making in business and commerce, monitoring and control of complex systems (e.g., load forecasting in electric power networks), and security informatics (discovery of and countermeasures against attacks on critical information and communication infrastructures). Machine learning (Mitchell, 1997; Duda et al., 2000) offers one of the most cost-effective approaches to analyzing, exploring, and extracting knowledge (features, correlations, and other complex relationships and hypotheses that describe potentially interesting regularities) from data. However, the applicability of current approaches to machine learning in emerging data-rich applications is severely limited by a number of factors:

a. Data repositories are large in size, dynamic, and physically distributed. Consequently, it is neither desirable nor feasible to gather all of the data in a centralized location for analysis. Hence, there is a need for efficient algorithms for analyzing and exploring multiple distributed data sources without transmitting large amounts of data.

b. Autonomously developed and operated data sources often differ in their structure and organization (e.g., relational databases, flat files, etc.) and in the operations that can be performed on them (e.g., the types of queries supported: relational queries, statistical queries, keyword matches). Hence, there is a need for theoretically well-founded strategies for efficiently obtaining the information needed for analysis within the operational constraints imposed by the data sources.
The purpose of this entry is to precisely define the problem of learning classifiers from distributed data and to summarize recent advances that have led to a solution to this problem (Caragea et al., 2004; 2005).

BACKGROUND: PROBLEM SPECIFICATION

Given a data set D, a hypothesis class H, and a performance criterion P, an algorithm L for learning (from centralized data D) outputs a hypothesis h ∈ H that optimizes P. In pattern classification applications, h is a classifier (e.g., a decision tree, a support vector machine, etc.) (see Figure 1). The data D typically consists of a set of training examples. Each training example is an ordered tuple of attribute values, where one of the attributes corresponds to a class label and the remaining attributes represent inputs to the classifier. The goal of learning is to produce a hypothesis that optimizes the performance criterion (e.g., minimizing classification error on the training data) and the complexity of the hypothesis.

Figure 1: Learning from centralized data. The learner L is applied to the data D and outputs a hypothesis h.

In a distributed setting, a data set D is distributed among the sites 1,...,n containing the data set fragments D1,...,Dn. Two common types of data fragmentation are horizontal fragmentation and vertical fragmentation. In the case of horizontal fragmentation, each site contains a subset of the data tuples that make up D, i.e., D = D1 ∪ D2 ∪ ... ∪ Dn. In the case of vertical fragmentation, each site stores subtuples of the data tuples (corresponding to a subset of the attributes used to define the data tuples in D). In this case, D can be constructed by taking the join of the individual data sets D1,...,Dn (assuming a unique identifier for each data tuple is stored with the corresponding subtuples). More generally, the data may be fragmented into a set of relations (as in the case of the tables of a relational database, but distributed across multiple sites), i.e., D = D1 ⊗ D2 ⊗ ... ⊗ Dn (where ⊗ denotes the join operation). If a data set D is distributed among the sites 1,...,n containing the data set fragments D1,...,Dn, we assume that the individual data sets D1,...,Dn collectively contain (in principle) all the information needed to construct the data set D. More generally, D may be fragmented across multiple relations (Ozsu & Valduriez, 1999; Friedman et al., 1999).

The distributed setting typically imposes a set of constraints Z on the learner that are absent in the centralized setting. For example, the constraints Z may prohibit the transfer of raw data from each of the sites to a central location, while allowing the learner to obtain certain types of statistics from the individual sites (e.g., counts of instances that have specified values for some subset of attributes). In some applications of data mining (e.g., knowledge discovery from clinical records), Z might include constraints designed to preserve privacy.

The problem of learning from distributed data can be stated as follows (Caragea et al., 2004; 2005): given the fragments D1,...,Dn of a data set D distributed across the sites 1,...,n, a set of constraints Z, a hypothesis class H, and a performance criterion P, the task of the learner Ld is to output a hypothesis that optimizes P using only operations allowed by Z. Clearly, the problem of learning from a centralized data set D is a special case of learning from distributed data where n=1 and Z=∅.
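To make the two fragmentation schemes concrete, the following minimal sketch (in Python, using pandas; the toy attribute names and site contents are assumptions introduced here purely for illustration) reconstructs a centralized data set D from horizontal fragments by taking their union and from vertical fragments by joining on a unique tuple identifier.

```python
# A minimal sketch (toy data and column names are assumptions made for
# illustration) of the two fragmentation schemes described above.
import pandas as pd

# Horizontal fragmentation: each site holds a subset of the tuples of D,
# so D is the union of the fragments.
site1 = pd.DataFrame({"outlook": ["sunny", "rain"], "play": ["no", "yes"]})
site2 = pd.DataFrame({"outlook": ["overcast"], "play": ["yes"]})
D_horizontal = pd.concat([site1, site2], ignore_index=True)  # D = D1 ∪ D2

# Vertical fragmentation: each site holds subtuples over a subset of the
# attributes, keyed by a unique tuple identifier, so D is their join.
site_a = pd.DataFrame({"id": [1, 2, 3], "outlook": ["sunny", "rain", "overcast"]})
site_b = pd.DataFrame({"id": [1, 2, 3], "play": ["no", "yes", "yes"]})
D_vertical = site_a.merge(site_b, on="id")  # D = D1 ⊗ D2 (join on tuple id)

print(D_horizontal)
print(D_vertical)
```

In the distributed setting studied here, such a physical reconstruction of D is exactly what the constraints Z typically rule out; the strategy described below avoids it by exchanging statistics instead of tuples.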
Having defined the problem of learning from distributed data, we proceed to define some criteria that can be used to evaluate the quality of the hypothesis produced by an algorithm Ld for learning from distributed data relative to its centralized counterpart. We say that an algorithm Ld for learning from the distributed data sets D1,...,Dn is exact relative to its centralized counterpart L if the hypothesis produced by Ld is identical to that obtained by L from the data set D obtained by appropriately combining the data sets D1,...,Dn. A proof of exactness of an algorithm for learning from distributed data relative to its centralized counterpart ensures that a large collection of existing theoretical (e.g., sample complexity, error bounds) as well as empirical results obtained in the centralized setting apply in the distributed setting.

MAIN THRUST: STRATEGY FOR LEARNING FROM DISTRIBUTED DATA

Decomposition of the Learning from Data Task

A general strategy for designing algorithms for learning from distributed data that are provably exact with respect to their centralized counterparts (in the sense defined above) follows from the observation that most learning algorithms use only certain statistics computed from the data D in the process of generating the hypotheses that they output. (Recall that a statistic is simply a function of the data. Examples of statistics include the mean value of an attribute, counts of instances that have specified values for some subset of attributes, the most frequent value of an attribute, etc.) This yields a natural decomposition of a learning algorithm into two components (see Figure 2):

a. An information extraction component that formulates and sends a statistical query to a data source, and

b. A hypothesis generation component that uses the resulting statistic to modify a partially constructed hypothesis (and may further invoke the information extraction component as needed).

Figure 2: Learning = Statistical Query Answering & Hypothesis Generation. The learner's statistical query formulation component sends a query s(D, hi) to the data D, and the hypothesis generation component uses the answer s(D, hi) to produce the refined hypothesis hi+1 = R(hi, s(D, hi)).

Sufficient Statistics for Learning

A statistic s(D) is called a sufficient statistic for a parameter θ if s(D), loosely speaking, provides all the information needed for estimating the parameter from the data D. Thus, the sample mean is a sufficient statistic for the mean of a Gaussian distribution. A sufficient statistic s for a parameter θ is called a minimal sufficient statistic if for every sufficient statistic sθ for θ, there exists a function gsθ such that s(D) = gsθ(sθ(D)) (Casella & Berger, 2001). This notion of a sufficient statistic for a parameter θ can be generalized to yield a sufficient statistic sL,h(D) for learning a hypothesis h using a learning algorithm L applied to a data set D (Caragea et al., 2004). Trivially, the data D is a sufficient statistic for learning h using L. However, we are typically interested in statistics that are minimal or, at the very least, substantially smaller in size (in terms of the number of bits needed for encoding) than the data D itself.
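As an illustration of this decomposition, the sketch below (Python; the function names and the toy data are assumptions, not code from the chapter) learns a naive Bayes classifier from horizontally fragmented data by posing count queries to each site and combining the answers. Because joint counts are additive across sites, the combined statistics are identical to those a centralized learner would compute, so the resulting model is exact relative to its centralized counterpart.

```python
# A minimal sketch (names and toy data are assumptions) of the
# information-extraction / hypothesis-generation decomposition for a
# count-based learner (naive Bayes) over horizontally fragmented data.
from collections import Counter

def count_query(fragment, attr_index, class_index):
    """Answer a statistical query: counts of (attribute value, class label) pairs."""
    return Counter((row[attr_index], row[class_index]) for row in fragment)

def learn_naive_bayes(fragments, n_attrs, class_index):
    """Hypothesis generation: combine per-site counts into naive Bayes estimates."""
    class_counts = Counter()
    for frag in fragments:
        class_counts.update(row[class_index] for row in frag)
    total = sum(class_counts.values())
    model = {"priors": {c: n / total for c, n in class_counts.items()},
             "conditionals": {}}
    for a in range(n_attrs):
        if a == class_index:
            continue
        joint = Counter()
        for frag in fragments:                     # information extraction step
            joint.update(count_query(frag, a, class_index))
        model["conditionals"][a] = {               # hypothesis refinement step
            (v, c): n / class_counts[c] for (v, c), n in joint.items()}
    return model

# Two sites holding horizontal fragments of a toy (outlook, windy, play) data set.
D1 = [("sunny", "true", "no"), ("rain", "false", "yes")]
D2 = [("overcast", "false", "yes"), ("rain", "true", "no")]

distributed = learn_naive_bayes([D1, D2], n_attrs=3, class_index=2)
centralized = learn_naive_bayes([D1 + D2], n_attrs=3, class_index=2)
assert distributed == centralized   # exactness relative to centralized learning
```

Only the (attribute value, class label) counts ever leave a site, which is the kind of operation that a constraint set Z forbidding the transfer of raw data would still permit.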

REFERENCES

[1] Vasant Honavar, et al. AVT-NBL: An algorithm for learning compact and accurate naive Bayes classifiers from attribute value taxonomies and data. Fourth IEEE International Conference on Data Mining (ICDM'04), 2004.

[2] Vasant Honavar, et al. Algorithms and software for collaborative discovery from autonomous, semantically heterogeneous, distributed information sources. Discovery Science, 2005.

[3] Vasant Honavar, et al. A framework for learning from distributed data using sufficient statistics and its application to learning decision trees. International Journal of Hybrid Intelligent Systems, 2004.

[4] Vasant Honavar, et al. Algorithms and software for collaborative discovery from autonomous, semantically heterogeneous, distributed information sources. ALT, 2005.

[5] Foster J. Provost, et al. A survey of methods for scaling up inductive algorithms. Data Mining and Knowledge Discovery, 1999.

[6] Patrick Valduriez, et al. Principles of Distributed Database Systems. 1990.

[7] Philip K. Chan, et al. Meta-learning in distributed data mining systems: Issues and approaches. 2007.

[8] Lise Getoor, et al. Learning probabilistic relational models. IJCAI, 1999.

[9] Hillol Kargupta, et al. Distributed Data Mining: Algorithms, Systems, and Applications. 2003.

[10] Vipin Kumar, et al. Parallel formulations of decision-tree classification algorithms. Data Mining and Knowledge Discovery, 2004.

[11] Vasant Honavar, et al. Learning accurate and concise naïve Bayes classifiers from attribute value taxonomies and data. Knowledge and Information Systems, 2006.

[12] B. Ripley, et al. Pattern Recognition. Nature, 1968.

[13] Pedro M. Domingos. Knowledge acquisition from examples via multiple models. 1997.

[14] Thomas G. Dietterich. What is machine learning? Archives of Disease in Childhood, 2020.

[15] Yike Guo, et al. Parallel methods for scaling data mining algorithms to large data sets. 2001.

[16] Anne E. Trefethen, et al. Cyberinfrastructure for e-Science. Science, 2005.

[17] Robert L. Grossman, et al. Data mining tasks and methods: Parallel methods for scaling data mining algorithms to large data sets. 2002.