Survey on distributed data mining

With continuing technological breakthroughs in networking and communication, modern networks such as the Internet, mobile networks, and broadcast networks, together with the services built on them, have developed rapidly, and distributed computing environments have become ubiquitous in cyberspace. To maximize the value of the data accumulated in these environments, data mining technology is needed to discover hidden patterns or rules. This knowledge (patterns or rules) can support management decisions in daily production and operation, improving the quality of decision-making and the resulting gains. However, because of prevailing heterogeneity, proprietary restrictions, platform compatibility problems, and other limitations, as well as industry competition and legal constraints (such as personal or corporate data privacy), data sources interconnected by networks are difficult to mine centrally. Distributed data mining (DDM) technology emerged in response.
First, this paper introduces the definition, framework, applicable scenarios, and current research challenges of DDM. According to the high-level architecture of DDM introduced in this paper, the quality of the final result is closely related to the type, availability, and quality of the local data sources and to the method used to integrate the local results. DDM is not necessarily implemented with the sites operating purely independently of one another. In addition, DDM can also be adopted when the data are centralized but the system contains distributed sites. Currently, the main challenges in DDM research include heterogeneous and homogeneous mining, data variability in dynamic environments, communication cost, knowledge integration, and semantic heterogeneity.
Second, current DDM systems are divided into four categories: 1) Systems based on multi-agent technology. Agent autonomy is used for local mining to protect data privacy; agent initiative is used to reduce user involvement and raise the level of automation in mining; agent collaboration is used for multi-algorithm cooperative mining. 2) Grid-based systems. The advantages of the grid in resource sharing, open services, and collaborative work are used to improve reliability and interoperability in mining. 3) Systems based on meta-learning. Through meta-learning, the selection and combination of mining algorithms are optimized, and the quality of the results is improved by repeatedly training on the knowledge. 4) Systems based on the CDM (Collective Data Mining) framework. The function to be learned is expressed as a set of distributed basis functions; each data source is allowed to select a different learning algorithm, and overall network traffic is reduced on the premise that the global result remains correct.
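The high-level architecture described above (local mining at each site, followed by integration of the local results) can be illustrated with a minimal sketch. It assumes homogeneous sites that hold horizontally partitioned transactions and exchange only itemset counts, never raw records. The function names `mine_local` and `integrate`, the sample transactions, and the support threshold are hypothetical and chosen only for illustration; they do not correspond to any particular system surveyed here.

```python
from collections import Counter
from itertools import combinations


def mine_local(transactions, max_size=2):
    """Local mining step: count itemsets of up to max_size items at one site."""
    counts = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))  # canonical order, duplicates removed
        for k in range(1, max_size + 1):
            counts.update(combinations(items, k))
    # Only the counts and the number of transactions leave the site.
    return counts, len(transactions)


def integrate(local_results, min_support):
    """Integration step: merge local counts and keep globally frequent itemsets."""
    total_counts = Counter()
    total_transactions = 0
    for counts, n in local_results:
        total_counts.update(counts)
        total_transactions += n
    return {
        itemset: count / total_transactions
        for itemset, count in total_counts.items()
        if count / total_transactions >= min_support
    }


# Hypothetical data held at two sites (illustrative assumption).
site_a = [["milk", "bread"], ["milk", "eggs"], ["bread"]]
site_b = [["milk", "bread", "eggs"], ["milk", "bread"], ["eggs"]]

local_results = [mine_local(site_a), mine_local(site_b)]
global_frequent = integrate(local_results, min_support=0.5)
```

Because every site reports complete counts without local pruning, the merged supports equal those of a centralized run over the pooled data. The price is communication cost, which is one of the challenges listed above; a practical system would prune or compress what each site sends.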
Furthermore, the common issues in current DDM research are summarized: 1) Result quality. DDM systems do not consider the intrinsic semantic relations among the data sources at each site. Each site mines its local data independently, with no data interchange or fusion with other sites at the semantic level. The DDM job is executed in a pure "split style", which ultimately damages the quality of the global result. 2) Mining efficiency. This is the problem of how to schedule resources to achieve load balancing and reduce communication cost in collaborative mining.
For result quality, this paper explores a solution that combines ontology and data mining. As the basis of the Semantic Web, ontology can provide effective support for measuring the semantic distance between objects. Researchers have already conducted exploratory work that uses ontology to describe the domain context of a mining task and the data mining process itself. For example, to select the effective rules from the massive number produced by association rule mining, some researchers have proposed an interactive post-mining approach for deleting redundant rules. Given the input and output types of a knowledge discovery process, others have provided a solution for automatically constructing knowledge discovery workflows. This survey suggests that one strategy for improving the quality of both the distributed local results produced during mining and the final global result is to combine DDM theory with ontology theory: take the measurement of semantic distance between data sources as the point of breakthrough, establish a compound quantification system for semantic distance measurement, and finally achieve the goal by building and solving a new DDM model.
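As a rough illustration of measuring semantic distance with an ontology, the sketch below counts the is-a edges on the path between two concepts through their lowest common ancestor. This path-length measure is only one simple choice and is not the compound quantification system proposed here. The `PARENT` taxonomy and the function names are hypothetical.

```python
def ancestors(concept, parent):
    """Return the path from a concept up to the root, starting with the concept."""
    path = [concept]
    while concept in parent:
        concept = parent[concept]
        path.append(concept)
    return path


def semantic_distance(a, b, parent):
    """Count is-a edges between a and b through their lowest common ancestor."""
    path_a = ancestors(a, parent)
    depth_from_b = {c: i for i, c in enumerate(ancestors(b, parent))}
    for steps_from_a, concept in enumerate(path_a):
        if concept in depth_from_b:
            return steps_from_a + depth_from_b[concept]
    raise ValueError("concepts share no common ancestor")


# Hypothetical is-a taxonomy (child -> parent), an illustrative assumption.
PARENT = {
    "customer": "party",
    "supplier": "party",
    "party": "entity",
    "order": "document",
    "invoice": "document",
    "document": "entity",
}

d_siblings = semantic_distance("customer", "supplier", PARENT)
d_cousins = semantic_distance("customer", "invoice", PARENT)
```

Under this measure, concepts that share a direct parent are closer than concepts related only through the root. A DDM system could use such a value to decide which sites hold semantically related data and should exchange or fuse intermediate results.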