A model for mining distributed frequent sequences

Data mining aims to discover patterns and extract useful information from facts recorded in databases. However, real-life applications are inherently distributed, so distributed data mining is a more natural way to view data mining in general. A common approach to mining geographically distributed data is to build separate models at the distributed sites and then combine the models at a central site. At the other extreme, all of the data can be moved to a central site and a single model built. Over the commodity internet and with large datasets, the former approach is quicker but often less accurate, while the latter is more accurate but generally quite expensive in terms of the time required. This thesis introduces a new framework and methodology for distributed data mining (DDM) that is an intermediate solution between these two approaches. It is intermediate because it adopts the first approach but combines models only when there is strong evidence of their similarity. This improves accuracy and shortens response time. In this model, differences and similarities between distributed data sites are addressed explicitly and expressed as similarity values between sites. The framework reduces the problem to one of similarity between models, and solving the reduced problem requires a similarity measure. A measure based on the idea that similarity should reflect how much work is needed to transform one model into another is formalized and verified through experiments. Applying this similarity measure to datasets places them in clusters according to their similarities. Finally, the sites in each cluster participate in one global model. Experiments on this framework show that models resulting from the proposed strategy give better results than those resulting from the central strategy. They also show that this framework effectively bridges the two simple approaches to distributed data mining that are common today.
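The clustering step described above can be illustrated with a minimal sketch. It is not the similarity measure formalized in the thesis: it assumes, purely for illustration, that each site's model is a mapping from a frequent sequence to its support in [0, 1], that the "work" needed to transform one model into another is the total support change over all sequences, and that sites are grouped whenever their similarity reaches a chosen threshold. The names `similarity`, `cluster_sites`, and `threshold` are hypothetical.

```python
from itertools import combinations


def similarity(model_a, model_b):
    """Similarity in [0, 1] derived from an assumed transformation cost.

    Assumption: a model maps a sequence (tuple) to its support in [0, 1].
    The cost of turning one model into the other is taken to be the sum of
    absolute support differences; a missing sequence counts as support 0.
    """
    patterns = set(model_a) | set(model_b)
    if not patterns:
        return 1.0  # two empty models are treated as identical
    cost = sum(abs(model_a.get(p, 0.0) - model_b.get(p, 0.0)) for p in patterns)
    return 1.0 - cost / len(patterns)


def cluster_sites(models, threshold):
    """Group sites whose models are similar enough (single-link grouping).

    Assumption: two sites belong together when their similarity is at least
    `threshold`; groups are merged transitively with a union-find structure.
    """
    parent = {site: site for site in models}

    def find(site):
        # Follow parent links to the group representative, compressing paths.
        while parent[site] != site:
            parent[site] = parent[parent[site]]
            site = parent[site]
        return site

    for a, b in combinations(models, 2):
        if similarity(models[a], models[b]) >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for site in models:
        clusters.setdefault(find(site), set()).add(site)
    return list(clusters.values())


# Hypothetical local models from three sites (illustrative values only).
models = {
    "site_a": {("a", "b"): 0.6, ("b", "c"): 0.4},
    "site_b": {("a", "b"): 0.5, ("b", "c"): 0.5},
    "site_c": {("x", "y"): 0.9},
}
clusters = cluster_sites(models, threshold=0.8)
```

Under these assumptions, each resulting cluster would then contribute to one global model, while dissimilar sites keep their models separate.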