Toward a Distributed Knowledge Discovery system for Grid systems

Over the last decade or so, we have seen a deluge of data, not only from the sciences but also from industry and commerce. Although the amount of data available to us is constantly increasing, processing it is becoming more and more difficult. Efficient discovery of useful knowledge from these datasets is therefore becoming both a challenge and a massive economic need. This has led to the need to develop large-scale data mining (DM) techniques that can deal with these huge datasets, whether they come from scientific or economic applications. In this chapter, we present a new distributed data mining (DDM) system that combines data-driven and architecture-driven strategies. Data-driven strategies consider the size and heterogeneity of the data, while architecture-driven strategies focus on the distribution of the datasets. The system is based on Grid middleware tools that integrate appropriate operations for manipulating large data. This allows more dynamicity and autonomicity during the mining, integrating and processing phases.
