High-performance data mining with skeleton-based structured parallel programming

We show how to apply a Structured Parallel Programming methodology based on skeletons to Data Mining problems, reporting results for three commonly used mining techniques: association rules, decision tree induction and spatial clustering. We analyze the structural patterns common to these applications, considering both application performance and software engineering efficiency. Our aim is to state clearly which features a Structured Parallel Programming Environment (PPE) should have to be useful for parallel Data Mining. Within SkIE, the skeleton-based PPE that we have developed, we study the data access patterns of parallel implementations of Apriori, C4.5 and DBSCAN. To allow efficient development of applications on huge databases, the environment must handle reads of large partitions, frequent and sparse accesses to small blocks, and irregular mixes of small and large transfers. We also examine the addition of an object/component interface to the skeleton-structured model, to simplify the development of parallel Data Mining applications that are integrated with the environment.
