Towards Data Mining on Emerging Architectures

Recent advances in microprocessor design have given rise to new commodity architectures. One such innovation, chip multiprocessing (CMP), places multiple cores on a single chip. Each core is an independent computational unit, allowing multiple processes to execute concurrently. A second advance, simultaneous multithreading (SMT), allows multiple processes to compete for resources simultaneously on a single core. SMT can improve overall throughput when CPU utilization is low. We investigate the implications of these advances for the design of data mining algorithms, focusing on frequent graph mining. Mining graph-based data sets has practical applications in many areas, including molecular substructure discovery, web link analysis, fraud detection, and social network analysis. In this work, we propose a novel approach for parallelizing graph mining on CMP architectures. We design a parallel algorithm with low memory consumption, low bandwidth requirements, and fine task granularity. We show that dynamic partitioning and dynamic task allocation provide a synergy that greatly improves scalability over a naive algorithm, from 5-fold to 27-fold on 32 nodes.
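The idea of dynamic task allocation can be illustrated with a minimal sketch. This is not the algorithm proposed in the paper: it uses frequent itemset mining over a toy transaction list as a stand-in for frequent graph mining, and the names `TRANSACTIONS`, `MIN_SUPPORT`, `support`, and `mine` are hypothetical. The sketch assumes one plausible reading of the approach, in which each frequent pattern's extensions are split into fine-grained tasks on a shared queue and idle workers pull whatever is available, rather than having the search space divided statically up front.

```python
import queue
import threading

# Toy transactions standing in for a graph database (illustrative assumption).
TRANSACTIONS = [
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "c"},
    {"b", "c"},
    {"a", "b", "c"},
]
MIN_SUPPORT = 3


def support(pattern):
    """Count the transactions that contain every item of the pattern."""
    return sum(1 for t in TRANSACTIONS if pattern <= t)


def mine(num_workers=4):
    """Mine frequent itemsets using a shared task queue."""
    items = sorted(set().union(*TRANSACTIONS))
    tasks = queue.Queue()
    frequent = {}
    lock = threading.Lock()

    # Seed the queue with one task per single-item pattern.
    for item in items:
        tasks.put((item,))

    def worker():
        while True:
            pattern = tasks.get()
            if pattern is None:  # sentinel: no more work
                tasks.task_done()
                return
            count = support(set(pattern))
            if count >= MIN_SUPPORT:
                with lock:
                    frequent[pattern] = count
                # Each extension becomes its own task, so any idle
                # worker can pick it up (hypothetical partitioning rule).
                for item in items:
                    if item > pattern[-1]:
                        tasks.put(pattern + (item,))
            tasks.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    tasks.join()  # returns once every queued task has been processed
    for _ in threads:
        tasks.put(None)  # one sentinel per worker
    for t in threads:
        t.join()
    return frequent
```

Child tasks are enqueued before the parent calls `task_done()`, so `tasks.join()` cannot return while work is still pending. In CPython the global interpreter lock prevents these threads from running the support counting in parallel, so the sketch shows only the scheduling structure, not the reported speedups.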
