A characterization of data mining algorithms on a modern processor

In this paper, we characterize the performance and memory access behavior of several data mining algorithms. Specifically, we consider algorithms for frequent itemset mining, sequence mining, graph mining, clustering, outlier detection, and decision tree induction. Our study reveals that data mining algorithms are both compute- and memory-intensive. Furthermore, some algorithms have poor spatial locality, while most have poor temporal locality. Hardware prefetching helps the algorithms with good spatial locality, but most algorithms are unable to leverage simultaneous multithreading because of their memory-intensive nature. Consequently, all these algorithms grossly under-utilize a modern-day processor. Using the knowledge gleaned in this investigation, we briefly show how we improve the performance of a frequent itemset mining algorithm, FPGrowth, on a modern processor. Our study suggests that a specialized memory system with several thread contexts per processor is needed to allow these algorithms to scale on future microprocessors.
