An Architectural Characterization Study of Data Mining and Bioinformatics Workloads

Data mining is the process of automatically finding implicit, previously unknown, and potentially useful information in large volumes of data. Advances in data extraction techniques have resulted in a tremendous increase in the input data size of data mining applications. Data mining systems, on the other hand, have been unable to maintain the same rate of growth. Therefore, there is an increasing need to understand the bottlenecks associated with the execution of these applications on modern architectures. In this paper, we present MineBench, a publicly available benchmark suite containing fifteen representative data mining applications belonging to various categories: classification, clustering, association rule mining, and optimization. First, we highlight the uniqueness of data mining applications. Subsequently, we evaluate the MineBench applications on an 8-way shared-memory (SMP) machine and analyze important performance characteristics such as L1 and L2 cache miss rates, branch misprediction rates
