Parallel Data Mining from Multicore to Cloudy Grids

We describe a suite of data mining tools that cover clustering, information retrieval and the mapping of high dimensional data to low dimensions for visualization. We present preliminary applications to particle physics, bioinformatics and medical informatics. The data vary in dimension from low (2-20) through high (thousands) to undefined (sequences for which dissimilarities, but not vectors, are defined). We use deterministic annealing to provide more robust algorithms that are relatively insensitive to local minima. We discuss the structure of the algorithms and their mapping to parallel architectures of different types, and we look at their performance on three classes of system: multicore, cluster and Grid, the last using a MapReduce style algorithm. Each approach is suitable in different application scenarios. We stress that data analysis/mining of large datasets can be a supercomputer application.
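
To make the role of deterministic annealing concrete, the following is a minimal illustrative sketch of annealed vector clustering in plain Python. It is not the implementation evaluated in this work: the function name `da_cluster`, the cooling schedule, the starting temperature heuristic, the perturbation size and all default parameter values are assumptions chosen for illustration. The idea it shows is that every point is softly assigned to all centers with weights proportional to exp(-d²/T), and the temperature T is lowered gradually so that the centers separate progressively rather than being fixed by a single initial guess.

```python
import math
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def da_cluster(points, k, t_end=1e-3, cooling=0.9, inner_iters=20, seed=0):
    """Illustrative deterministic-annealing clustering (all defaults are assumptions).

    points: list of equal-length numeric sequences; k: number of centers.
    Returns a list of k centers (lists of floats).
    """
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])
    mean = [sum(p[d] for p in points) / n for d in range(dim)]

    # Start hot: at this temperature all centers sit at (or near) the data mean.
    t = 2.0 * max(sq_dist(p, mean) for p in points)
    centers = [list(mean) for _ in range(k)]

    while t > t_end:
        # Small random perturbation so coincident centers can separate
        # once the temperature is low enough for a split to be stable.
        centers = [[c + 1e-3 * rng.uniform(-1.0, 1.0) for c in ctr]
                   for ctr in centers]
        for _ in range(inner_iters):
            sums = [[0.0] * dim for _ in range(k)]
            weights = [0.0] * k
            for p in points:
                d2 = [sq_dist(p, c) for c in centers]
                d_min = min(d2)  # subtract the minimum for numerical stability
                w = [math.exp(-(d - d_min) / t) for d in d2]
                z = sum(w)
                for j in range(k):
                    r = w[j] / z  # soft assignment of p to center j
                    weights[j] += r
                    for d in range(dim):
                        sums[j][d] += r * p[d]
            # Each center moves to the weighted mean of the points it attracts.
            centers = [
                [sums[j][d] / weights[j] for d in range(dim)]
                if weights[j] > 0.0 else centers[j]
                for j in range(k)
            ]
        t *= cooling  # geometric cooling schedule (an assumed choice)
    return centers


if __name__ == "__main__":
    # Two small, well-separated synthetic groups of 2D points.
    offsets = (-0.5, 0.0, 0.5)
    data = [(x + dx, x + dy) for x in (0.0, 10.0)
            for dx in offsets for dy in offsets]
    for center in sorted(da_cluster(data, 2)):
        print([round(c, 3) for c in center])
```

The per-point loop that accumulates `sums` and `weights` is independent across points, so in this sketch the data could be partitioned among workers with only the accumulated sums combined at each iteration; that is one way to see why such algorithms map onto both message-passing and MapReduce style runtimes.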
