Simulation of parallel similarity measure computations for large data sets

This paper presents our approach to implementing similarity measures for big data analysis in a parallel environment. We describe the algorithm used to parallelise the computations and report results both from a real MPI application that computes similarity measures and from our simulation software. The simulation environment allows us to model parallel systems of various sizes, built from components such as CPUs, GPUs, and network interconnects, and to model parallel applications in a meta-language. The simulations allow us to determine in detail how computations will be performed on particular hardware. They also allow us to predict the shapes of time curves beyond the range in which empirical results can be obtained, a range bounded by limited computational resources such as memory capacity.
