Sampling operations on big data

The 3Vs of Big Data (Volume, Velocity, and Variety) continue to pose a major challenge for systems and algorithms designed to store, process, and disseminate information for discovery and exploration under real-time constraints. Common signal processing operations such as sampling and filtering, which have been used for decades to compress signals, are often undefined for data characterized by heterogeneity, high dimensionality, and a lack of known structure. In this article, we describe and demonstrate an approach to sampling large datasets such as social media data. We evaluate the effect of sampling on a common predictive analytic: link prediction. Our results indicate that even a heavily downsampled dataset can still yield meaningful link prediction results.
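
To make the idea of sampling followed by link prediction concrete, the following is a minimal sketch and not the method of this article. It assumes uniform random edge sampling and a common-neighbors score as stand-ins; the function names (`sample_edges`, `build_adjacency`, `common_neighbor_scores`) and the toy graph are hypothetical and chosen only for illustration.

```python
import random
from itertools import combinations


def sample_edges(edges, fraction, seed=0):
    """Keep each edge independently with probability `fraction`.

    Assumption: uniform edge sampling, used here only as a simple stand-in.
    """
    rng = random.Random(seed)
    return [edge for edge in edges if rng.random() < fraction]


def build_adjacency(edges):
    """Build an undirected adjacency map: node -> set of neighbors."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def common_neighbor_scores(adj):
    """Score each non-adjacent node pair by its number of shared neighbors."""
    scores = {}
    for neighbors in adj.values():
        # Every pair of neighbors of the same node shares that node.
        for u, v in combinations(sorted(neighbors), 2):
            if v not in adj[u]:
                scores[(u, v)] = scores.get((u, v), 0) + 1
    return scores


# Hypothetical toy graph, not data from the article.
toy_edges = [("a", "b"), ("a", "c"), ("b", "c"),
             ("b", "d"), ("c", "d"), ("d", "e")]

full_scores = common_neighbor_scores(build_adjacency(toy_edges))
sampled_scores = common_neighbor_scores(
    build_adjacency(sample_edges(toy_edges, fraction=0.5, seed=1))
)
```

Comparing the top-ranked pairs in `full_scores` and `sampled_scores` gives a rough sense of how much predictive signal survives sampling. A real evaluation would use held-out edges and a much larger graph.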
