Fractal Dimension Calculation for Big Data Using Box Locality Index

The box-counting approach for fractal dimension calculation is scaled up to big data using a data structure named the box locality index (BLI). The BLI is constructed as key-value pairs in which the key indexes the location of a "box" (i.e., a grid cell in the multi-dimensional space) and the value counts the number of data points inside the box (i.e., the "box occupancy"). This key-value structure significantly simplifies the traditionally used hierarchical structures and encodes only the information required by the box-counting approach for fractal dimension calculation. Moreover, because the box occupancies (the values) associated with the same index (the key) are aggregatable, the BLI gives the box-counting approach the scalability needed for fractal dimension calculation on big data with distributed computing frameworks (e.g., MapReduce and Spark). Taking advantage of the BLI, MapReduce and Spark methods for fractal dimension calculation of big data are developed; they perform box-counting for each grid level as a cascade of MapReduce/Spark jobs in a bottom-up fashion. In an empirical validation, the MapReduce and Spark methods demonstrated good effectiveness and efficiency in fractal dimension calculation on a large synthetic dataset. In summary, this work provides an efficient solution for estimating the intrinsic dimension of big data, which is essential for many machine learning methods and data analytics tasks, including feature selection and dimensionality reduction.
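
To make the idea concrete, below is a minimal PySpark sketch of BLI-style box counting (an illustrative reconstruction, not the authors' exact implementation). Each point is mapped to the key of the grid cell ("box") containing it at a given cell size, the resulting pairs are reduced by key to obtain the box occupancies, and the box-counting dimension is estimated as the slope of log N(r) versus log(1/r), where N(r) is the number of occupied boxes of size r. Unlike the bottom-up cascade described above, which aggregates finer-level BLIs into coarser ones, this sketch simply recomputes the BLI from the raw points at each level for brevity; the function and variable names (box_key, count_occupied_boxes, levels) are assumptions introduced for illustration.

# Illustrative BLI-style box counting with PySpark (a sketch, not the paper's code).
import numpy as np
from pyspark import SparkContext

def box_key(point, cell_size):
    # Key of the box containing `point`: the tuple of per-dimension cell indices.
    return tuple(int(coord // cell_size) for coord in point)

def count_occupied_boxes(points_rdd, cell_size):
    # BLI as (box key, occupancy) pairs; occupancies are aggregatable,
    # so reduceByKey scales the counting to large, partitioned datasets.
    bli = (points_rdd
           .map(lambda p: (box_key(p, cell_size), 1))
           .reduceByKey(lambda a, b: a + b))
    return bli.count()  # number of non-empty boxes N(r)

def box_counting_dimension(points_rdd, levels):
    # One Spark job per grid level; `levels` gives the cell size r at each level.
    log_inv_r, log_n = [], []
    for r in levels:
        n_boxes = count_occupied_boxes(points_rdd, r)
        log_inv_r.append(np.log(1.0 / r))
        log_n.append(np.log(n_boxes))
    # Box-counting dimension = slope of log N(r) versus log(1/r).
    slope, _ = np.polyfit(log_inv_r, log_n, 1)
    return slope

if __name__ == "__main__":
    sc = SparkContext(appName="bli-box-counting")
    # Synthetic example: points on a line embedded in 3-D (intrinsic dimension ~1).
    rng = np.random.default_rng(0)
    data = [(x, x, x) for x in rng.random(100_000)]
    points = sc.parallelize(data, numSlices=8)
    print(box_counting_dimension(points, levels=[1/2, 1/4, 1/8, 1/16, 1/32]))

For the synthetic line embedded in 3-D space used in the example, the estimated slope should come out close to 1, its intrinsic dimension, even though the embedding dimension is 3.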
