A Distributed Approach for High-Dimensionality Heterogeneous Data Reduction

The recent explosion in data size, both in the number of records and in the number of attributes, has driven the development of Big Data analytics as well as parallel data processing methods and algorithms. At the same time, it has increased the need for data Dimensionality Reduction (DR) procedures, because more data is not always better: large datasets can degrade the performance of analytics applications, notably when values are missing. Missing data are a common occurrence and can significantly affect the conclusions that can be drawn from the data. In this work, we propose a new distributed statistical approach, based on the MapReduce paradigm, for reducing the dimensionality of high-dimensional heterogeneous data; it limits the curse of dimensionality and handles missing values, which are imputed with a Random Forest method. The main purpose is to extract useful information and to reduce the search space so as to facilitate data exploration. Several illustrative numerical examples on data from publicly available machine learning repositories are included, and the experimental part of the study demonstrates the efficiency of the proposed analytical approach.
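As a concrete illustration of the impute-then-reduce idea described above, the following minimal, single-node sketch imputes missing values with a Random-Forest-based iterative imputer (in the spirit of missForest) and then applies PCA. It assumes scikit-learn and a toy numeric matrix; it is not the authors' distributed MapReduce implementation, and the handling of heterogeneous (mixed numeric/categorical) attributes is omitted.

```python
# Minimal single-node sketch (assumption: scikit-learn is available).
# Step 1: impute missing values with a Random-Forest-based iterative imputer,
#         in the spirit of missForest (Stekhoven & Buehlmann, 2011).
# Step 2: reduce the dimensionality of the completed matrix with PCA.
# The toy data, missingness rate, and parameter values are illustrative only.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))           # toy numeric dataset (assumption)
X[rng.random(X.shape) < 0.10] = np.nan   # ~10% of the values missing at random

imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=50, n_jobs=-1, random_state=0),
    max_iter=5,
    random_state=0,
)
X_complete = imputer.fit_transform(X)    # Random-Forest imputation of missing entries

X_reduced = PCA(n_components=5).fit_transform(X_complete)  # dimensionality reduction
print(X_reduced.shape)  # (500, 5)
```

In the paper's setting, both steps would be distributed over a MapReduce cluster and extended to mixed-type attributes; the sketch above only shows the statistical core on a single machine.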
