Using Spearman's correlation coefficients for exploratory data analysis on big dataset

Correlation analysis is popular and useful in much social networking research, particularly in exploratory data analysis. In this paper, three well‐known and widely used correlation coefficients, the Pearson product–moment correlation coefficient and the Spearman and Kendall rank correlation coefficients, are compared, from their definitions to their application domains. Based on the characteristics of the pump's vibration dataset, the nonparametric, distribution‐free Spearman rank correlation coefficient is used to analyze the relationship between the pump's working state and each of the 207,880 variables. The percentage of variables with high Spearman correlation coefficients, together with tables listing those variables, is obtained for states I and II, states I and III, states II and III, and all three states, across the different files. These results are valuable for future research on an unsupervised machine learning system. Copyright © 2015 John Wiley & Sons, Ltd.
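
The screening step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the integer coding of the working states, the toy data, and the 0.8 cutoff for a "high" coefficient are all assumptions made for the example. Spearman's coefficient is computed here as the Pearson correlation of the ranks, with tied values sharing their average rank.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with order[i].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson product-moment correlation; NaN if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / sqrt(vx * vy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def high_correlation_share(states, columns, threshold=0.8):
    """Correlate each variable with the state labels.

    Returns the coefficients, the indices of variables whose absolute
    coefficient reaches the (assumed) threshold, and their percentage.
    """
    rhos = [spearman(states, col) for col in columns]
    high = [i for i, r in enumerate(rhos) if abs(r) >= threshold]
    return rhos, high, 100.0 * len(high) / len(columns)


# Toy data (hypothetical): six observations from two states, two variables.
states = [1, 1, 1, 2, 2, 2]
columns = [
    [0.10, 0.20, 0.15, 0.90, 1.10, 1.00],  # rises with the state
    [5, 1, 4, 2, 6, 3],                    # unrelated to the state
]
rhos, high, share = high_correlation_share(states, columns)
print(high, share)
```

Because the coefficient depends only on ranks, any monotone relationship between a variable and the state label scores highly, whether or not it is linear. For a dataset with hundreds of thousands of variables, a vectorized library routine would be a more practical choice than this pure-Python loop.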
