A new correlation coefficient between categorical, ordinal and interval variables with Pearson characteristics

A prescription is presented for a new and practical correlation coefficient, $\phi_K$, based on several refinements to Pearson's hypothesis test of independence of two variables. The combined features of $\phi_K$ give it an advantage over existing coefficients. First, it works consistently between categorical, ordinal and interval variables. Second, it captures non-linear dependency. Third, it reverts to the Pearson correlation coefficient in the case of a bivariate normal input distribution. These are useful features when studying the correlation between variables of mixed types. Particular attention is paid to the proper evaluation of the statistical significance of correlations and to the interpretation of variable relationships in a contingency table, in particular in the case of low-statistics samples and significant dependencies. Three practical applications are discussed. The presented algorithms are easy to use and available through a public Python library.
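
As background for the test on which $\phi_K$ builds, the following is a minimal sketch of the textbook Pearson $\chi^2$ statistic for independence in a contingency table. It is not the $\phi_K$ algorithm and does not use the paper's library. The function name `pearson_chi2` and the example counts are illustrative assumptions, and the sketch assumes every row and column total is non-zero.

```python
def pearson_chi2(table):
    """Textbook Pearson chi-square statistic and degrees of freedom
    for an r x c contingency table given as a list of rows of counts.

    Assumes every row total and column total is non-zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under the hypothesis of independence.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected

    # Standard degrees of freedom for an r x c table.
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, dof


# Hypothetical 2 x 2 table of observed counts.
chi2, dof = pearson_chi2([[10, 20], [20, 10]])
print(chi2, dof)
```

A table whose cells match the product of its marginals, such as `[[10, 20], [20, 40]]`, gives a statistic of zero; larger values indicate a stronger departure from independence.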
