In many health-related studies, investigators wish to assess the strength of an association between 2 measured (continuous) variables. For example, the relation between high-sensitivity C-reactive protein (hs-CRP) and body mass index (BMI) may be of interest. Although BMI is often treated as a categorical variable (eg, underweight, normal, overweight, and obese), a noncategorized version is more detailed and thus may be more informative for detecting associations. Correlation and regression are 2 related and widely used approaches for assessing the strength of an association between 2 variables. Correlation provides a unitless measure of association (usually linear), whereas regression provides a means of predicting one variable (the dependent variable) from the other (the predictor variable). This report summarizes correlation coefficients and least-squares regression, including intercept and slope coefficients.
Correlation provides a “unitless” measure of association between 2 variables, ranging from −1 (indicating perfect negative association) to 0 (no association) to +1 (perfect positive association). Both variables are treated equally in that neither is considered to be a predictor or an outcome.
The most commonly used version is the Pearson product-moment coefficient of correlation, $r$. Suppose one wants to estimate the correlation between X = BMI, denoted for the ith subject as $X_i$, and Y = hs-CRP, denoted for the ith subject as $Y_i$. This is estimated for a sample of size n (i = 1, …, n) using the following formula^1:

$$ r = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{(n-1)\,S_X S_Y}, $$

where

$$ S_X = \sqrt{\frac{\sum_{i=1}^{n}(X_i - \bar{X})^2}{n-1}} $$

and

$$ S_Y = \sqrt{\frac{\sum_{i=1}^{n}(Y_i - \bar{Y})^2}{n-1}}. $$

Here, $\bar{X}$ indicates the sample mean of X (=BMI), and $\bar{Y}$ the sample mean of Y (=hs-CRP). The numerator of $r$ reflects how BMI and hs-CRP co-vary, and the denominator reflects the variability of both BMI and hs-CRP about their respective sample means.
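To make the calculation concrete, the short Python sketch below evaluates this formula on a small set of invented BMI and hs-CRP values (the numbers are hypothetical and serve only to illustrate the arithmetic; they are not from any study):

```python
import math

# Hypothetical BMI (kg/m^2) and hs-CRP (mg/L) values for n = 6 subjects;
# invented purely to illustrate the calculation.
bmi = [21.4, 24.8, 27.1, 30.5, 33.2, 36.0]
crp = [0.8, 1.1, 1.9, 2.6, 3.4, 4.1]

n = len(bmi)
x_bar = sum(bmi) / n  # sample mean of X (BMI)
y_bar = sum(crp) / n  # sample mean of Y (hs-CRP)

# Numerator: how X and Y co-vary about their respective sample means.
co_variation = sum((x - x_bar) * (y - y_bar) for x, y in zip(bmi, crp))

# Sample standard deviations S_X and S_Y.
s_x = math.sqrt(sum((x - x_bar) ** 2 for x in bmi) / (n - 1))
s_y = math.sqrt(sum((y - y_bar) ** 2 for y in crp) / (n - 1))

r = co_variation / ((n - 1) * s_x * s_y)
print(f"Pearson r = {r:.3f}")  # close to +1 for these strongly increasing pairs
```

In practice, the same value (along with a P value for the null hypothesis of zero correlation) can be obtained from a standard routine such as scipy.stats.pearsonr.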
The Pearson correlation coefficient assumes that X and Y are jointly distributed as bivariate normal, ie, X and Y each are normally distributed, as is their joint distribution.
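One informal way to probe the marginal part of this assumption (a sketch under the same hypothetical data as above, not a procedure prescribed by this report) is to apply a normality test such as Shapiro-Wilk to each variable; note that marginal normality is necessary but not sufficient for bivariate normality:

```python
from scipy import stats

# Same hypothetical samples as above (invented for illustration).
bmi = [21.4, 24.8, 27.1, 30.5, 33.2, 36.0]
crp = [0.8, 1.1, 1.9, 2.6, 3.4, 4.1]

# Shapiro-Wilk test of each marginal distribution;
# a small P value suggests a departure from normality.
for name, sample in [("BMI", bmi), ("hs-CRP", crp)]:
    w, p = stats.shapiro(sample)
    print(f"{name}: W = {w:.3f}, P = {p:.3f}")
```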
[1] Gunst RF, et al. Applied Regression Analysis. Technometrics. 1999.
[2] Rousseeuw PJ, et al. Robust Regression and Outlier Detection. 1987.
[3] Ragland D, et al. Dichotomizing continuous outcome variables: dependence of the magnitude of association and statistical power on the cutpoint. Epidemiology. 1992.
[4] Muir WW, et al. Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. 1980.
[5] Malik MB, et al. Applied Linear Regression. Technometrics. 2005.
[6] Draper NR, et al. Applied Regression Analysis. 2nd ed. Wiley Series in Probability and Mathematical Statistics. 1981.
[7] Mazumdar M, et al. Categorizing a prognostic variable: review of methods, code for easy implementation and applications to decision-making about cancer treatments. Statistics in Medicine. 2000.
[8] Jaspen N. Applied Nonparametric Statistics. 1979.