If you’ve ever seen a correlation coefficient, you’ve probably looked at the number and wondered, is that good? Is a correlation of -0.73 good but not a correlation of +0.58? Just what is a good correlation and what makes a correlation good?
The strength of the linear relationship between two variables is usually expressed by the Pearson product-moment correlation coefficient, denoted by r. Pearson correlation coefficients range in value from -1.0 to +1.0, where:
-1.0 represents a perfect correlation in which all measured points fall on a line having a negative slope
0.0 represents absolutely no linear relationship between the variables
+1.0 represents a perfect correlation of points on a line having a positive slope.
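To make those three reference values concrete, here is a minimal sketch of the usual textbook formula for r, using only the Python standard library. The function name `pearson_r` and the small datasets are made up for illustration; they aren't from any particular package.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length with at least 2 points")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products and sums of squares of the deviations from the means
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when one variable is constant")
    return sxy / math.sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]
falling = [10 - 2 * v for v in x]    # every point on a line with negative slope
rising = [3 + 0.5 * v for v in x]    # every point on a line with positive slope
scattered = [2, 1, 3, 1, 2]          # made-up values with no linear trend

print(pearson_r(x, falling))
print(pearson_r(x, rising))
print(pearson_r(x, scattered))
```

Note that the slope of the line doesn't matter, only its sign. A steep line and a shallow line both give a perfect correlation as long as every point sits on the line.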
If you have a dataset with more than one variable, you’ll want to look at correlation coefficients.
The Pearson correlation coefficient is used when both variables are measured on a continuous (i.e., interval or ratio) scale. There are several variations of the Pearson product-moment correlation coefficient. The multiple correlation coefficient, denoted by R, indicates the strength of the relationship between a dependent variable and two or more independent variables. The partial correlation coefficient indicates the strength of the relationship between a dependent variable and one or more independent variables with the effects of other independent variables held constant. The adjusted or shrunken correlation coefficient indicates the strength of a relationship between variables after correcting for the number of variables and the number of data points. There are also correlation coefficients for variables measured on noncontinuous scales. The Spearman rank correlation coefficient (Spearman's rho, which some software labels Spearman R), for instance, is computed from ordinal-scale ranks.
Types of Correlation Coefficients.
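Two of those variations are easy to sketch. The Spearman coefficient can be computed as a Pearson correlation on the ranks of the data, and one common form of the shrunken coefficient is the adjusted R-square, 1 - (1 - R-square)(n - 1)/(n - p - 1). I'm assuming that form here; other shrinkage formulas exist. The function names and example numbers are hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1   # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman rank correlation: Pearson r computed on the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))

def adjusted_r_squared(r_squared, n, p):
    """Adjusted R-square for n data points and p independent variables."""
    if n - p - 1 <= 0:
        raise ValueError("need more data points than variables plus one")
    return 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

x = [1, 2, 3, 4, 5]
cubed = [v ** 3 for v in x]               # increasing, but not a straight line
print(pearson_r(x, cubed))                # below 1 because of the curvature
print(spearman_rho(x, cubed))             # the ranks line up perfectly
print(adjusted_r_squared(0.64, n=10, p=3))
```

With only 10 data points and 3 independent variables, the correction is noticeable. With hundreds of data points it would barely change the value.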
So, what is a good correlation? It depends on who you ask.
- I once asked a chemist who was calibrating a laboratory instrument to a standard what value of the correlation coefficient she was looking for. “0.9 is too low. You need at least 0.98 or 0.99.” She got the number from a government guidance document.
- I once asked an engineer who was conducting a regression analysis of a treatment process what value of the correlation coefficient he was looking for. “Anything between 0.6 and 0.8 is acceptable.” His college professor told him this.
- I once asked a biologist who was conducting an ANOVA of the size of field mice living in contaminated versus pristine soils what value of the correlation coefficient he was looking for. He didn’t know, but his cutoff was 0.2 based on the smallest size difference his model could detect with the number of samples he had.
Is 0.2 a good correlation or does a good correlation have to be at least 0.6 or even 0.98? As it turns out, the chemist, the engineer, and the biologist were all right. Those correlations were all good for those uses. So, the meaningfulness of a correlation coefficient depends, in part, on the expectations of the person using it.
But how do you know what value of a correlation coefficient you should expect for it to be good? One answer is to look at the square of the correlation coefficient, called the coefficient of determination, R-square, or just R². R-square is an estimate of the proportion of variance in the dependent variable that is accounted for by the independent variable(s). It is used commonly to interpret the strength of the relationship between variables and to compare alternative statistical models.
You might be able to decide how good your correlation is from a gut feel for how much of the variability you wanted a relationship to account for. For example, correlation coefficient values between approximately -0.3 and +0.3 mean that one variable accounts for less than 9 percent of the variance in the other, which might indicate a weak or non-existent relationship. Values between -0.3 and -0.6 or +0.3 and +0.6 account for 9 percent to 36 percent of the variance, which might indicate a weak to moderately strong relationship. Values between -0.6 and -0.8 or +0.6 and +0.8 account for 36 percent to 64 percent of the variance, which might indicate a moderately strong to strong relationship. Values between -0.8 and -1.0 or +0.8 and +1.0 account for more than 64 percent of the variance, which might indicate a very strong relationship.
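Those rough bands are simple enough to put in a few lines of code. This is only a sketch of the rule of thumb above: the function name is made up, and which band a value exactly on a boundary (0.3, 0.6, 0.8) falls into is my assumption, since the text only says "between."

```python
def describe_correlation(r):
    """Return (share of variance accounted for, rough label) for a correlation r."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation must lie between -1.0 and +1.0")
    share = r * r        # coefficient of determination
    size = abs(r)        # the sign gives direction, not strength
    if size < 0.3:
        label = "weak or non-existent"
    elif size < 0.6:
        label = "weak to moderately strong"
    elif size < 0.8:
        label = "moderately strong to strong"
    else:
        label = "very strong"
    return share, label

for r in (-0.73, 0.58, 0.2):
    share, label = describe_correlation(r)
    print(f"r = {r:+.2f}: {share:.0%} of the variance, {label}")
```

Because the sign drops out when you square, a correlation of -0.73 accounts for more of the variance than a correlation of +0.58.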
That’s only part of the story, though. To decide if a correlation is good, you also have to plot the data and conduct a statistical test.
Plots—You should always plot the data used to calculate a correlation to ensure that the coefficient adequately represents the relationship. The magnitude of r is very sensitive to the presence of nonlinear trends and outliers. Nonlinear trends in the data cause the magnitude of the relationship to be underestimated. You can often use transformations to straighten any nonlinear patterns you see (http://statswithcats.wordpress.com/2010/11/21/fifty-ways-to-fix-your-data/). Outliers (i.e., data values not representative of the population) that are located perpendicular to the data trend cause the relationship to be underestimated. Outliers parallel to the data trend cause the relationship to be overestimated.
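A plot is still the best check, but a small numeric sketch shows why it matters. The data below are made up: the first case is an exponential curve that a log transformation straightens, and the second is a patternless cloud of five points plus one far-off point that lines up with nothing but itself.

```python
import math

def pearson_r(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Nonlinear trend: y is an exact exponential function of x
x = list(range(1, 11))
y_curved = [math.exp(0.5 * v) for v in x]
r_raw = pearson_r(x, y_curved)                          # understates the relationship
r_log = pearson_r(x, [math.log(v) for v in y_curved])   # after straightening

# Outlier: five unrelated points, then the same points plus one distant value
cloud_x = [1, 2, 3, 4, 5]
cloud_y = [2, 1, 3, 1, 2]
r_cloud = pearson_r(cloud_x, cloud_y)
r_with_outlier = pearson_r(cloud_x + [20], cloud_y + [20])

print(f"curved: r = {r_raw:.3f}, after log transform: r = {r_log:.3f}")
print(f"cloud: r = {r_cloud:.3f}, with one outlier: r = {r_with_outlier:.3f}")
```

In the second case a single point is enough to turn no correlation into a very large one, which is something you would spot immediately on a scatter plot.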
Tests—Every calculated correlation coefficient is an estimate. The “real” value may be somewhat more or somewhat less. You can conduct a statistical test to determine if the correlation you calculated is different from zero. If it’s not, there is no evidence of a relationship between your variables. This test looks at the absolute value of the correlation coefficient and the number of data pairs used to calculate it. The larger the value of the correlation and the greater the number of data pairs, the more likely the correlation will be significantly different from zero. For example, a correlation of 0.5 would be significantly greater than zero based on about 11 data pairs but a correlation of 0.1 wouldn’t be significantly different from zero with 380 data pairs. That’s why all statistical software outputs the number of data pairs and the test probability with a correlation. With some software, you can also calculate a confidence interval around your estimate to see if the interval includes the value you set as a goal. But one way or the other, you have to consider the variability of your calculated estimate to decide if the correlation is good.
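If your software doesn't report a test or a confidence interval, here is a rough sketch of one way to get both. It uses the Fisher z transformation with a normal approximation, which is my choice for keeping the example in the standard library; most packages use an exact t-test instead, so expect small differences, especially with few data pairs. The function name and the two-sided default are assumptions.

```python
import math
from statistics import NormalDist

def fisher_z_test(r, n, confidence=0.95):
    """Approximate two-sided p-value (H0: correlation is zero) and confidence interval."""
    if n <= 3:
        raise ValueError("the approximation needs more than 3 data pairs")
    if not -1.0 < r < 1.0:
        raise ValueError("r must be strictly between -1 and +1")
    z = math.atanh(r)                  # Fisher z transformation of r
    se = 1.0 / math.sqrt(n - 3)        # approximate standard error of z
    normal = NormalDist()              # standard normal distribution
    p_two_sided = 2.0 * (1.0 - normal.cdf(abs(z / se)))
    crit = normal.inv_cdf(0.5 + confidence / 2.0)
    # Build the interval on the z scale, then transform back to the r scale
    interval = (math.tanh(z - crit * se), math.tanh(z + crit * se))
    return p_two_sided, interval

for r, n in ((0.5, 11), (0.1, 380)):
    p, (low, high) = fisher_z_test(r, n)
    print(f"r = {r}, n = {n}: p = {p:.3f}, 95% interval = ({low:.3f}, {high:.3f})")
```

Whether a borderline result counts as significant depends on the significance level you pick and on whether the test is one-sided ("greater than zero") or two-sided ("different from zero"), so treat values near the cutoff with care. The interval is often more informative than the p-value: a wide interval tells you the estimate is too uncertain to compare against whatever goal you set.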
Correlation coefficients have a few other pitfalls to be aware of. For example, the value of a multiple or partial correlation coefficient may not necessarily meet your definition of a good correlation even if it is significantly different from zero. That’s because the calculated values will tend to be inflated if there are many variables but only a few data pairs, hence the need for that shrunken correlation coefficient. Then there’s the paradox that a large correlation isn’t necessarily a good thing. If you are developing a statistical model and find that your predictor variables are highly correlated with your dependent variable, that’s great. But if you find that your predictor variables are highly correlated with each other, that’s not good, and you’ll have to deal with this multicollinearity in your analysis. Finally, if you’re calculating many correlation coefficients from a large data set, you might find that the number of data pairs is different for each calculation because of missing data. Some statisticians believe it is acceptable to compare correlations calculated with different numbers of data pairs and other statisticians believe it is unwarranted, nonsensical, dishonest, fraudulent, heinous, and sickeningly evil.
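Two of those pitfalls, correlated predictors and unequal numbers of data pairs, can be checked at the same time. This sketch computes every pairwise correlation with pairwise deletion (dropping a pair whenever either value is missing) and reports n alongside each r. The variable names, the data, and the use of `None` for missing values are all hypothetical.

```python
import math
from itertools import combinations

def pearson_r(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def pairwise_correlations(columns):
    """Map each pair of column names to (r, number of complete data pairs)."""
    results = {}
    for a, b in combinations(columns, 2):
        # Keep only the rows where both variables were actually measured
        pairs = [(u, v) for u, v in zip(columns[a], columns[b])
                 if u is not None and v is not None]
        xs = [u for u, _ in pairs]
        ys = [v for _, v in pairs]
        results[(a, b)] = (pearson_r(xs, ys), len(pairs))
    return results

data = {
    "y":  [3.1, 4.0, 5.2, 6.1, 6.9, 8.2],      # dependent variable
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],      # predictor
    "x2": [2.1, None, 6.2, 7.9, None, 12.1],   # predictor with missing values
}

results = pairwise_correlations(data)
for (a, b), (r, n) in results.items():
    print(f"{a} vs {b}: r = {r:.3f} from n = {n} data pairs")
```

Here the two predictors are almost perfectly correlated with each other, which is the multicollinearity warning sign, and the correlations involving x2 rest on fewer data pairs than the one that doesn't. Printing n next to every r at least keeps that difference in plain view.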
What to Look for in Correlations.
What makes a good correlation, then, depends on what your expectations are, the value of the estimate, whether the estimate is significantly different from zero, and whether the data pairs form a linear pattern without any unrepresentative outliers. You have to consider correlations on a case-by-case basis. Remember too, though, that “no relationship” may also be an important finding.