Secrets of Good Correlations

If you’ve ever seen a correlation coefficient, you’ve probably looked at the number and wondered, is that good? Is a correlation of -0.73 good but not a correlation of +0.58? Just what is a good correlation and what makes a correlation good?

Negative Feline Correlation

The strength of the linear relationship between two variables is usually expressed by the Pearson product-moment correlation coefficient, denoted by r. Pearson correlation coefficients range in value from -1.0 to +1.0, where:

-1.0 represents a perfect correlation in which all measured points fall on a line having a negative slope

No Feline Correlation

0.0 represents absolutely no linear relationship between the variables

+1.0 represents a perfect correlation of points on a line having a positive slope.

Positive Feline Correlation

If you have a dataset with more than one variable, you’ll want to look at correlation coefficients.
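To make the three reference values concrete, here is a minimal sketch of the Pearson calculation in plain Python. The function name `pearson_r` and the tiny made-up datasets are my own for illustration; in practice you would let your statistics package do this.

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 pairs")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squared deviations from the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when either variable is constant")
    return sxy / math.sqrt(sxx * syy)

# Made-up illustrative data, not from any real study
x = [1, 2, 3, 4, 5]
falling = [10, 8, 6, 4, 2]    # every point on a line with negative slope
rising = [3, 5, 7, 9, 11]     # every point on a line with positive slope
u_shape = [4, 1, 0, 1, 4]     # clearly related to x, but not linearly

print(pearson_r(x, falling))  # -1.0
print(pearson_r(x, rising))   # 1.0
print(pearson_r(x, u_shape))  # 0.0
```

The U-shaped case is worth a second look: r is zero even though y is completely determined by x, because r only measures the linear part of a relationship.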

The Pearson correlation coefficient is used when both variables are measured on a continuous (i.e., interval or ratio) scale. There are several variations of the Pearson product-moment correlation coefficient. The multiple correlation coefficient, denoted by R, indicates the strength of the relationship between a dependent variable and two or more independent variables. The partial correlation coefficient indicates the strength of the relationship between a dependent variable and one or more independent variables with the effects of other independent variables held constant. The adjusted or shrunken correlation coefficient indicates the strength of a relationship between variables after correcting for the number of variables and the number of data points. There are also correlation coefficients for variables measured on noncontinuous scales. The Spearman R (also written as Spearman’s rho), for instance, is computed from ordinal-scale ranks.
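As a rough illustration of "computed from ordinal-scale ranks," the sketch below gets a Spearman coefficient by ranking each variable and then running the ordinary Pearson calculation on the ranks. The helper names (`average_ranks`, `spearman_r`) and the data are hypothetical, and ties are handled with average ranks, which is a common convention but an assumption here.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def average_ranks(values):
    """Rank values 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with the one at position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_r(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson_r(average_ranks(xs), average_ranks(ys))

# Made-up data: y always increases with x, but along a curve (y = x cubed)
x = [1, 2, 3, 4, 5]
y = [1, 8, 27, 64, 125]

print(average_ranks([10, 20, 20, 30]))  # [1.0, 2.5, 2.5, 4.0]
print(pearson_r(x, y))                  # less than 1: the trend is curved
print(spearman_r(x, y))                 # 1.0: the rank order agrees perfectly
```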

Types of Correlation Coefficients.

So, what is a good correlation? It depends on who you ask.

  • I once asked a chemist who was calibrating a laboratory instrument to a standard what value of the correlation coefficient she was looking for. “0.9 is too low. You need at least 0.98 or 0.99.” She got the number from a government guidance document.
  • I once asked an engineer who was conducting a regression analysis of a treatment process what value of the correlation coefficient he was looking for. “Anything between 0.6 and 0.8 is acceptable.” His college professor told him this.
  • I once asked a biologist who was conducting an ANOVA of the size of field mice living in contaminated versus pristine soils what value of the correlation coefficient he was looking for. He didn’t know, but his cutoff was 0.2 based on the smallest size difference his model could detect with the number of samples he had.

Is 0.2 a good correlation or does a good correlation have to be at least 0.6 or even 0.98? As it turns out, the chemist, the engineer, and the biologist were all right. Those correlations were all good for those uses. So, the meaningfulness of a correlation coefficient depends, in part, on the expectations of the person using it.

But how do you know what value of a correlation coefficient you should expect for it to be good? One answer is to look at the square of the correlation coefficient, called the coefficient of determination, R-square, or just R². R-square is an estimate of the proportion of variance in the dependent variable that is accounted for by the independent variable(s). It is used commonly to interpret the strength of the relationship between variables and to compare alternative statistical models.

You might be able to decide how good your correlation is from a gut feel for how much of the variability you wanted a relationship to account for. For example, correlation coefficient values between approximately -0.3 and +0.3 account for less than 9 percent of the variance, which might indicate a weak or non-existent relationship. Values between -0.3 and -0.6 or +0.3 and +0.6 account for 9 percent to 36 percent of the variance, which might indicate a weak to moderately strong relationship. Values between -0.6 and -0.8 or +0.6 and +0.8 account for 36 percent to 64 percent of the variance, which might indicate a moderately strong to strong relationship. Values between -0.8 and -1.0 or +0.8 and +1.0 account for more than 64 percent of the variance, which might indicate a very strong relationship.
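The arithmetic behind those bands is just squaring r. The sketch below does that and attaches the rough labels from the paragraph above. The function names are made up, and how the exact boundary values (0.3, 0.6, 0.8) are assigned to a band is my assumption, since the bands are only approximate anyway.

```python
def variance_explained(r):
    """R-square: the proportion of variance accounted for by a correlation r."""
    return r * r

def rough_strength(r):
    """Rule-of-thumb label for r; the sign does not matter, only the size."""
    size = abs(r)
    if size < 0.3:
        return "weak or non-existent"
    if size < 0.6:
        return "weak to moderately strong"
    if size < 0.8:
        return "moderately strong to strong"
    return "very strong"

for r in (-0.73, 0.58, 0.2, 0.98):
    print(f"r = {r:+.2f}  R-square = {variance_explained(r):.2f}  {rough_strength(r)}")
```

Run on the two values from the opening question, this shows -0.73 accounting for about 53 percent of the variance and +0.58 for about 34 percent, so the negative correlation is the stronger of the two.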

That’s only part of the story, though. To decide if a correlation is good, you also have to do two other things: plot the data and conduct a statistical test.

Plots: You should always plot the data used to calculate a correlation to ensure that the coefficient adequately represents the relationship. The magnitude of r is very sensitive to the presence of nonlinear trends and outliers. Nonlinear trends in the data cause the strength of the relationship to be underestimated. You can often use transformations to straighten any nonlinear patterns you see (http://statswithcats.wordpress.com/2010/11/21/fifty-ways-to-fix-your-data/). Outliers (i.e., data values not representative of the population) that are located perpendicular to the data trend cause the relationship to be underestimated. Outliers parallel to the data trend cause the relationship to be overestimated.
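Two small made-up datasets show why the plot matters. This is only a sketch: the data are invented to make the effects obvious, and `pearson_r` is a hand-rolled helper, not a library function.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# 1) A curved trend: y doubles with every step in x, an exact relationship.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2 ** v for v in x]
r_raw = pearson_r(x, y)                          # understates the relationship
r_log = pearson_r(x, [math.log2(v) for v in y])  # log transform straightens it

# 2) A 3-by-3 grid of points with no relationship, then one far-off point.
grid_x = [1, 2, 3, 1, 2, 3, 1, 2, 3]
grid_y = [1, 1, 1, 2, 2, 2, 3, 3, 3]
r_grid = pearson_r(grid_x, grid_y)                   # no linear relationship
r_outlier = pearson_r(grid_x + [20], grid_y + [20])  # one point inflates r

print(f"curved trend, raw data:    r = {r_raw:.2f}")
print(f"curved trend, log2 of y:   r = {r_log:.2f}")
print(f"grid only:                 r = {r_grid:.2f}")
print(f"grid plus one outlier:     r = {r_outlier:.2f}")
```

In the first case the transformation recovers a perfect correlation that the raw data hide. In the second, a single unrepresentative point turns no correlation at all into what looks like a very strong one, which a scatterplot would reveal at a glance.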

Tests: Every calculated correlation coefficient is an estimate. The “real” value may be somewhat more or somewhat less. You can conduct a statistical test to determine if the correlation you calculated is different from zero. If it’s not, there is no evidence of a relationship between your variables. This test looks at the absolute value of the correlation coefficient and the number of data pairs used to calculate it. The larger the value of the correlation and the greater the number of data pairs, the more likely the correlation will be significantly different from zero. For example, a correlation of 0.5 would be significantly greater than zero based on about 11 data pairs but a correlation of 0.1 wouldn’t be significantly different from zero with 380 data pairs. That’s why statistical software typically reports the number of data pairs and the test probability along with a correlation. With some software, you can also calculate a confidence interval around your estimate to see if the interval includes the value you set as a goal. But one way or the other, you have to consider the variability of your calculated estimate to decide if the correlation is good.
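Here is a sketch of how the size of r and the number of data pairs combine into a test probability and a confidence interval. It uses the Fisher z transformation with a normal approximation, which is my choice for keeping the example in plain Python. Statistical software usually reports an exact t-based probability instead, so borderline cases can land on either side of a cutoff depending on the method, the significance level, and whether the test is one-sided or two-sided. The function name and the 95 percent default are assumptions.

```python
import math
from statistics import NormalDist

def fisher_z_test(r, n, conf=0.95):
    """Approximate two-sided p-value (null: true correlation is zero)
    and confidence interval for a correlation r based on n data pairs."""
    if not -1 < r < 1 or n < 4:
        raise ValueError("need -1 < r < 1 and at least 4 data pairs")
    z = math.atanh(r)           # Fisher transformation of r
    se = 1 / math.sqrt(n - 3)   # approximate standard error of z
    std_normal = NormalDist()
    p_two_sided = 2 * (1 - std_normal.cdf(abs(z) / se))
    z_crit = std_normal.inv_cdf(0.5 + conf / 2)
    # Build the interval on the z scale, then transform back to the r scale
    ci = (math.tanh(z - z_crit * se), math.tanh(z + z_crit * se))
    return p_two_sided, ci

for r, n in [(0.5, 11), (0.5, 100), (0.1, 380), (0.1, 1000)]:
    p, (low, high) = fisher_z_test(r, n)
    print(f"r = {r}, n = {n}: p = {p:.4f}, 95% CI = ({low:+.2f}, {high:+.2f})")
```

Notice how the interval narrows as n grows. If you have a target value in mind (say, the chemist’s 0.98), checking whether the interval contains it tells you more than the point estimate alone.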

Correlation coefficients have a few other pitfalls to be aware of. For example, the value of a multiple or partial correlation coefficient may not necessarily meet your definition of a good correlation even if it is significantly different from zero. That’s because the calculated values will tend to be inflated if there are many variables but only a few data pairs, hence the need for that shrunken correlation coefficient. Then there’s the paradox that a large correlation isn’t necessarily a good thing. If you are developing a statistical model and find that your predictor variables are highly correlated with your dependent variable, that’s great. But if you find that your predictor variables are highly correlated with each other, that’s not good, and you’ll have to deal with this multicollinearity in your analysis. Finally, if you’re calculating many correlation coefficients from a large data set, you might find that the number of data pairs is different for each calculation because of missing data. Some statisticians believe it is acceptable to compare correlations calculated with different numbers of data pairs and other statisticians believe it is unwarranted, nonsensical, dishonest, fraudulent, heinous, and sickeningly evil.
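To see how much inflation a few data pairs and many variables can produce, here is a sketch using one common adjustment for R-square. The post doesn’t say which shrinkage formula it has in mind, so treat the formula, the function name, and the example numbers as assumptions for illustration.

```python
def adjusted_r_squared(r_squared, n_pairs, n_predictors):
    """Adjusted (shrunken) R-square using a common textbook adjustment:
    1 - (1 - R2) * (n - 1) / (n - p - 1)."""
    dof = n_pairs - n_predictors - 1
    if dof <= 0:
        raise ValueError("need more data pairs than predictors plus one")
    return 1 - (1 - r_squared) * (n_pairs - 1) / dof

# Same multiple correlation (R = 0.8, so R-square = 0.64) with 5 predictors
few = adjusted_r_squared(0.64, n_pairs=12, n_predictors=5)
many = adjusted_r_squared(0.64, n_pairs=120, n_predictors=5)

print(f"12 data pairs:  adjusted R-square = {few:.2f}")
print(f"120 data pairs: adjusted R-square = {many:.2f}")
```

With only 12 data pairs the adjusted value drops to about half of the unadjusted 0.64, while with 120 pairs it barely moves.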

What to Look for in Correlations.

What makes a good correlation, then, depends on what your expectations are, the value of the estimate, whether the estimate is significantly different from zero, and whether the data pairs form a linear pattern without any unrepresentative outliers. You have to consider correlations on a case-by-case basis. Remember too, though, that “no relationship” may also be an important finding.

Read more about using statistics at the Stats with Cats blog. Join other fans at the Stats with Cats Facebook group and the Stats with Cats Facebook page. Order Stats with Cats: The Domesticated Guide to Statistics, Models, Graphs, and Other Breeds of Data Analysis at Wheatmark, amazon.com, barnesandnoble.com, or other online booksellers.

About statswithcats

Charlie Kufs has been crunching numbers for over thirty years. He currently works as a statistician.

21 Responses to Secrets of Good Correlations

  5. Ellie K says:

    I regret that I have no handbags, shoes or other products to offer in my comment. However, I did enjoy your Taxonomy of Correlation Coefficients Table very much! This was my very first exposure to ordinal, even binary flavored correlation coefficients.

    The table formatted comparisons were particularly useful. Thank you.

  7. Rajesh Chaudhary says:

    Great article dude. Nicely explained. Thanks for sharing. I will be visiting this page soon again. :)

  8. I got here via google ... user 1 says:

    Thank you. That was useful, especially the different sciences vs. acceptable value.

  9. SN says:

    Hi Charlie,
    This post is really useful for me as I am cracking my head on the interpretation of correlation coefficient. Like what you have mentioned, different people have different views on good correlations and at last I am confused. I need to put down the references on my scientific paper, so would you mind to share with me literatures that you referred to for the interpretation of correlation coefficient on “correlation coefficient values between approximately -0.3 and +0.3 account for less than 9 percent of the variance in the relationship between two variables, which might indicate a weak or non-existent relationship. Values between -0.3 and -0.6 or +0.3 and +0.6 account for 9 percent to 36 percent of the variance, which might indicate a weak to moderately strong relationship. Values between -0.6 and -0.8 or +0.6 and +0.8 account for 36 percent to 64 percent of the variance, which might indicate moderately strong to strong relationship. Values between -0.8 and -1.0 or +0.8 and +1.0 account for more than 64 percent of the variance, which might indicate very strong relationship.”
    Much appreciated.

    sn

  11. stolzyblog says:

    Thanks for this useful post. I included a link to you from this recent post:
    http://hearingtheoracle.com/2013/07/02/another-r-layering-example/
    Just FYI.

  13. Maria Memy says:

    Hello Charlie

    I too enjoyed the cat pictures! I have a question which may indicate how little I remember statistics from my college years. I work in a clinical laboratory. We are required to “correlate” analyzers which produce the same type of results. I came across your blog while trying to get information about what the leeway should be. I know 10% agreement is normally used, but I wasn’t sure if 20% would be too much. I am familiar with the coefficient of variation; however, we use % agreement between devices. Every time I “google” correlation, I get coefficient of variation. HELP!!!

  16. box says:

    Hmm is anyone else having problems with the images on this blog loading?
    I’m trying to find out if its a problem on my end or if it’s the blog.
    Any feedback would be greatly appreciated.

  17. Anne says:

    Just wanted to say thanks for making someone temporarily frustrated with a data situation smile!!
