Content Reviewers: Rishi Desai, MD, MPH
Correlation is a statistical technique that shows whether two quantitative variables are related, and also how strongly they’re related.
For example, let’s say you want to figure out if drinking more soda is correlated with having a higher body mass index or BMI.
So, you ask 100 people how many sugary beverages they drink in a week and then check each person’s height and weight to calculate their BMI.
You could plot these measurements or data points on a scatterplot, with the number of beverages on the x-axis and BMI on the y-axis, and where each data point represents one individual.
Typically, a trendline is drawn to best represent the pattern of data points on the plot, with roughly half the points above the line and half the points below the line.
Now, a positive correlation means that BMI increases as the number of beverages increases, and, if the two variables have a perfect positive correlation, then the trendline will pass through every single data point.
Now imagine that there’s a negative correlation. That means that the BMI decreases as the number of beverages increases, and, with a perfect negative correlation, the trendline also passes through every data point.
Finally, if there’s no correlation, then the data points will be randomly spread out all over the scatterplot, and the trendline will be flat with no positive or negative direction.
To figure out how strongly two variables are correlated, we can use Pearson’s correlation test, which is a parametric test that measures how closely the data points cluster around the trendline.
Pearson’s correlation test calculates a correlation coefficient, which is a number that represents how strongly the two variables are correlated, and it’s usually written with a lowercase r.
The correlation coefficient can range from negative 1 - which represents perfect negative correlation - to positive 1 - which represents perfect positive correlation.
A correlation coefficient of 0 means that there’s no linear correlation between the two variables.
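To make the two ends of that range concrete, here’s a minimal Python sketch that computes a correlation coefficient straight from its textbook definition. The function name pearson_r and the tiny data sets are made up for illustration; they aren’t from the soda example.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient from the deviation-score definition."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How x and y vary together, and how much each varies on its own.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Illustrative (made-up) data: every point falls exactly on a straight line.
xs = [1, 2, 3, 4, 5]
rising = [2 * x + 1 for x in xs]   # y goes up as x goes up
falling = [10 - x for x in xs]     # y goes down as x goes up

r_rising = pearson_r(xs, rising)
r_falling = pearson_r(xs, falling)
print(round(r_rising, 3), round(r_falling, 3))
```

The first value comes out at 1 and the second at negative 1, because in both cases the trendline passes through every data point.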
Now, there are four key assumptions that are used in a Pearson’s correlation test, and an acronym for these assumptions is L-I-N-E, or LINE.
First, the relationship between the two variables has to be Linear, which means that the trendline drawn to represent the data points is a straight line.
Relationships that follow a curve, like an exponential relationship or a U-shaped relationship, can have a misleadingly low correlation coefficient, because a straight trendline doesn’t match the shape of the data points.
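As a quick illustration of why linearity matters, the sketch below feeds a perfectly U-shaped relationship into a hand-written correlation function. The helper pearson_r and the data are assumptions made up for this example. Even though y is completely determined by x, the coefficient comes out at zero.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient from the deviation-score definition."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Made-up U-shaped data: y is exactly x squared, symmetric around x = 0.
xs = [-2, -1, 0, 1, 2]
ys = [x ** 2 for x in xs]

r_u_shape = pearson_r(xs, ys)
print(r_u_shape)  # the left and right halves of the U cancel each other out
```

So a coefficient near zero rules out a straight-line relationship, but not every kind of relationship, which is why it helps to look at the scatterplot first.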
The second assumption is that each individual in the sample was recruited Independently from other individuals in the sample.
In other words, no individuals influenced whether or not any other individual was included in the study.
For example, if one person agrees to be in the study only if their friend can also be included in the study, then these two individuals would not be independent of each other and the second assumption would not be met.
Independent recruitment helps ensure that the people included in the study have similar characteristics to the target population, so that the results of the test can be applied to the target population - meaning the study has good external validity!
Additionally, the sample population must have been recruited randomly, for example by choosing 100 names at random from a list of all names in the target population.
Like independent recruitment, random sampling is important because it helps ensure that the sample population approximates the target population.
The third assumption is that the errors, or the differences between the observed and predicted values of y, are Normally distributed at each value of x.
That might sound confusing - no worries. Let’s break it down, by reviewing the trendline, which shows a predicted y-value for a specific x-value.
For example, let’s say we have a trendline that looks like this. According to our trendline, for people who drink 2 sugary beverages per week, we predict that their BMI will be around 23. But in reality, some people who drink 2 beverages will have a BMI that’s higher or lower than 23, simply because of individual differences, like how old a person is or how much a person exercises.
The distance a point is from the prediction line is called error, and points that are far from the line are said to have a high error.
Okay, so the third assumption says that the errors at a given value of x will follow a normally distributed bell curve, meaning most people will have low error, so their BMI will be clustered near the trendline; and fewer people will have high error, so their BMI will be much higher or lower than the trendline predicts.
And this is true for every value of x, or for each number of sugary beverages per week.
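Here’s a small sketch of what “error” means in practice. The predicted BMI of 23 at 2 beverages per week comes from the example above, but the observed BMIs are hypothetical numbers invented just to show the subtraction.

```python
# Trendline prediction for people who drink 2 sugary beverages per week.
predicted_bmi = 23.0

# Hypothetical observed BMIs for five such people (made up for illustration).
observed_bmis = [21.0, 22.5, 23.0, 23.5, 25.0]

# Error = observed value minus the value the trendline predicts.
errors = [observed - predicted_bmi for observed in observed_bmis]
print(errors)
```

In this made-up group, most errors sit close to zero and only a couple are larger, which is the kind of pattern the normality assumption expects at each value of x.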
The fourth assumption is that the data points must have Equal variance, which is also called homoscedasticity. Basically, this means that the data points are equally spread out from the trendline for every value of x.
If the data points are closer to the trendline on one end and further away on the other end, then the assumption of equal variance is not met and the data are heteroscedastic.
So going back to the Pearson’s correlation test, it starts with two hypotheses.
The null hypothesis states that r equals zero, or in other words, there is no correlation between the two variables.
And the alternate hypothesis is that r does not equal zero, or that there is a correlation between the two variables.
To test these hypotheses, we have to calculate the correlation coefficient, and there are five steps for doing this.
The first step is to find the sum of each of the x-values, and in this example, that’s the number of beverages.
As a simple example, we’ll use a sample size of 5 people, and let’s say the number of beverages each person consumes per week is 3, 5, 8, 10, and 12. Oftentimes, the sum is written with the Greek letter sigma, so the sum of all the x-values is 3 plus 5 plus 8 plus 10 plus 12, which equals 38. So our sample population of 5 people drinks a combined total of 38 sugary beverages per week.
The second step is to find the sum of each of the y-values, and in this example, that’s the BMI for each person.
So, let’s say the five individuals had BMIs of 24, 23, 28, 29, and 35, and when we add them all together we get 139. This means that our sample population of 5 people has a combined BMI of 139.
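The first two steps can be checked with a couple of lines of Python. The variable names here are just illustrative; the numbers are the five beverage counts and five BMIs from the example.

```python
# x-values: sugary beverages per week for each of the 5 people.
beverages = [3, 5, 8, 10, 12]
# y-values: BMI for the same 5 people, in the same order.
bmis = [24, 23, 28, 29, 35]

sum_x = sum(beverages)  # step 1: sum of the x-values
sum_y = sum(bmis)       # step 2: sum of the y-values
print(sum_x, sum_y)     # 38 139
```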
The third step is to find the sum of each of the squared values of x and the sum of each of the squared values of y.
So, when we square each x-value, we get 9, 25, 64, 100, and 144, and when we add them all together we get 342. When we square each y-value, we get 576, 529, 784, 841, and 1225, and when we add them all together we get 3955.
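The third step can be sketched the same way, using the example’s five beverage counts and BMIs. The variable names are assumptions chosen for readability.

```python
beverages = [3, 5, 8, 10, 12]   # x-values from the example
bmis = [24, 23, 28, 29, 35]     # y-values from the example

# Square every value first, then add the squares together.
x_squared = [x ** 2 for x in beverages]
y_squared = [y ** 2 for y in bmis]

sum_x_squared = sum(x_squared)
sum_y_squared = sum(y_squared)
print(x_squared, sum_x_squared)
print(y_squared, sum_y_squared)
```

Re-adding the squares by machine like this is a quick way to catch arithmetic slips in a hand calculation.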
The fourth step is to find the sum of x times y.
To do this, we multiply the value of x for the first person by the value of y for the first person, so 3 times 24 is 72. Then, we repeat that for the other four people - so, 5 times 23 is 115, 8 times 28 is 224, 10 times 29 is 290, and 12 times 35 is 420. Finally, we add those values together to get 1121.
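And the fourth step, as a short sketch with the same example data. The key detail is that each person’s x-value is paired with that same person’s y-value before multiplying, which is what zip does here. Variable names are illustrative.

```python
beverages = [3, 5, 8, 10, 12]   # x-values from the example
bmis = [24, 23, 28, 29, 35]     # y-values from the example

# Multiply each person's x by that same person's y, then add the products.
products = [x * y for x, y in zip(beverages, bmis)]
sum_xy = sum(products)
print(products, sum_xy)  # [72, 115, 224, 290, 420] 1121
```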