Linear regression

Let’s say you want to figure out if smoking more cigarettes leads to a lower forced expiratory volume, or FEV, which is the total amount of air, in liters, that a person can exhale in a single forced breath.

And let’s say that healthy men generally have an FEV of around 4 liters.

Now, to figure out if men who smoke have a lower FEV, you might ask 100 men how many cigarettes they smoke in a day, and then measure each man’s FEV.

You could plot these measurements, or data points, on a scatterplot, where each data point represents one individual, with the number of cigarettes, which is the exposure, on the x-axis, and FEV, which is the outcome, on the y-axis.

Typically, a linear trend line, or model, is drawn to represent the pattern of data points on the plot.

Theoretically, there are lots of lines that can be drawn to represent the data points, but the best trend line is the one with the smallest amount of error, which is a measurement of how far away an individual data point is from the trend line.

Usually, we look at the squared error, which is the vertical distance between the data point and the line, squared.

For example, if a data point is very close to the trend line, then the squared error is small.

On the other hand, if a data point is very far from the trend line, then the squared error is large.

If you add up the squared errors of all the data points for a line, you get the total squared error, and a line with a smaller total squared error is considered a better fit for the data than a line with a larger total squared error.
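
As a rough sketch of that comparison, the snippet below adds up the squared errors for two candidate trend lines. The six data points and both candidate lines are made up for illustration; they are not measurements from any study, and the function name is just a placeholder.

```python
# Hypothetical data, invented for illustration:
# cigarettes per day and FEV (liters) for six men.
cigarettes = [0, 5, 10, 15, 20, 25]
fev = [4.1, 3.6, 3.0, 2.4, 2.1, 1.4]

def total_squared_error(slope, intercept, xs, ys):
    """Sum of squared vertical distances between each point and the line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

# Two candidate trend lines (both are assumptions, picked by eye).
error_a = total_squared_error(-0.10, 4.0, cigarettes, fev)
error_b = total_squared_error(-0.05, 4.0, cigarettes, fev)

print(f"line A total squared error: {error_a:.4f}")
print(f"line B total squared error: {error_b:.4f}")
print("better fit:", "A" if error_a < error_b else "B")
```

With these made-up numbers, line A has the smaller total squared error, so it would be considered the better fit of the two.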

Now, when two variables are linearly related, we might want to know specifically what happens to the outcome variable when the exposure variable changes.

For example, we might want to know how a person’s FEV changes if they smoke five cigarettes per day compared to if they smoke ten cigarettes per day.

To figure this out, we have to use linear regression, which is a statistical method that calculates an equation for the best fitting trend line for a set of data points.
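
For one x-variable, the best fitting line can be calculated directly from the data. Here is a minimal sketch, assuming the same kind of made-up data (six invented men, not real measurements) and using the standard least-squares formulas for the slope and intercept. The names fit_line and predict are placeholders.

```python
# Hypothetical data, invented for illustration:
# cigarettes per day and FEV (liters) for six men.
cigarettes = [0, 5, 10, 15, 20, 25]
fev = [4.1, 3.6, 3.0, 2.4, 2.1, 1.4]

def fit_line(xs, ys):
    """Least-squares slope and intercept for a single x-variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

slope, intercept = fit_line(cigarettes, fev)

def predict(x):
    """Predicted FEV for someone who smokes x cigarettes per day."""
    return intercept + slope * x

# How does predicted FEV differ between 5 and 10 cigarettes per day?
difference = predict(10) - predict(5)

print(f"FEV = {intercept:.2f} + ({slope:.3f}) * cigarettes")
print(f"predicted change from 5 to 10 cigarettes per day: {difference:.2f} liters")
```

Because the line is straight, the predicted change between 5 and 10 cigarettes is simply five times the slope.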

Specifically, in this case, we would use simple linear regression, since we only have two variables—one y-variable, which is the FEV measurement, and one x-variable, which is the number of cigarettes.

This is different from multiple linear regression, where there’s one y-variable and two or more x-variables.

Typically, multiple linear regression is used to control for confounding variables, or x-variables that distort the true relationship between the main exposure variable and the outcome variable.

For example, age is a confounding variable in the relationship between smoking and FEV, because older people tend to have a lower FEV than younger people, and age may also be related to how much a person smokes.

So, let’s say you wanted to figure out how FEV changes for people who smoke more cigarettes per day, controlling for age.

In this example, there are two x-variables—the number of cigarettes a person smokes and a person’s age—so, you’d have to use multiple linear regression.
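
To see what controlling for age can look like, here is a small sketch that fits FEV on cigarettes alone and then on cigarettes and age together. Everything in it is an assumption made for illustration: the six data points are invented, FEV is generated from a made-up formula with no noise so that the expected answer is known, and the helper names solve and fit are placeholders. It solves the least-squares normal equations by hand to stay dependency-free.

```python
# Hypothetical data, invented for illustration. FEV is generated from an
# assumed formula (no noise) so we know what the regression should recover:
# FEV = 5.0 - 0.05 * cigarettes - 0.02 * age
cigarettes = [0, 5, 10, 15, 20, 25]
age = [25, 30, 40, 45, 55, 60]
fev = [5.0 - 0.05 * c - 0.02 * a for c, a in zip(cigarettes, age)]

def solve(matrix, rhs):
    """Solve a small linear system by Gaussian elimination with pivoting."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    solution = [0.0] * n
    for r in reversed(range(n)):
        known = sum(aug[r][c] * solution[c] for c in range(r + 1, n))
        solution[r] = (aug[r][n] - known) / aug[r][r]
    return solution

def fit(columns, ys):
    """Least-squares coefficients [intercept, b1, b2, ...] via normal equations."""
    rows = [[1.0] + list(values) for values in zip(*columns)]
    k = len(rows[0])
    xtx = [[sum(row[i] * row[j] for row in rows) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * y for row, y in zip(rows, ys)) for i in range(k)]
    return solve(xtx, xty)

unadjusted = fit([cigarettes], fev)      # simple regression: cigarettes only
adjusted = fit([cigarettes, age], fev)   # multiple regression: cigarettes and age

print(f"cigarette slope, ignoring age:        {unadjusted[1]:.3f}")
print(f"cigarette slope, controlling for age: {adjusted[1]:.3f}")
```

In this invented data set the older men also smoke more, so the cigarettes-only slope is steeper than the slope used to generate the data. Adding age as a second x-variable separates the two effects.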

Now, there are four key assumptions that are used in linear regression, and an acronym for these assumptions is L-I-N-E, or LINE.

First, the relationship between the two variables has to be Linear, which means that the trendline drawn to represent the data points is a straight line.

Relationships that have a different type of curve, like in an exponential relationship or a U-shaped relationship, are poorly described by a straight trendline and can have a low correlation coefficient, because a straight line doesn’t match the shape of the data points.
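
A quick way to see this is to compare the correlation coefficient for a perfectly straight pattern with a perfectly U-shaped one. The sketch below uses made-up x and y values and a hand-written Pearson correlation function (the name pearson_r is a placeholder).

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Invented values for illustration only.
x = [-3, -2, -1, 0, 1, 2, 3]
straight = [2 * v + 1 for v in x]   # points lie exactly on a straight line
u_shaped = [v ** 2 for v in x]      # points lie exactly on a U-shaped curve

r_straight = pearson_r(x, straight)
r_u_shaped = pearson_r(x, u_shaped)

print(f"straight line: r = {r_straight:.2f}")
print(f"U-shape:       r = {r_u_shaped:.2f}")
```

In both cases y is completely determined by x, but only the straight pattern gets a correlation coefficient near 1. For the symmetric U-shape, the coefficient comes out at essentially zero.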

The second assumption is that the observations are Independent, which means that each individual in the sample was recruited independently of the other individuals in the sample.

In other words, no individuals influenced whether or not any other individual was included in the study.

For example, if one person agrees to be in the study only if their friend can also be included in the study, then these two individuals would not be independent of each other and the second assumption would not be met.

Independent recruitment helps ensure that the people included in the study have similar characteristics to the target population, and that the results of the study can be applied to the target population, meaning the study has good external validity!

Additionally, the sample population must have been recruited randomly, for example by choosing 100 names at random from a list of all names in the target population.

Like independent recruitment, random sampling is important because it helps ensure that the sample population approximates the target population.

The third assumption is that the errors, meaning the differences between the observed and predicted values of y, are Normally distributed at each value of x. That might sound confusing, but no worries.

Let’s break it down, by reviewing the trendline, which shows a predicted y-value for a specific x-value.

For example, let’s say we have a trendline that looks like this. According to our trendline, for people who smoke 10 cigarettes per day, we predict that their FEV will be around 3 liters.

But in reality, some people that smoke 10 cigarettes per day will have an FEV that’s higher or lower than 3, simply because of individual differences, like how old a person is or how much a person exercises.

Now, remember that the distance a point is from the prediction line is called error, and points that are far from the line are said to have a high error.

Okay, so the third assumption says that the errors at a given value of x will follow a normally distributed bell curve centered on the trendline, meaning most people will have low error, so their FEV will be clustered near the trendline; and fewer people will have high error, so their FEV will be much higher or lower than the trendline. And this is true for every value of x, or for each number of cigarettes smoked per day.
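
One common way to write the third assumption as an equation is sketched below. The notation is an assumption of this sketch: a is the slope and b is the y-intercept of the trendline, x_i and y_i are the number of cigarettes and the FEV for person i, ε_i is that person’s error, and σ is the standard deviation of the bell curve.

```latex
y_i = a x_i + b + \varepsilon_i, \qquad \varepsilon_i \sim \mathcal{N}(0, \sigma^2)
```

For the people who smoke 10 cigarettes per day, where the trendline predicts an FEV of about 3, this says their observed FEVs are spread in a bell curve centered on 3.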

The fourth assumption is that the data points must have Equal variance, which is also called homoscedasticity.

Basically, this means that the data points are equally spread out from the trendline for every value of x.

If the data points are closer to the trendline on one end and further away on the other end, then the assumption of equal variance is not met and the data are heteroscedastic.
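
One crude way to check this, sketched below, is to compare how spread out the errors are at the low end of the x-axis with how spread out they are at the high end. The two lists of errors are invented for illustration, the helper name spread_ratio is a placeholder, and this is only an informal check rather than a formal statistical test.

```python
from statistics import pvariance

# Hypothetical errors (observed FEV minus predicted FEV), invented for
# illustration and listed in order of increasing cigarettes per day.
even_spread = [0.2, -0.2, 0.1, -0.1, 0.2, -0.2, 0.1, -0.1]
fanning_out = [0.05, -0.05, 0.1, -0.1, 0.3, -0.3, 0.6, -0.6]

def spread_ratio(errors):
    """Variance of the errors in the upper half of x divided by the lower half."""
    half = len(errors) // 2
    return pvariance(errors[half:]) / pvariance(errors[:half])

ratio_even = spread_ratio(even_spread)
ratio_fanning = spread_ratio(fanning_out)

print(f"even spread: ratio = {ratio_even:.1f}")
print(f"fanning out: ratio = {ratio_fanning:.1f}")
```

A ratio close to 1 is what equal variance would look like in this rough check. A ratio far from 1, as in the second list, suggests the points fan out on one end and the data may be heteroscedastic.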

Key Takeaways

Linear regression is a mathematical technique used to estimate the relationship between two variables. It is used when the relationship between the two variables is linear (meaning that it follows a straight line).

The linear regression equation looks like this: y = ax + b, where y is the predicted value of the dependent variable (the one we are trying to predict), x is the independent variable (the one used to make the prediction), a is the slope of the line, and b is the y-intercept.

If we know the values of a and b, we can use the equation to predict the value of y for any given x. We can also use linear regression to determine how strong (or weak) the relationship between x and y actually is.
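
As a small illustration, the sketch below plugs assumed values of a and b into the equation to make a prediction, then measures how well that line describes a made-up data set using R-squared, computed as 1 minus the total squared error divided by the total squared deviation of y from its mean. The values of a and b, the data points, and the function name are all assumptions for this example.

```python
# Assumed slope and intercept (illustrative values, not estimated from real data).
a = -0.1   # change in FEV (liters) per extra cigarette per day
b = 4.0    # predicted FEV (liters) for someone who smokes 0 cigarettes per day

def predict(x):
    """Predicted y for a given x using y = ax + b."""
    return a * x + b

# Hypothetical data, invented for illustration.
cigarettes = [0, 5, 10, 15, 20, 25]
fev = [4.1, 3.6, 3.0, 2.4, 2.1, 1.4]

mean_fev = sum(fev) / len(fev)
squared_error = sum((y - predict(x)) ** 2 for x, y in zip(cigarettes, fev))
total_variation = sum((y - mean_fev) ** 2 for y in fev)
r_squared = 1 - squared_error / total_variation

print(f"predicted FEV at 10 cigarettes per day: {predict(10):.1f} liters")
print(f"R-squared for this line: {r_squared:.3f}")
```

An R-squared close to 1 means the line accounts for most of the variation in y, which is one way to describe a strong linear relationship.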
