Linear regression

Questions

A primary care physician is studying the relationship between a patient’s height in centimeters and body weight in kilograms. Data from 300 patients were obtained and found to have a linear relation. The following regression model was obtained:  

y(x) = -130 + 1.2x  
r-coefficient = 0.85  

According to this model, which of the following would be the most likely body weight for an individual 180 cm in height?  

Transcript

Let’s say you want to figure out if smoking more cigarettes leads to a lower forced expiratory volume, or FEV, which is the volume of air, in liters, that a person can forcibly exhale in a set amount of time, usually the first second of a forced breath.

And let’s say that healthy men generally have an FEV of around 4 liters.

Now, to figure out if men who smoke have a lower FEV, you might ask 100 men how many cigarettes they smoke in a day, and then measure each man’s FEV.

You could plot these measurements, or data points, on a scatterplot, with the number of cigarettes, which is the exposure, on the x-axis, and FEV, which is the outcome, on the y-axis, and where each data point represents one individual.

Typically, a linear trend line, or model, is drawn to represent the pattern of data points on the plot.

Theoretically, there are lots of lines that can be drawn to represent the data points, but the best trend line is the one with the smallest amount of error, which is a measurement of how far away an individual data point is from the trend line.

Usually, we look at the squared error, which is the vertical distance between the data point and the line, squared.

For example, if a data point is very close to the trend line, then the squared error is small.

On the other hand, if a data point is very far from the trend line, then the squared error is large.

If you add up all the squared errors for a line, you get the total squared error, and a line with a smaller total squared error is considered a better fit for the data than a line with a larger total squared error.
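To make the comparison concrete, here is a minimal sketch that adds up the squared errors for two candidate trend lines. The data values, the two candidate lines, and the function name `total_squared_error` are all made up for illustration.

```python
def total_squared_error(xs, ys, slope, intercept):
    """Sum of squared vertical distances between each data point and the line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical measurements (not real study data):
# cigarettes smoked per day and the FEV measured for five men
cigarettes = [0, 5, 10, 15, 20]
fev = [4.0, 3.8, 3.5, 3.4, 3.0]

# Two candidate trend lines that share an intercept but differ in slope
error_a = total_squared_error(cigarettes, fev, slope=-0.05, intercept=4.0)
error_b = total_squared_error(cigarettes, fev, slope=-0.10, intercept=4.0)

# The line with the smaller total squared error is the better fit
better_line = "A" if error_a < error_b else "B"
print(error_a, error_b, better_line)
```

With these made-up numbers, line A has a total squared error of about 0.025 and line B about 2.15, so line A is the better fit of the two.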

Now, when two variables are linearly related, we might want to know specifically what happens to the outcome variable when the exposure variable changes.

For example, we might want to know how a person’s FEV changes if they smoke five cigarettes per day compared to if they smoke ten cigarettes per day.

To figure this out, we have to use linear regression, which is a statistical method that calculates the equation of the best-fitting trend line for a set of data points.
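As a rough sketch of what that calculation can look like with one x-variable, the slope and intercept that minimize the total squared error have a standard closed form (ordinary least squares). The data and the function name `fit_simple_linear_regression` below are hypothetical and only for illustration.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (slope, intercept) of the least-squares line for one x-variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How x and y vary together, and how x varies on its own
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The least-squares line always passes through (mean_x, mean_y)
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical measurements (not real study data)
cigarettes = [0, 5, 10, 15, 20]
fev = [4.0, 3.8, 3.5, 3.4, 3.0]

slope, intercept = fit_simple_linear_regression(cigarettes, fev)

# Predicted change in FEV when going from five to ten cigarettes per day
fev_change = slope * (10 - 5)
print(slope, intercept, fev_change)
```

With this made-up data, the fitted slope is about -0.048 and the intercept about 4.02, so the model predicts an FEV about 0.24 lower at ten cigarettes per day than at five.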

Specifically, in this case, we would use simple linear regression, since we only have two variables—one y-variable, which is the FEV measurement, and one x-variable, which is the number of cigarettes.

This is different from multiple linear regression, where there’s one y-variable and two or more x-variables.

Typically, multiple linear regression is used to control for confounding variables, or x-variables that distort the true relationship between the main exposure variable and the outcome variable.

For example, age is a confounding variable in the relationship between smoking and FEV, because older people tend to have a lower FEV than younger people, and age can also be related to how much a person smokes.
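Here is one way to sketch the idea of controlling for age. The data are invented so that FEV depends on both cigarettes and age, and so that older men in the sample also tend to smoke more. The helper names and the choice to solve the normal equations by Gaussian elimination are assumptions made for this illustration, not a prescribed method.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix so row operations apply to both sides at once
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution, from the last row upward
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x


def fit_linear_regression(predictor_columns, ys):
    """Least-squares coefficients [intercept, b1, b2, ...] via the normal equations."""
    rows = [[1.0] + list(values) for values in zip(*predictor_columns)]
    k = len(rows[0])
    xtx = [[sum(row[i] * row[j] for row in rows) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * y for row, y in zip(rows, ys)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Hypothetical data, constructed as FEV = 5.0 - 0.02 * cigarettes - 0.02 * age
ages = [20, 30, 40, 50, 60, 70]
cigarettes = [0, 10, 5, 15, 10, 20]
fev = [4.6, 4.2, 4.1, 3.7, 3.6, 3.2]

# Simple regression: cigarettes only, so age is not accounted for
crude = fit_linear_regression([cigarettes], fev)
# Multiple regression: cigarettes and age together
adjusted = fit_linear_regression([cigarettes, ages], fev)

print("slope for cigarettes, ignoring age:", crude[1])
print("slope for cigarettes, controlling for age:", adjusted[1])
```

In this constructed example, the cigarette slope is about -0.064 when age is ignored and about -0.02 once age is included, which is the value built into the data. The gap between the two slopes is the distortion introduced by the confounder.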

Summary

Linear regression is a mathematical technique used to estimate the relationship between two variables. It is used when the relationship between the two variables is linear (meaning that it follows a straight line).

The linear regression equation looks like this: y = ax + b, where y is the predicted value of the dependent (outcome) variable, x is the independent variable (the one used to make the prediction), a is the slope of the line, and b is the y-intercept.

If we know the values of a and b, we can use the equation to predict the value of y for any given x. We can also use linear regression to determine how strong (or weak) the relationship between x and y actually is.
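As a small worked example, the sketch below plugs a height of 180 cm into the model from the practice question above, y(x) = -130 + 1.2x, and squares the reported r-coefficient of 0.85. The function name `predict_weight_kg` is made up for illustration.

```python
def predict_weight_kg(height_cm, slope=1.2, intercept=-130.0):
    """Predicted body weight (kg) from height (cm) using y = intercept + slope * x."""
    return intercept + slope * height_cm


# Prediction for a person who is 180 cm tall
predicted_weight = predict_weight_kg(180)

# Squaring r gives the share of the variation in y that the line accounts for
r = 0.85
r_squared = r ** 2

print(predicted_weight, r_squared)
```

The prediction works out to about 86 kg. An r of 0.85 gives an r squared of about 0.72, which means that roughly 72% of the variation in weight is accounted for by height in this model, a fairly strong linear relationship.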