Transcript for Logistic regression
Logistic regression is a type of statistical method that’s used to describe the relationship between an outcome variable and one or more exposure variables.
In logistic regression, the outcome variable is always categorical, and the exposure variables can be either categorical or quantitative.
For example, let’s say you want to figure out if smoking more cigarettes increases the chance of having a heart attack. In this case, the number of cigarettes is a quantitative exposure and whether or not a person has a heart attack is a categorical outcome.
Now, to figure this out, you might ask 200 people how many cigarettes they smoke in a day, and then follow that group of people for five years and see who has a heart attack and who doesn’t.
You could organize your data in a table like this—where the first column, or variable, is the number of cigarettes a person smokes, the second column is if they had a heart attack or not, and the rest of the columns are other characteristics, or variables, that you collected about each person, like their age, sex, and body mass index, or BMI.
Usually, for binary variables, like yes or no, we use the numbers zero and 1 to represent the two possible answers.
So, for the heart attack variable, we might say that zero represents “no” and 1 represents “yes”. We could do the same thing for sex, where zero represents females and 1 represents males.
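As a rough sketch of what this zero-and-1 coding might look like in code (the variable names and values here are made up for illustration):

```python
# Hypothetical study data: each row is one participant.
# Binary variables are coded 0/1: heart_attack (0 = no, 1 = yes),
# sex (0 = female, 1 = male).
participants = [
    {"cigarettes_per_day": 15, "heart_attack": 1, "age": 54, "sex": 1, "bmi": 27.3},
    {"cigarettes_per_day": 0,  "heart_attack": 0, "age": 61, "sex": 0, "bmi": 24.1},
    {"cigarettes_per_day": 8,  "heart_attack": 0, "age": 47, "sex": 1, "bmi": 31.0},
]

# One payoff of 0/1 coding: counting the "yes" answers is just a sum.
n_heart_attacks = sum(p["heart_attack"] for p in participants)
```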
Now, let’s just look at the first two variables, so how many cigarettes they smoke and if they had a heart attack or not. You could plot these measurements, or data points, on a scatterplot, with the number of cigarettes on the x-axis, and heart attack on the y-axis, and where each data point represents one individual.
This scatterplot might seem a little funny looking, and that’s because all of the data points are clustered on two points on the y-axis—they’re either on the zero, which represents no, or the 1, which represents yes.
This scatterplot can help us figure out how the odds of having a heart attack change as people smoke more and more cigarettes.
And that’s the goal of logistic regression.
Now, in statistics, probability and odds are often confused with one another, so let’s break down the difference.
The probability is the number of times an outcome happened divided by the number of times the outcome could have happened, and it’s often represented by a capital P.
So, using our data, we could figure out the probability of having a heart attack for each number of cigarettes smoked per day.
Let’s say the range for the number of cigarettes smoked is between zero and 19, so we can break the scatterplot up into 20 different sections - and it’s 20 sections instead of just 19 because zero is also a section.
Now, to find the probability of having a heart attack in a specific section, we count up the number of people who had heart attacks in that section and divide it by the total number of people in that section.
As an example, let’s say there are 10 people in the 15-cigarette section - or in other words, there are 10 people in the study that smoke 15 cigarettes per day - and of those 10 people, 6 people had heart attacks. So, the probability of having a heart attack for people that smoke 15 cigarettes per day is 6 over 10, or 60%.
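In code, that probability calculation is just one count divided by another (a minimal sketch using the made-up numbers from this example):

```python
# Outcomes for the 10 hypothetical people who smoke 15 cigarettes per day:
# 1 = had a heart attack, 0 = did not.
section_15 = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]

# P = (number of times the outcome happened) / (number of times it could have)
p_heart_attack = sum(section_15) / len(section_15)
print(p_heart_attack)  # 0.6, i.e. 60%
```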
Now let’s switch gears and talk about odds. Odds compare the probability of an outcome occurring with the probability of an outcome not occurring, so the equation for odds is P divided by 1 minus P.
So, in our example, we’d use the probability of having a heart attack - which is 60%, or 0.6 - compared with the probability of not having a heart attack, which is 1 minus 0.6, or 0.4. If we divide both sides by 0.4 we get 1.5 compared to 1.
To make it easier to interpret, let’s multiply both sides by two, so we get a 3 to 2 ratio. In other words, out of all the people that smoke 15 cigarettes per day, there are 3 people that have heart attacks for every 2 people that don’t have heart attacks.
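The same arithmetic can be checked in code - using exact fractions here so the 3-to-2 ratio falls out directly instead of a rounded decimal:

```python
from fractions import Fraction

p = Fraction(6, 10)   # probability of a heart attack: 6 out of 10 people
odds = p / (1 - p)    # odds = P / (1 - P)
print(odds)           # 3/2, i.e. 3 people with heart attacks for every 2 without
```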
Okay, coming back to the scatterplot... typically, in linear regression, a linear trend line is drawn to represent the pattern of data points on the plot, and that line is represented by the equation: y-hat equals b0 plus b1x1, where y-hat is the estimated value for the outcome variable; x1 is the value of the exposure variable; b0 represents the y-intercept; and b1 represents the slope of the line.
But in logistic regression, the trend line looks a bit different. That’s because the data points for logistic regression aren’t arranged in a straight line, so a linear trend line isn’t a good fit, or representation, of the data.
Instead, the trend line for logistic regression is curved, and specifically, it’s an S-shaped curve. And the equation for this S-shaped curve is P equals e, raised to the power of b0 plus b1x1, divided by 1 plus e, raised to the power of b0 plus b1x1.
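Written out as code, the S-shaped curve looks like this (the b0 and b1 values below are arbitrary, just to show the shape):

```python
import math

def logistic_probability(x1, b0, b1):
    """P = e^(b0 + b1*x1) / (1 + e^(b0 + b1*x1))"""
    z = b0 + b1 * x1
    return math.exp(z) / (1 + math.exp(z))

# The curve always stays between 0 and 1, starting low and rising in an S shape.
p_low = logistic_probability(0, b0=-3.0, b1=0.3)    # near 0 for few cigarettes
p_high = logistic_probability(20, b0=-3.0, b1=0.3)  # closer to 1 for many
```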
At this point, you might be wondering what trend lines have to do with probability and odds.
Well, the slope of the curve at a specific point represents how quickly the probability changes at that point - so where the curve is steep, smoking one more cigarette changes the probability a lot, and where the curve is shallow, it changes the probability only a little.
So with an S-shaped curve, the effect of each additional cigarette on the probability of an occurrence - which in this case, is the probability of having a heart attack - is different at various points along the line, which makes it pretty hard to interpret.
For example, toward the bottom of the curve the slope is shallow, so the probability changes slowly; toward the middle of the curve the slope is steep, so the probability changes quickly; and then at the top of the curve the slope is shallow again.
And this is a problem because typically, we want to draw a line that represents the whole sample, so that we can make conclusions about the population it came from.
To fix this problem, we have to transform, or change, our data points so that they’re arranged on the plot in a more linear pattern. To do this, we have to change our data from looking at probability to looking at odds, and we actually have to go one step further and change them to the logarithm of odds, or the log-odds.
So, if the basic equation for odds is P over 1 minus P, then to change our probability equation to an odds equation, we substitute the S-shaped curve in for P. That gives us: P over 1 minus P equals - e, raised to the power of b0 plus b1x1, over 1 plus e, raised to the power of b0 plus b1x1 - divided by - 1 minus e, raised to the power of b0 plus b1x1, over 1 plus e, raised to the power of b0 plus b1x1.
If you simplify this equation - which we won’t show here, just for simplicity’s sake - what you end up with is P over 1 minus P equals e, raised to the power of b0 plus b1x1. Okay, so that’s the odds equation.
And to change the odds equation to the log-odds, we just take the natural log of both sides - natural log, because the base of the odds equation is e. So, we end up with log of P over 1 minus P equals b0 plus b1x1.
This equation, as you might recognize, is the equation for a linear trend line. And, in fact, if we look at the newly transformed data, we’ll see that it’s now linearly arranged on the plot.
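A quick numerical check of that claim: if probabilities come from the S-shaped curve, their log-odds land exactly back on the straight line b0 plus b1x1 (the coefficients below are made up for illustration):

```python
import math

b0, b1 = -3.0, 0.3  # illustrative coefficients

checks = []
for x1 in [0, 5, 10, 15]:
    z = b0 + b1 * x1
    p = math.exp(z) / (1 + math.exp(z))   # probability from the S-shaped curve
    log_odds = math.log(p / (1 - p))      # transform to log-odds
    # The log-odds recovers b0 + b1*x1 (up to floating-point rounding).
    checks.append(abs(log_odds - z) < 1e-9)
```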
Whew! Sometimes the math can seem a little confusing, but luckily, statistical software does most of that work for us!
Basically, the software spits out values for b0 and b1, which we can then plug into our equation. For example, let’s say the software gives us a b0 of 0.05 and a b1 of 0.2, so the equation for the line would be: log of P over 1 minus P equals 0.05 plus 0.2 times x1.
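Plugging those example values in, we can compute the predicted log-odds for a given number of cigarettes, and then undo the transformation to get back to a probability (a sketch using the made-up b0 and b1 from the transcript):

```python
import math

b0, b1 = 0.05, 0.2   # example values reported by the statistical software

def predicted_log_odds(x1):
    # log(P / (1 - P)) = b0 + b1 * x1
    return b0 + b1 * x1

def predicted_probability(x1):
    # Invert the log-odds to recover the probability from the S-shaped curve.
    z = predicted_log_odds(x1)
    return math.exp(z) / (1 + math.exp(z))

log_odds_15 = predicted_log_odds(15)   # 0.05 + 0.2 * 15 = 3.05
p_15 = predicted_probability(15)       # about 0.95
```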