# Methods of regression analysis

#### Content Reviewers:

Rishi Desai, MD, MPH

#### Contributors:

Elizabeth Nixon-Shapiro, Kaitlyn Harper, Evan Debevec-McKenney

There are four basic types of statistical analyses commonly used in epidemiological research, and the analysis you pick depends on two main criteria.

The first criterion is the type of data you have, which can be either individual data or binned data, which is also called group data.

So, for example, let’s say we want to know how many people out of 100 developed lung cancer in the past 5 years.

With individual data, we have information about each person, so we can tell whether or not each of the 100 people developed lung cancer.

So let’s say that 6 people developed lung cancer. If we have individual data, we can look at the individual characteristics for each of those 6 people, like their sex, age, race, or past history of migraines, and we can compare them to the people that didn’t develop lung cancer.

On the other hand, if we have group data, we don’t actually know which specific individuals out of the 100 people developed lung cancer.

So even though we know that 6 people developed lung cancer, we don’t know which 6 people they were or any of their individual characteristics.
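To make the difference concrete, here is a minimal sketch of how the two kinds of data might be stored. The `Person` class, its field names, and the three sample records are made up for illustration; only the totals of 6 cases out of 100 people come from the example above.

```python
from dataclasses import dataclass


@dataclass
class Person:
    """One record of individual-level data (field names are illustrative)."""
    age: int
    sex: str
    developed_lung_cancer: bool


# Individual data: one record per person. Only three made-up records are shown.
individual_data = [
    Person(age=61, sex="F", developed_lung_cancer=True),
    Person(age=58, sex="M", developed_lung_cancer=False),
    Person(age=64, sex="M", developed_lung_cancer=False),
]

# Because every record is kept, cases can be compared with non-cases.
cases = [p for p in individual_data if p.developed_lung_cancer]
non_cases = [p for p in individual_data if not p.developed_lung_cancer]

# Group (binned) data: only the totals are known, not who the cases were.
group_data = {"n_people": 100, "n_lung_cancer": 6}
proportion = group_data["n_lung_cancer"] / group_data["n_people"]
print(proportion)  # 0.06
```

With the group data, the proportion of cases can still be calculated, but there is no way to look up the age or sex of the 6 people who developed lung cancer.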

The second criterion is the type of outcome or y-variable you’re measuring, which can be quantitative, categorical, or time to event.

Quantitative variables have a numeric value, like a person’s forced expiratory volume, which is the total amount of air, in liters, that a person can exhale in a single forced breath.

A very fit person might have an FEV of 5 liters, while a less fit person might have an FEV of 3 liters.

On the other hand, categorical variables have distinct levels.

For example, we could use a categorical variable to characterize if a person was diagnosed with lung cancer in the past five years or if they were not.

And finally, time to event variables describe how long a person was followed before the event or outcome occurred.

For example, if we started following a person at age 50 and they developed lung cancer at age 53, then their time to event would be 3 years.

Now, one of the simplest and most widely used types of analysis is linear regression.

Linear regression uses individual data, and the outcome variable is always quantitative, while the exposure variable can be either categorical or quantitative.

For example, let’s say we want to figure out if there’s an association between the number of cigarettes smoked and FEV, so we ask 100 people how many cigarettes they smoke in a day and then measure each person’s FEV. In this study, the exposure is the number of cigarettes, so it’s quantitative, and the outcome is FEV, which is also quantitative.

Typically, we use statistical software to calculate the linear equation, and the software will provide b0 and b1, which are two numbers we can then plug into the equation y-hat = b0 + b1x1.

Y-hat is the estimated value for the outcome variable, which in this case is FEV, and x1 is the value of the exposure variable, so in this case that’s the number of cigarettes a person smokes.

So let’s say the software gives us a b0 of 4 and a b1 of negative 0.1, so the equation is y-hat equals 4 minus 0.1 times x1.

Now, b1 is the most important number for interpretation because it tells us the effect size, or how much the outcome variable changes for every one-unit increase in the exposure variable.

For example, a b1 of negative 0.1 means that, on average, the FEV decreases by 0.1 liters for every one additional cigarette smoked per day.
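As a rough sketch of what the statistical software is doing, the snippet below evaluates the example equation and recovers b0 and b1 with ordinary least squares. The function names are hypothetical, and the data points are synthetic: they are generated from the example line itself, without noise, so they are not real measurements.

```python
def predict_fev(cigarettes_per_day, b0=4.0, b1=-0.1):
    """Estimated FEV in liters from y-hat = b0 + b1 * x1 (example coefficients)."""
    return b0 + b1 * cigarettes_per_day


def fit_line(xs, ys):
    """Ordinary least squares with one exposure variable; returns (b0, b1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx               # slope: change in FEV per extra cigarette
    b0 = mean_y - b1 * mean_x    # intercept: estimated FEV at zero cigarettes
    return b0, b1


# Synthetic, noise-free points that lie exactly on the example line.
cigarettes = [0, 5, 10, 20]
fev = [predict_fev(x) for x in cigarettes]

b0, b1 = fit_line(cigarettes, fev)
print(round(b0, 3), round(b1, 3))
print(predict_fev(10))  # 3.0
```

Because the points lie exactly on the line, the fit returns a b0 of about 4 and a b1 of about negative 0.1. With real measurements the points would scatter around the line, and the fitted values would only be estimates.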

One important thing to know is that linear regression can be used in any type of study design as long as the two criteria of individual data and quantitative outcome variable are met.

The next type of statistical analysis is logistic regression. Logistic regression uses individual data, and the outcome variable is always categorical while the exposure variables can be either categorical or quantitative.

For example, let’s say we want to figure out if smoking cigarettes increases the chance of lung cancer between the ages of 55 and 64. So, we follow a hundred 55-year-olds that smoke and a hundred 55-year-olds that don’t smoke for 10 years, and compare how many of them develop lung cancer.

In this example, the exposure variable is whether or not a person smokes cigarettes, so it’s categorical; and the outcome variable is whether or not the person develops lung cancer, so it’s also categorical.

And more specifically, because there are only two levels for each variable, they’re called binary categorical variables.

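Now, as in linear regression, the statistical software will give us b0 and b1, and we can plug them into the same equation, y-hat = b0 + b1x1, but here y-hat is the estimated log-odds of the outcome, so the interpretation of the beta-coefficients is different.

In logistic regression, b0 represents the log-odds of the outcome occurring in the unexposed group, and b1 represents the change in the log-odds for every one-unit increase in the exposure variable.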

For example, let’s say the software gives us a b0 of 0.05 and a b1 of 1.9, so the equation for the line would be y-hat equals 0.05 plus 1.9 times x1.

If we only look at b1, the effect size, it tells us how much the log-odds of the outcome differ between the exposed group, or the smokers, and the unexposed group, or the non-smokers.

So, a b1 of 1.9 means that, on average, the log-odds of developing lung cancer for smokers are 1.9 higher than the log-odds of developing lung cancer for non-smokers.

Since log-odds can be hard to interpret, we can also convert b1 to an odds ratio by exponentiating it, which means raising e to the power of b1.

For example, e to the 1.9 equals about 6.7, so the odds of developing lung cancer for smokers are 6.7 times the odds of developing lung cancer for non-smokers.
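The arithmetic above can be checked with a short sketch. It assumes the exposure is coded as 1 for smokers and 0 for non-smokers, which the example implies but does not state, and the function names are hypothetical. The `probability` helper is an extra step not covered above; it uses the standard conversion from odds to probability.

```python
import math

B0 = 0.05  # intercept from the example
B1 = 1.9   # coefficient for smoking from the example


def log_odds(smoker):
    """Estimated log-odds of lung cancer; smoker is 1 (smokes) or 0 (does not)."""
    return B0 + B1 * smoker


def odds(smoker):
    """Convert log-odds to odds by raising e to that power."""
    return math.exp(log_odds(smoker))


def probability(smoker):
    """Convert odds to a probability: p = odds / (1 + odds)."""
    o = odds(smoker)
    return o / (1 + o)


# The difference in log-odds between smokers and non-smokers is b1 ...
difference = log_odds(1) - log_odds(0)
# ... and the ratio of their odds is e raised to b1.
odds_ratio = odds(1) / odds(0)

print(round(difference, 2))   # 1.9
print(round(odds_ratio, 1))   # 6.7
```

Note that b0 drops out of both comparisons: the difference in log-odds and the odds ratio depend only on b1.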

Logistic regression can be used for any type of study, but the interpretation changes slightly depending on the study design.

Our example was a longitudinal cohort study, because we had a group of exposed individuals—those are the ones that smoked—and a group of unexposed individuals—those are the ones that didn’t smoke—and followed them over time.

This type of study design allows you to measure the incidence or the risk, which is the proportion of people that become new cases over a certain period of time.

Using logistic regression, we then calculate what’s called the risk odds ratio.
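For a single binary exposure, the risk odds ratio can also be worked out by hand from the cohort counts, and its natural log is the b1 that logistic regression would estimate. The sketch below uses hypothetical case counts (20 of the 100 smokers and 5 of the 100 non-smokers); these numbers are not from the example and are only there to show the calculation.

```python
import math


def risk(cases, total):
    """Risk: new cases divided by the number of people followed."""
    return cases / total


def odds_from_counts(cases, total):
    """Odds of the outcome: cases divided by non-cases."""
    return cases / (total - cases)


# Hypothetical cohort counts, for illustration only.
smokers_total, smokers_cases = 100, 20
nonsmokers_total, nonsmokers_cases = 100, 5

risk_smokers = risk(smokers_cases, smokers_total)
risk_nonsmokers = risk(nonsmokers_cases, nonsmokers_total)

# Risk odds ratio: odds of lung cancer in smokers over odds in non-smokers.
risk_odds_ratio = (
    odds_from_counts(smokers_cases, smokers_total)
    / odds_from_counts(nonsmokers_cases, nonsmokers_total)
)

# With one binary exposure, the logistic regression b1 is the log of this ratio.
b1 = math.log(risk_odds_ratio)

print(risk_smokers, risk_nonsmokers)
print(round(risk_odds_ratio, 2), round(b1, 2))
```

The risks can be computed here only because the whole cohort was followed forward in time, so the denominators are known.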

On the other hand, logistic regression can also be used in case-control studies, which is where you compare the history of two groups of people—those that have a certain outcome, called cases, and those that don’t have a certain outcome, called controls—to see if they’ve been exposed to different things.

So, for example, we could’ve looked at 100 people that had lung cancer, which would be the cases, and 100 people that don’t have lung cancer, which would be the controls, and then compare how many people in each group smoked cigarettes in the past ten years.

Now, in case-control studies, we can’t measure the incidence, since we’re selecting people that already have the outcome.

Instead, we’re measuring the prevalence of the exposure, or the proportion of people in each group that already smoked cigarettes before we started measuring them.

In case-control studies, we can use logistic regression to then calculate the prevalence odds ratio.