# Introduction to biostatistics


#### Content Reviewers:

Rishi Desai, MD, MPH, Yifan Xiao, MD

#### Contributors:

Evan Debevec-McKenney

Let’s say you want to figure out if people with high body mass index, or BMI, are at a higher risk of hypertension - or high blood pressure.

Let’s say that you decide to go out and find 100 people with hypertension and 100 people without hypertension and find out the BMI of each person in each group.

You might also collect other information about the individuals in each group, like how old they are, if they smoke cigarettes, or if they drink alcohol, since all of these factors can influence a person’s risk of hypertension.

All of these different pieces of information - called variables - can be put together into a single document or file, called a data set.

A data set usually includes independent variables, which are thought to influence or change the dependent variables.

In our example, the body mass index would be the independent variable and hypertension would be the dependent variable.
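As a minimal sketch of what such a data set could look like, the block below stores one record per person, with one entry per variable. Every value and every field name (`bmi`, `age`, `smoker`, `drinks_alcohol`, `hypertension`) is invented for illustration and does not come from a real study.

```python
# Hypothetical records: every value below is invented for illustration only.
data_set = [
    {"bmi": 31.2, "age": 58, "smoker": True, "drinks_alcohol": True, "hypertension": True},
    {"bmi": 27.5, "age": 64, "smoker": False, "drinks_alcohol": True, "hypertension": True},
    {"bmi": 22.1, "age": 41, "smoker": False, "drinks_alcohol": False, "hypertension": False},
    {"bmi": 20.8, "age": 35, "smoker": True, "drinks_alcohol": False, "hypertension": False},
]

# In the example, BMI is the independent variable and hypertension the dependent one.
independent_variable = "bmi"
dependent_variable = "hypertension"

# The remaining variables are the other factors that can influence hypertension risk.
other_variables = ["age", "smoker", "drinks_alcohol"]

# Print the independent and dependent variable for each person.
for person in data_set:
    print(person[independent_variable], person[dependent_variable])
```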

The process of collecting, organizing, and analyzing variables in a data set is called statistics, and when the data are collected from living things - like humans, aardvarks, algae, or bacteria - it’s called biostatistics, bio meaning life.

Now, there are two main types of biostatistics.

The first type is descriptive statistics, which is used to describe or summarize information about each individual variable in the data set.

Descriptive statistics can be used to find the mean - the average of the values of a particular variable, the median - the middle value when the values are sorted from lowest to highest, and the mode - the value that occurs most often in the variable.
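The three measures can be sketched with Python's standard `statistics` module. The five BMI values are made up for illustration; they are not data from the example study.

```python
import statistics

# Hypothetical BMI values for five people (illustrative only).
bmi_values = [21, 23, 23, 26, 32]

mean_bmi = statistics.mean(bmi_values)      # sum of the values divided by their count
median_bmi = statistics.median(bmi_values)  # middle value once the list is sorted
mode_bmi = statistics.mode(bmi_values)      # value that occurs most often

print(mean_bmi, median_bmi, mode_bmi)
```

Here the mean (25) is pulled above the median (23) by the single high value of 32, which is one reason the two are often reported together.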

The descriptive statistics of each variable can be calculated for the whole sample - all 200 people - or in each group separately - the 100 people in the group with hypertension or the other 100 people in the group without hypertension.

For example, we might find that the mean body mass index of all people in the study is 24.5, or that the mean body mass index is 28 for the group with hypertension and 21 for the group without hypertension.
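The whole-sample mean follows from the two group means, weighted by group size. A short sketch of that arithmetic, using the group sizes and means from the example (the variable names are chosen here for illustration):

```python
# Group sizes and group means taken from the example above.
n_hypertension, mean_bmi_hypertension = 100, 28
n_no_hypertension, mean_bmi_no_hypertension = 100, 21

# Weighted average: each group mean counts in proportion to its group size.
total_bmi = (n_hypertension * mean_bmi_hypertension
             + n_no_hypertension * mean_bmi_no_hypertension)
overall_mean_bmi = total_bmi / (n_hypertension + n_no_hypertension)

print(overall_mean_bmi)
```

Because the two groups are the same size, the overall mean of 24.5 sits exactly halfway between 28 and 21; with unequal groups it would shift toward the larger group's mean.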

We can also use descriptive statistics to find the range, variance, or standard deviation, all of which are ways of understanding how the data are spread out or distributed for a given variable.

For example, we might find that the lowest measured body mass index in the group with hypertension is 23, and the highest is 33, so the range for body mass index in this group is 23 to 33.
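A minimal sketch of the three measures of spread, again with the standard `statistics` module. The five values are hypothetical, chosen only so that the lowest is 23 and the highest is 33 as in the example; `statistics.variance` and `statistics.stdev` compute the sample versions, which divide by n - 1.

```python
import statistics

# Hypothetical BMI values for the group with hypertension (illustrative only).
bmi_hypertension = [23, 25, 28, 31, 33]

lowest, highest = min(bmi_hypertension), max(bmi_hypertension)
range_width = highest - lowest                        # range as a single number
variance_bmi = statistics.variance(bmi_hypertension)  # sample variance (divides by n - 1)
sd_bmi = statistics.stdev(bmi_hypertension)           # square root of the sample variance

print(f"range: {lowest} to {highest} (width {range_width})")
print(f"variance: {variance_bmi}, standard deviation: {sd_bmi:.2f}")
```

The range depends only on the two most extreme values, while the variance and standard deviation use every value's distance from the mean.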

Typically, descriptive statistics are reported in a graph or a table.

The second type of biostatistics is inferential statistics, which differs from descriptive statistics in two ways.

First, inferential statistics looks at relationships between two or more variables, instead of looking at each individual variable.

For example, we could use inferential statistics to explore the relationship between body mass index and hypertension.

We could categorize body mass index into two groups - above 25, or high, and below 25, or low - and we might find that people with high body mass indices have 3 times the odds of hypertension compared to people with low body mass indices.
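An odds ratio like this comes from a two-by-two table of BMI category against hypertension status. The sketch below uses hypothetical counts, chosen so that the result equals 3; they are not results from a real study, and `odds_ratio` is a helper name introduced here for illustration.

```python
def odds_ratio(exposed_cases, unexposed_cases, exposed_controls, unexposed_controls):
    """Odds ratio from a 2x2 table: (a * d) / (b * c)."""
    return (exposed_cases * unexposed_controls) / (unexposed_cases * exposed_controls)

# Hypothetical counts (illustrative only), 100 people per group.
high_bmi_with_hypertension = 75
low_bmi_with_hypertension = 25
high_bmi_without_hypertension = 50
low_bmi_without_hypertension = 50

result = odds_ratio(
    high_bmi_with_hypertension,
    low_bmi_with_hypertension,
    high_bmi_without_hypertension,
    low_bmi_without_hypertension,
)
print(result)
```

With these counts, the odds of high BMI are 75/25 = 3 in the group with hypertension and 50/50 = 1 in the group without, so the odds ratio is 3.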

Typically, inferential statistics are reported as measures of association such as relative risks, attributable risks, odds ratios, or hazard ratios.

Second, the two types have different goals. The goal of descriptive statistics is to describe how similar or different the study groups in a particular sample population are to one another.

For example, let’s say we use descriptive statistics to find that 72% of people in the group with hypertension are male, but only 16% of people in the group without hypertension are male.

This is an important finding because men tend to have slightly lower body mass indices than women.

As a result, having more men in the group with hypertension means that the average body mass index in that group will tend to be lower.

Ultimately, if the descriptive statistics find that the study groups are not very similar, we say that the study has low internal validity, and that the results found by inferential statistics may reflect these other differences between the two study groups rather than a true relationship between the variables.

On the other hand, the goal of inferential statistics is to apply the results from the sample population to a target population - which is usually just the general population.

So, inferential statistics is concerned with whether or not the two study groups are similar, as well as whether or not the sample population represents the target population.

Ideally, a study should be done on a sample population of individuals that is similar to the target population in every meaningful way.