# Introductory Biostatistics Notes

### Osmosis High-Yield Notes

This Osmosis High-Yield Note provides an overview of Introductory Biostatistics essentials. All Osmosis Notes are clearly laid-out and contain striking images, tables, and diagrams to help visual learners understand complex topics quickly and efficiently. Find more information about Introductory Biostatistics:

Introduction to biostatistics

Mean, median, and mode

Probability

Range, variance, and standard deviation

Types of data

## INTRODUCTION TO BIOSTATISTICS

osms.it/intro-biostatistics

▪ Statistics: process of collecting, organizing, analyzing data set variables
▪ Biostatistics: focus on data related to living things
▪ Descriptive statistics: summarizes, describes population information
▪ Inferential statistics: examines relationships between two/more variables → applies results of sample population to target population

Case (data point)
▪ Single observation (e.g. one individual visiting emergency room for influenza symptoms)

### POPULATION & SAMPLE

Population
▪ Group (people, specimens, events) with defined criteria (e.g. October–March emergency room visits)
▪ Parameter: numerical population description (e.g. range, mean, standard deviation)
  ▫ μ = population mean
  ▫ σ = population standard deviation

Sample
▪ Subset drawn from population (e.g. influenza-related October–March emergency room visits)
▪ Represents population → inferences can be made about population
▪ Statistic: numerical sample description (e.g. range, mean, standard deviation)
  ▫ x̄ = sample mean
  ▫ SD = sample standard deviation
▪ Sampling error: sample does not accurately reflect population
  ▫ Usually due to wide variation within sample
  ▫ ↑ sample size helps avoid sampling error
▪ Selection bias: sample does not accurately reflect population
  ▫ Occurs when precautions to obtain representative sample are not used
  ▫ Randomization helps eliminate bias

### TYPES OF HYPOTHESES

Null hypothesis (H₀)
▪ States that there is no relationship between variables
▪ Any observed relationship due to chance (e.g. no relationship between body mass index (BMI), hypertension)

Alternative hypothesis (research hypothesis)
▪ States expected relationship between variables (e.g. relationship between BMI, hypertension)

Hypothesis testing
▪ Statistical methods used to determine relationship strength between variables, how much of observed relationship is due to chance, significance of observations
▪ Statistical significance: observed relationship between variables is unlikely to be due to chance alone
▪ Usually defined by a p-value of < 0.05 (5%); “p” stands for “probability”
  ▫ Type 1 error: incorrectly rejecting a true null hypothesis (i.e. concluding significant relationship between variables when there is not)
  ▫ Type 2 error: incorrectly accepting (failing to reject) a false null hypothesis (i.e. concluding there is no significant relationship between variables, missing present association)
▪ Clinical significance: practical importance of study results; results may be clinically important without being statistically significant

### RELIABILITY & VALIDITY

▪ Characteristics of measurements used to collect data

Validity: accuracy
▪ Instrument actually measures variable (concept, construct) it is supposed to measure (e.g. urine dipstick accurately detects proteinuria)
▪ Valid instrument must be reliable

Reliability: repeatability
▪ Instrument consistently yields same results with repeated measurements (e.g. urine dipstick reliably detects proteinuria with each measurement)
▪ Reliable instrument may/may not be valid

### TYPES OF VARIABLES

▪ Variable: defined characteristic being studied; can assume different values
▪ Independent variable: manipulated (treatment) variable
▪ Dependent variable: outcome variable; influenced by independent variable
  ▫ What is effect of X (independent variable) on Y (dependent variable); how is X related to Y?
  ▫ E.g. what is the effect of lipid-lowering drug (X) on individual’s cholesterol level (Y)?

### GRAPHIC DESCRIPTION OF DATA

▪ When values are plotted on graph → variety of frequency distributions (curves) result
▪ Properties of distributions: central tendency, dispersion

Normal (Gaussian) curve
▪ Symmetrical distribution of scores around mean
  ▫ Forms classic bell shape
  ▫ About 95% of values lie within two standard deviations of mean
  ▫ Many natural phenomena approximate this type of distribution
  ▫ Parametric tests utilized in research

Non-Gaussian curve
▪ Asymmetrical distribution of scores around mean
  ▫ Skewed (negatively/positively) curve
  ▫ Kurtotic (flat/peaked) curve (leptokurtic: thin, positive kurtosis; platykurtic: flat, negative kurtosis)
  ▫ Nonparametric tests utilized in research

Figure 5.1 Visualization of normal (red), skewed (green) and kurtotic (blue and yellow) distributions.
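The Type 1 and Type 2 error definitions under hypothesis testing can be sketched in a few lines of Python. This is only an illustration: the function name `classify_decision`, its parameters, and the example p-values are hypothetical, and in a real study the truth about the null hypothesis is not known.

```python
def classify_decision(null_is_true: bool, p_value: float, alpha: float = 0.05) -> str:
    """Label a test outcome, assuming we somehow know whether H0 is really true.

    In real studies the truth is unknown; this helper is purely illustrative.
    """
    reject_null = p_value < alpha  # "statistically significant" at the chosen cutoff
    if reject_null and null_is_true:
        return "type 1 error"  # rejected H0 although no relationship exists
    if not reject_null and not null_is_true:
        return "type 2 error"  # kept H0 although a relationship exists
    return "correct decision"


# Hypothetical p-values for the BMI/hypertension example
print(classify_decision(null_is_true=True, p_value=0.03))   # type 1 error
print(classify_decision(null_is_true=False, p_value=0.20))  # type 2 error
print(classify_decision(null_is_true=False, p_value=0.01))  # correct decision
```

The 0.05 default mirrors the conventional cutoff mentioned in the notes; other cutoffs can be passed through `alpha`.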
Chapter 5 Biostatistics & Epidemiology: Introductory Biostatistics

## MEAN, MEDIAN, MODE

osms.it/mean-median-mode

▪ Central tendency measures
▪ More curve symmetry → more alike mean, median, mode

Mean (x̄)
▪ Central value calculated by adding each value in data set → dividing by total number of data points
▪ Expressed as formula: total sum of individual data points x₁, x₂, ..., xₙ, divided by n (number of data points)

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$

$$\frac{17 + 19 + 20 + 20 + 21 + 21 + 22}{7} = \frac{140}{7} = 20$$

▪ Can be influenced by an extreme value (outlier) → skewed data

Median
▪ Calculates central value when possible outliers present
▪ Divides set of data into two halves
  ▫ Half of values > median, half < median
▪ Most commonly used expression of central tendency
▪ Arrange data in order of magnitude → find midpoint
  ▫ 17 19 20 20 21 21 22 100
▪ Odd number of values → one “middle” number
▪ Even number of values → two middle values (20, 21)
  ▫ Calculate median by averaging two values: (20 + 21)/2 = 20.5

Mode
▪ Central value appearing most often in data sequence
  ▫ Bimodal (two modes), trimodal (three modes), amodal (no mode)
  ▫ 17 19 20 20 21 21 22 100
  ▫ Bimodal dataset with two mode values of 20, 21
▪ Not affected by outliers

Figure 5.2 Mean, median, and mode in a skewed curve.
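The worked numbers above can be checked with Python's standard `statistics` module. This is a minimal sketch; the variable names are illustrative, and the data are the values used in the notes, with and without the outlier of 100.

```python
import statistics

values = [17, 19, 20, 20, 21, 21, 22]
with_outlier = values + [100]  # same data plus one extreme value

mean_without = statistics.mean(values)            # 140 / 7
mean_with = statistics.mean(with_outlier)         # 240 / 8, pulled upward by the outlier
median_with = statistics.median(with_outlier)     # average of the two middle values
modes_with = statistics.multimode(with_outlier)   # every value tied for most frequent

print(mean_without, mean_with, median_with, modes_with)
```

The mean moves from 20 to 30 once the outlier is added, while the median (20.5) and the modes (20 and 21) stay close to the bulk of the data.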
## PROBABILITY

osms.it/probability

▪ Relative likelihood that event will/will not occur
▪ To calculate chance that event/outcome will occur → divide number of times event happened by number of times event could have happened
  ▫ E.g. event A is rolling a die and getting a three
  ▫ Since a die has six sides, there are six possible numbers, so the probability (P) of rolling a three is 1/6, or 0.167 (16.7%)

Figure 5.3 Probability of rolling a three on a six-sided die.

### RULES

Rule 1
▪ Probability of event A can range anywhere from 0% to 100%
  ▫ 0 ≤ P(A) ≤ 1

Rule 2
▪ Sum of probabilities of all possible outcomes = 1

Figure 5.4 Visualization of Rule 2.

Rule 3 (complement rule)
▪ Probability that event will not occur = 1 minus probability that it does occur
  ▫ P(not A) = 1 – P(A)

Figure 5.5 Probability of not rolling a three = 1 - P(rolling a three).

Rule 4
▪ Probability that either of two disjoint (mutually exclusive) events occurs = sum of the probability of the first event and the probability of the second event
  ▫ P(A or B) = P(A) + P(B)

Rule 5
▪ Probability that either of two not disjoint (not mutually exclusive) events occurs = sum of the probability of event A and the probability of event B, minus the probability of event A and B together
  ▫ P(A or B) = P(A) + P(B) – P(A and B)

Rule 6
▪ Probability that two independent events both occur = probability of the first event multiplied by the probability of the second event
  ▫ P(A and B) = P(A) x P(B)

Rule 7
▪ Conditional probability (probability of event A, given what happens in event B) = probability of event A and event B divided by probability of event B
  ▫ P(A given B) = P(A and B) / P(B)

Rule 8
▪ Probability of events A, B = probability of event A multiplied by conditional probability of event B given event A occurred
  ▫ P(A and B) = P(A) x P(B given A)
Figure 5.6 A visualization of the difference between mutually exclusive and not mutually exclusive events.

Figure 5.7 Rule 7, conditional probability: determining P(A given B) when event A depends on event B. In this case, we are finding the probability that the roll of two dice adds up to seven (event A) given that the first die is either a five or a six (event B). Once P(A and B) and P(B) are known, they are used to solve for P(A given B).
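The two-dice example from Figure 5.7 can be worked through by enumerating all 36 outcomes. The sketch below uses exact fractions; the helper names (`prob`, `sums_to_seven`, `first_is_5_or_6`) are illustrative and not taken from the notes.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of rolling two six-sided dice.
outcomes = list(product(range(1, 7), repeat=2))


def prob(event):
    """Favourable outcomes divided by possible outcomes."""
    favourable = sum(1 for o in outcomes if event(o))
    return Fraction(favourable, len(outcomes))


def sums_to_seven(o):       # event A
    return o[0] + o[1] == 7


def first_is_5_or_6(o):     # event B
    return o[0] in (5, 6)


p_a = prob(sums_to_seven)
p_b = prob(first_is_5_or_6)
p_a_and_b = prob(lambda o: sums_to_seven(o) and first_is_5_or_6(o))
p_a_or_b = prob(lambda o: sums_to_seven(o) or first_is_5_or_6(o))

p_not_a = 1 - p_a                 # Rule 3: complement
p_a_given_b = p_a_and_b / p_b     # Rule 7: conditional probability
p_b_given_a = p_a_and_b / p_a

print(p_a, p_b, p_a_and_b, p_a_given_b)
print(p_a_or_b == p_a + p_b - p_a_and_b)   # Rule 5 holds
print(p_a_and_b == p_a * p_b_given_a)      # Rule 8 holds
```

In this particular example P(A given B) equals P(A), so knowing that the first die is a five or a six does not change the chance of a total of seven; the two events happen to be independent, and Rule 6 applies as well.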
## RANGE, VARIANCE, & STANDARD DEVIATION

osms.it/range-variance-standard-deviation

▪ Measures of dispersion (spread) of variable values

Range
▪ Difference between highest, lowest value
▪ E.g. range of individuals’ cholesterol levels (in mg/dL)
  ▫ 130, 150, 152, 158, 165, 289, 354
  ▫ Range = 354 - 130 = 224 mg/dL
▪ E.g. individual weight (in kg)
  ▫ 10, 45, 50, 55, 90
  ▫ Range = 90 - 10 = 80 kg

Variance
▪ Sum of squared deviations from mean, divided by number of data points (n)

$$\sigma^2 = \frac{\sum (x - \bar{x})^2}{n}$$

▪ E.g. variance of individual weight (in kg); mean = 50 kg
  ▫ [(10 - 50)² + (45 - 50)² + (50 - 50)² + (55 - 50)² + (90 - 50)²] / 5 = 3250 / 5 = 650 kg²

Standard deviation (SD)
▪ Square root of variance

$$\sigma = \sqrt{\frac{\sum (x - \bar{x})^2}{n}}$$

  ▫ E.g. SD of individual weight: √650 = 25.5 kg
▪ In Gaussian curve
  ▫ 68-95-99.7 rule: 68% of data points lie within 1 SD from mean; 95% lie within 2 SD; 99.7% lie within 3 SD
▪ Z-score = number of SD data point is away from mean
  ▫ Data point minus the population mean, divided by the population standard deviation

$$z = \frac{x - \mu}{\sigma}$$

  ▫ E.g. blood glucose population mean = 90 mg/dL, SD = 20 mg/dL, data point = 130 mg/dL: (130 - 90) / 20 = 2
▪ Coefficient of variation (CV) = SD/mean; also expressed as percentage, obtained by multiplying the CV by 100
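The weight and blood glucose examples can be reproduced with Python's standard `statistics` module. This is a sketch under one stated assumption: because the formulas in these notes divide by n, it uses the population functions `pvariance` and `pstdev` (the sample versions `variance` and `stdev` divide by n - 1 and would give different numbers). The `z_score` helper and variable names are illustrative.

```python
import statistics
from statistics import NormalDist

weights_kg = [10, 45, 50, 55, 90]

value_range = max(weights_kg) - min(weights_kg)   # highest minus lowest
mean_kg = statistics.mean(weights_kg)
variance = statistics.pvariance(weights_kg)       # divides by n, as in the notes
sd = statistics.pstdev(weights_kg)                # square root of the variance
cv = sd / mean_kg                                 # coefficient of variation


def z_score(x, mu, sigma):
    """Number of standard deviations that x lies from the mean."""
    return (x - mu) / sigma


z = z_score(130, 90, 20)  # blood glucose example

# Share of a normal distribution lying within k SD of the mean, for k = 1, 2, 3.
within = {k: NormalDist().cdf(k) - NormalDist().cdf(-k) for k in (1, 2, 3)}

print(value_range, mean_kg, variance, round(sd, 1), round(cv * 100), z)
print({k: round(share, 4) for k, share in within.items()})
```

The last line prints shares of roughly 68%, 95%, and 99.7%, which is where the rule of thumb for the Gaussian curve comes from.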
## TYPES OF DATA

osms.it/types-of-data

▪ Determining type of data to be collected helps establish which sort of distributions can logically be used to describe variable

Nominal data
▪ Can assume one of a limited number of possible values (e.g. ABO blood types)
  ▫ No meaningful rank order; no median, mean, standard deviation; mode used for analysis
  ▫ Includes dichotomous variables (e.g. normal, abnormal)

Ordinal data
▪ Ordered in meaningful way (e.g. systolic murmur ranking from 1–6)
  ▫ Follows order, but quantitative differences not clear (do not indicate degree of difference between observations)
  ▫ Median, mode can be used; mean usually not suitable to describe sample/population

Continuous data
▪ Can take on infinite number of values (e.g. weight, height, blood glucose)
  ▫ Mean, median, mode, standard deviation can be calculated

Interval data
▪ Indicates meaningful quantitative difference between two values; values can be placed in clear, logical order
  ▫ E.g. temperature on Celsius/Fahrenheit scale; difference between 90° and 60° measured as 30°
  ▫ Arbitrary zero point
  ▫ Mean, median, mode, standard deviation can be calculated

Ratio data
▪ Has absolute, meaningful zero point
▪ Can use multiplication, division, addition, subtraction to calculate ratios
▪ Mean, median, mode can be calculated using ratio data

Discrete data
▪ Measured in whole numbers (no decimal values)
  ▫ E.g. number of pregnancies

Figure 5.8 Types of data.
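A short Python sketch can show how the data type limits which summaries make sense. All sample values below are hypothetical, and the Kelvin conversion is an added illustration of the "arbitrary zero point" idea rather than something taken from the notes.

```python
import statistics

# Nominal: categories without rank order, so only the mode is meaningful.
blood_types = ["O", "A", "A", "B", "O", "O", "AB"]
most_common_type = statistics.mode(blood_types)

# Ordinal: ranked grades (1-6), so median and mode are usable.
murmur_grades = [1, 2, 2, 3, 4, 6, 2]
median_grade = statistics.median(murmur_grades)

# Interval: differences are meaningful, ratios are not (zero is arbitrary).
difference = 90 - 60                                   # 30 degrees, as in the notes
celsius_ratio = 40 / 20                                # looks like "twice as hot"
kelvin_ratio = (40 + 273.15) / (20 + 273.15)           # same temperatures, absolute zero

print(most_common_type, median_grade, difference)
print(celsius_ratio, round(kelvin_ratio, 3))
```

The same two temperatures give a ratio of 2.0 in Celsius but only about 1.07 on a scale with an absolute zero, which is why ratios are reserved for ratio data.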
