ENG - Epidemiology Biostatistics IMG
Epidemiology
Epidemiology is the study of the distribution and determinants of health-related states and events in
specified populations, and the application of this study to the control of health problems.
• By Distribution we mean Time – Place – Person
• By Determinants we mean causes and factors that influence the risk of disease.
Epidemiological Studies
A) Descriptive Studies
1 - Case report and case series:
Results are not representative of the population. This is the first step in a field or community investigation.
2 - Studies with secondary data:
We use data that already exist to investigate and answer questions for which the original study was not
designed. The original study could be cross-sectional, case-control, cohort, or clinical trial.
These data could be individual data or grouped data.
Individual data: data for every member of a previous study, e.g. clinical history or death certificate.
Grouped data (Ecological study): These are data for groups, rather than individuals. Ecological studies
compare the frequency of events in different groups. Caution is required in drawing conclusions and
identifying associations.
B) Observational Studies
Analytic (etiologic) studies are performed to test specific hypotheses about a specific health
problem. In general, associations observed in descriptive studies are often the basis for gathering more
specific data and testing hypotheses in additional studies.
Advantages
• Correct classification of exposure before disease develops.
• Permits calculation of incidence rates.
• A direct measure of relative risk and attributable risk.
• Many possible outcomes of the same exposure can be studied.
• Suitable to study the effect of rare exposures.
• Accurate.
Disadvantages
• Not suitable for rare diseases.
• Time consuming (follow up).
• Losing people in follow up (attrition).
• Expensive.
• Status of subjects may change, leading to error in classification of exposure, e.g. change in habit, occupation.
• Administrative problems: loss of staff, funding, high costs of study.
Reliability is the consistency and reproducibility of a test, i.e. the absence of random variation. It shows how
precise the test is; random error reduces the precision of a test. It is the ability of a test or combination of tests
to give consistent results in repeated applications, whether correct or incorrect.
Example: a nurse making repeated blood pressure measurements on an individual (consistency of the test), or
ten different nurses measuring the blood pressure of the same individual (consistency of the person performing the test).
Accuracy: the proportion of true test results (TP + TN) among all test results = (a + d) / (a + b + c + d)
Negative Predictive Value (NPV) = TN / (TN + FN): the predictive value of a negative test result, i.e. the
likelihood that a negative test result will exclude the disease. NPV best reflects the FN rate of a test.
Tests with 100% sensitivity (no false negatives) always have an NPV of 100%.
Positive Predictive Value (PPV) = TP / (TP + FP): the predictive value of a positive test result, i.e. the
likelihood that a positive test result will confirm the disease. PPV best reflects the FP rate of a test.
Tests with 100% specificity (no false positives) always have a PPV of 100%.
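The accuracy, PPV and NPV formulas can be checked with a few lines of Python. This is a minimal sketch: the function name `diagnostic_metrics` and the counts are hypothetical, chosen only to illustrate the arithmetic.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Return accuracy, PPV and NPV from the four cells of a 2x2 table."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,  # true results among all results
        "ppv": tp / (tp + fp),          # positive results that are true positives
        "npv": tn / (tn + fn),          # negative results that are true negatives
    }

# Hypothetical counts, for illustration only.
example = diagnostic_metrics(tp=80, fp=30, fn=20, tn=870)
print(example)

# With no false negatives (100% sensitivity) the NPV is 1.0.
perfect_sensitivity = diagnostic_metrics(tp=100, fp=30, fn=0, tn=870)
print(perfect_sensitivity["npv"])
```

In the second call the false-negative cell is zero, so TN / (TN + FN) reduces to TN / TN, which is why a test with 100% sensitivity has an NPV of 100%.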
Bayes' Theorem: Post-test Odds = Pre-test Odds x Likelihood Ratio
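The odds form of Bayes' theorem can be sketched as follows. The conversion odds = p / (1 - p) is standard; the function name `post_test_probability`, the 20% pre-test probability and the likelihood ratio of 9 are hypothetical values used only for illustration.

```python
def post_test_probability(pre_test_probability, likelihood_ratio):
    """Apply post-test odds = pre-test odds x likelihood ratio."""
    pre_test_odds = pre_test_probability / (1 - pre_test_probability)
    post_test_odds = pre_test_odds * likelihood_ratio
    # Convert odds back to a probability.
    return post_test_odds / (1 + post_test_odds)

# Hypothetical inputs: 20% pre-test probability, likelihood ratio of 9.
result = post_test_probability(0.20, 9.0)
print(round(result, 3))
```

A likelihood ratio of 1 leaves the probability unchanged, since the odds are multiplied by 1.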
BIOSTATISTICS
1. Descriptive statistics:
Provide accurate description, presentation and organization of data. Examples are:
• Numbers/calculations used to summarize/describe set of data
• allow for quick analysis of your data by tabulating & graphing
• various indices (mean, variance)
• basis for inferential statistics
2. Inferential Statistics:
Generalizing from the group you have sampled to the population from which the sample was drawn.
We want to infer something about the population from the data we have collected and analyzed. Examples:
Poll to look at political party affiliation
Treatment effect
Educational intervention
relationship between exercise and health
Population:
All possible observations/measurements of interest, may be hypothetical (and infinite)
Example: all 21 yr old males, all 21 yr old Canadian males
Population Parameter
numeric measures of the population
mean, standard deviation, proportion
usually not known because all observations cannot be made
Sample:
A portion of the population that is observed or measured
• subset of population
• representative sample allows for generalization
• size of sample is important, but not as important as representativeness
• size of a sample can never compensate for a biased sample
• biased = non-representative (selection bias)
Sample statistic
Any numeric measure computed from a subset (sample) of the population is called Sample Statistic
In research, we use the sample statistic to estimate the population parameter.
VARIABLE
Inferential & descriptive statistics are concerned with variables
Anything that can vary (organism, environment, experimental treatment/situation)
Anything that is being observed, measured, categorized or manipulated
Examples: gender, GPA, blood pressure, "satisfaction with life", number of cigarettes smoked/wk,
experimental treatment
TYPES OF VARIABLES
Independent Variable vs. Dependent Variable
Discrete Variable vs. Continuous Variable
Qualitative Variable vs. Quantitative Variable
The type of variable determines what kind of statistical procedure may be performed
Independent Variable
• the intervention
• what is being manipulated by the experimenter
• the variable the experimenter changes
• the experimenter’s variable of interest
• the “cause” or “what is responsible” for the observed effect
• must sometimes rely on existing variation (gender, lung cancer)
Example: treatment (drug, placebo), educational intervention (PBL, lecture)
Dependent Variable
• that which depends on the independent variable
• that which the independent variable influences
• outcome of interest
• changes in response to the independent variable
Continuous Variable
Data may take any value within a defined range; Blood pressure, height, weight, lung capacity
Quantitative data:
Interval
• ordered numbers along a continuum
• differences between the numbers are meaningful
• equal distance between each value
• no meaningful zero point; it is arbitrary
• cannot turn into ratio or percent
Example: IQ, temperature (C, F). A person with an IQ of 100 is not twice as smart as a person with an IQ of 50.
A temperature of 20°C is not twice as hot as 10°C.
Ratio
• ordered along a continuum
• differences between numbers are meaningful
• equal distance between each value
• meaningful zero point
• can convert to ratio or percent
Example: weight (lbs, kg), height (m, ft). A weight of 200 lbs is twice as heavy as 100 lbs.
A woman of 6 ft height is 1.5 times taller than a woman of 4 ft.
Mean
The mean is used with interval & ratio data, and sometimes with ordinal data (e.g. rating/Likert scales). It is the
most familiar and widely used measure, and it reflects all the data.
It is the summation of values divided by the total number of values.
Example (n = 11): 8 3 6 4 11 2 9 4 10 11 4
Sample mean: $\bar{X} = \frac{\sum X}{n} = \frac{8 + 3 + 6 + 4 + \dots + 11 + 4}{11} = \frac{72}{11} = 6.55$
Population mean: $\mu = \frac{\sum X}{N}$
Least-squares criterion: the sum of deviations around the mean is zero.
$\sum (X - \bar{X}) = 0$, e.g. $(8 - 6.55) + \dots + (4 - 6.55) = 1.45 + \dots + (-2.55) = 0$
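The mean of the example data and the zero sum of deviations can be verified with a short Python sketch. The variable names are illustrative only.

```python
data = [8, 3, 6, 4, 11, 2, 9, 4, 10, 11, 4]

# Mean: sum of values divided by the number of values.
mean = sum(data) / len(data)

# Deviations of each value from the mean; they cancel out.
deviations = [x - mean for x in data]

print(round(mean, 2))
print(abs(sum(deviations)) < 1e-9)  # zero up to floating-point rounding
```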
Median
The Middle point that divides the data into 2 equal sets after arrangement in order. Median is used for
ordinal, interval and ratio data. It reflects 50th percentile. It is not influenced by extreme values
Example (odd): N=11 8 3 6 4 11 2 9 4 10 11 4
Step 1: place in order: 2 3 4 4 4 6 8 9 10 11 11
Step 2: (N+1)/2= 6th position; Step 3: 6th position value = 6
Note: Median would remain unchanged if the lowest value was changed to 0 or the highest to 22!
Example 2 (even): N=12 8 3 6 4 11 2 9 4 10 11 4 9
Step 1: 2 3 4 4 4 6 8 9 9 10 11 11
Step 2: (N+1)/2 = 6.5; there is no 6.5th position, so the median lies between the 6th and 7th values
Step 3: median is the arithmetic mean of the 6th & 7th values: (6+8)/2 = 7
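The odd and even cases can be sketched in Python as follows. The function name `median_by_position` is illustrative; it follows the steps used in the two examples (sort, find the middle position, average the two middle values when N is even).

```python
def median_by_position(values):
    """Median via the sort-and-pick-the-middle steps."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd N: the single middle value.
        return ordered[middle]
    # Even N: arithmetic mean of the two middle values.
    return (ordered[middle - 1] + ordered[middle]) / 2

odd_data = [8, 3, 6, 4, 11, 2, 9, 4, 10, 11, 4]
even_data = odd_data + [9]

print(median_by_position(odd_data))
print(median_by_position(even_data))

# Changing the lowest value to 0 and one highest value to 22 leaves the median unchanged.
extreme_data = [8, 3, 6, 4, 22, 0, 9, 4, 10, 11, 4]
print(median_by_position(extreme_data))
```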
Mode
The most frequently occurring value. It is used primarily for nominal (and ordinal) data. It can also be used for
continuous data, in which case the data are usually grouped to calculate the modal group. There may be more
than one mode, or none.
Examples: modal world nationality = Chinese; modal gender = female
Example : 8 3 6 4 11 2 9 4 10 11 4 Mode = 4
Note: if the highest or lowest value was changed drastically, it would have no effect on the mode
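A brief Python sketch of the mode, using `statistics.multimode` from the standard library (Python 3.8 or later), which returns every most-frequent value. The second and third data sets are hypothetical variations added for illustration.

```python
from statistics import multimode

data = [8, 3, 6, 4, 11, 2, 9, 4, 10, 11, 4]
modes = multimode(data)
print(modes)

# Changing the highest value drastically does not affect the mode.
with_extreme = [8, 3, 6, 4, 11, 2, 9, 4, 10, 110, 4]
modes_with_extreme = multimode(with_extreme)
print(modes_with_extreme)

# Hypothetical data with two modes.
bimodal = multimode([1, 1, 2, 2, 3])
print(bimodal)
```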
Range
It is used with ordinal, interval & ratio and calculated as difference between largest & smallest value
It is entirely dependent on the most extreme scores. Outliers/extreme scores have large effect on range.
Example: 8 3 6 4 11 2 9 4 10 11 4; Range: 11 - 2 = 9
Example: 8 3 6 4 11 2 9 4 10 19 4; Range: 19 - 2 = 17
The two data sets are identical except for one point.
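The sensitivity of the range to a single extreme score can be shown with a short Python sketch. The function name `value_range` is illustrative.

```python
def value_range(values):
    """Range: difference between the largest and smallest value."""
    return max(values) - min(values)

original = [8, 3, 6, 4, 11, 2, 9, 4, 10, 11, 4]
one_point_changed = [8, 3, 6, 4, 11, 2, 9, 4, 10, 19, 4]

print(value_range(original))
print(value_range(one_point_changed))
```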
Interquartile Range (mid-spread)
It is used for ordinal, interval & ratio data and comprises the middle 50% of the data. It is the difference
between the 75th and 25th percentiles. The IQR is not influenced by extreme scores, as it disregards half of
the data (the lower quarter and the upper quarter).
Example: (N=10) 42 43 45 47 48 49 51 53 53 54
Step 1: calculate Q1 (median of lower half); Step 2: calculate Q3 (median of upper half); Step 3: Q3 - Q1
Q1 = 45 (42 43 45 47 48); Q3 = 53 (49 51 53 53 54); IQR = 53 - 45 = 8
Example: 42 43 45 47 48 49 51 53 53 54; IQR: 53 - 45 = 8; Range: 54 - 42 = 12
Example: 42 43 45 47 48 49 51 53 53 64; IQR: 53 - 45 = 8; Range: 64 - 42 = 22
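The median-of-halves steps can be sketched in Python. The function name `iqr_by_halves` is illustrative, and dropping the middle value when N is odd is an assumption (other quartile conventions exist and give slightly different results).

```python
from statistics import median

def iqr_by_halves(values):
    """IQR as median of the upper half minus median of the lower half."""
    ordered = sorted(values)
    half = len(ordered) // 2
    lower_half = ordered[:half]
    upper_half = ordered[-half:]  # assumption: the middle value is dropped when N is odd
    return median(upper_half) - median(lower_half)

data = [42, 43, 45, 47, 48, 49, 51, 53, 53, 54]
with_outlier = [42, 43, 45, 47, 48, 49, 51, 53, 53, 64]

print(iqr_by_halves(data), max(data) - min(data))
print(iqr_by_halves(with_outlier), max(with_outlier) - min(with_outlier))
```

The outlier changes the range but not the IQR, because it lies outside the middle 50% of the data.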
Variance:
Variance is the extent to which observations vary around the mean: the larger the deviations, the
greater the variance. The sum of deviations around the mean is always equal to zero, because the positive
differences cancel out the negative differences.
• cannot use the sum of deviations from the mean as an index of variability
• could use the absolute values, but absolute values cannot be manipulated algebraically
• overcome by squaring the deviations from the mean
• the result is in squared units, not the original unit of measurement
Variance (population): $\sigma^2 = \frac{\sum (X - \mu)^2}{N}$
Variance (sample): $s^2 = \frac{\sum (X - \bar{X})^2}{n - 1}$. As N increases, sample variance approximates population variance.
Example (N = 11, data 8 3 6 4 11 2 9 4 10 11 4): $s^2 = \frac{\sum (X - \bar{X})^2}{n - 1} = \frac{112.73}{10} = 11.27$
NB: in statistics packages and spreadsheets, the default is usually the sample variance.
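The difference between the two variance formulas can be checked with Python's standard `statistics` module, where `pvariance` divides by N and `variance` divides by n - 1. The data are the eleven values used in the earlier examples.

```python
from statistics import pvariance, variance

data = [8, 3, 6, 4, 11, 2, 9, 4, 10, 11, 4]

population_variance = pvariance(data)  # sum of squared deviations / N
sample_variance = variance(data)       # sum of squared deviations / (n - 1)

print(round(population_variance, 2))
print(round(sample_variance, 2))
```

The sample variance is always the larger of the two, and the gap shrinks as the number of observations grows.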
Standard Deviation
• take the square root of the variance
• returns value to the original unit of measurement
• Easier to interpret
• Average deviation from the mean
• How much scores vary on average
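A minimal Python sketch of the standard deviation as the square root of the variance, using the eleven-value data set from the earlier examples. The sample (n - 1) form is assumed here.

```python
from math import sqrt
from statistics import stdev, variance

data = [8, 3, 6, 4, 11, 2, 9, 4, 10, 11, 4]

sample_variance = variance(data)   # in squared units
sample_sd = sqrt(sample_variance)  # back in the original unit of measurement

print(round(sample_sd, 2))
print(abs(sample_sd - stdev(data)) < 1e-9)  # agrees with the library function
```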
HYPOTHESIS
Null Hypothesis (H0)
The hypothesis of no difference: there is no association between the disease and the risk factor in the population.
Alternate Hypothesis (H1)
The hypothesis that there is some difference: there is some association between the disease and the risk
factor in the population.
Type II Error (β), "False-Negative Error"
Stating that there is no effect or difference when one exists (failing to reject the null hypothesis
when it is in fact false). β is the probability of making a Type II Error.
The Power
The Power measures the probability of rejecting the Null Hypothesis when it is in fact false, i.e. the
likelihood of finding a difference if one in fact exists (Power = 1 - β). The Power depends upon:
1. Total number of end points experienced by the population
2. Difference in compliance between treatment groups (the mean values between groups)
3. Size of expected effect
Power ∝ Sample Size and Errors ∝ 1 / Sample Size
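The relationship between power, sample size and effect size can be illustrated with a rough Python sketch. It assumes a two-sided two-sample z-test with a known, common standard deviation, which is a simplification chosen for illustration and is not taken from these notes; the function name `two_sample_power` and all input values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def two_sample_power(effect, sd, n_per_group, alpha=0.05):
    """Approximate power of a two-sided two-sample z-test (known common SD)."""
    z = NormalDist()
    z_critical = z.inv_cdf(1 - alpha / 2)
    # Expected difference in units of its standard error.
    shift = effect / (sd * sqrt(2 / n_per_group))
    # Probability that the test statistic falls in either rejection region.
    return z.cdf(shift - z_critical) + z.cdf(-shift - z_critical)

# Hypothetical scenario: true difference of 5 units, SD of 10 units.
powers = {n: two_sample_power(effect=5.0, sd=10.0, n_per_group=n) for n in (20, 50, 100)}
for n, power in powers.items():
    print(n, round(power, 3))

# A larger expected effect also increases power at a fixed sample size.
larger_effect = two_sample_power(effect=8.0, sd=10.0, n_per_group=20)
print(round(larger_effect, 3))
```

Under these assumptions, power rises with both the number of subjects per group and the size of the expected effect, and β = 1 - power falls accordingly.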
Lead Time Bias: overestimation of survival time due to earlier detection of disease, not to improved treatment
Chi-squared Test
A statistical test commonly used to compare observed data with the data we would expect to obtain
according to a specific hypothesis. It requires numerical values (counts), not percentages or ratios.
It is designed to test the correspondence between a theoretical frequency distribution and an observed
frequency distribution of categorical data. Example: if one sample of 20 patients is 30% hypertensive and a
comparison group of 25 patients is 60% hypertensive, a chi-squared test can be used to determine whether this
difference is greater than might be expected due to chance alone.
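The hypertension example can be worked through in Python. The percentages are first converted to counts (30% of 20 = 6, 60% of 25 = 15). This is a sketch of the Pearson chi-squared statistic for a 2x2 table without a continuity correction, which is an assumption; the function name `chi_squared_2x2` is illustrative.

```python
from math import erfc, sqrt

def chi_squared_2x2(table):
    """Pearson chi-squared statistic and p-value for a 2x2 table of counts."""
    row_totals = [sum(row) for row in table]
    column_totals = [sum(column) for column in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if group and outcome were unrelated.
            expected = row_totals[i] * column_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    # With 1 degree of freedom, the upper-tail probability is erfc(sqrt(x / 2)).
    p_value = erfc(sqrt(statistic / 2))
    return statistic, p_value

# Rows: the two groups; columns: hypertensive, not hypertensive.
observed_counts = [[6, 14], [15, 10]]
statistic, p_value = chi_squared_2x2(observed_counts)
print(round(statistic, 2), round(p_value, 3))
```

The test is run on the counts rather than on the percentages, because the same percentages from larger or smaller groups would give a different statistic.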
Prevalence
Prevalence is the total number of existing cases of a disease in a population at a point in time or over a
defined period of time, usually expressed as a proportion of that population.
Point prevalence: attempts to measure the frequency of all existing cases of a disease at one specific point in time;
therefore knowledge of the time of onset of disease is not required
Period prevalence: a measure constructed from the prevalence at a point in time, plus new cases and
recurrences over a defined period of time
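The two prevalence measures can be sketched in Python. Expressing prevalence as a proportion of the population is an assumption of this sketch, and the function names and all counts are hypothetical.

```python
def point_prevalence(existing_cases, population):
    """Existing cases at one point in time, as a proportion of the population."""
    return existing_cases / population

def period_prevalence(cases_at_start, new_cases_and_recurrences, population):
    """Cases at the start of the period plus new cases and recurrences during it."""
    return (cases_at_start + new_cases_and_recurrences) / population

# Hypothetical community of 1000 people followed for one year.
point = point_prevalence(existing_cases=50, population=1000)
period = period_prevalence(cases_at_start=50, new_cases_and_recurrences=30, population=1000)
print(point, period)
```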