Module 5

Hypothesis testing involves making statements about unknown population parameters based on sample data. The key elements of a hypothesis test are the null hypothesis (H0), alternative hypothesis (HA), test statistic, and rejection region. Hypothesis tests balance type I and type II errors. Sample distributions are used to test hypotheses and calculate p-values and confidence intervals. Power refers to the probability of correctly rejecting a false null hypothesis and depends on factors like sample size and effect size. Sample size calculations aim to have sufficient power to detect clinically meaningful differences.

Uploaded by Jagadeswar Babu

Hypothesis testing

Hypothesis Testing

• Goal: Make statement(s) regarding unknown population


parameter values based on sample data
• Elements of a hypothesis test:
– Null hypothesis - Statement regarding the value(s) of
unknown parameter(s). Typically will imply no association
between explanatory and response variables in our
applications (will always contain an equality)
– Alternative hypothesis - Statement contradictory to the
null hypothesis (will always contain an inequality)
– Test statistic - Quantity based on sample data and null
hypothesis used to test between null and alternative
hypotheses
– Rejection region - Values of the test statistic for which we
reject the null in favor of the alternative hypothesis
Hypothesis Testing

                      Test Result
True State      Conclude H0 True    Conclude H0 False
H0 True         Correct Decision    Type I Error
H0 False        Type II Error       Correct Decision

  α = P(Type I Error)      β = P(Type II Error)

• Goal: Keep α and β reasonably small


Example - Efficacy Test for New drug

• Drug company has new drug, wishes to


compare it with current standard treatment
• Federal regulators tell company that they must
demonstrate that new drug is better than
current treatment to receive approval
• Firm runs clinical trial where some patients
receive new drug, and others receive standard
treatment
• Numeric response of therapeutic effect is
obtained (higher scores are better).
• Parameter of interest: μNew − μStd
Example - Efficacy Test for New drug
• Null hypothesis - New drug is no better than standard trt

  H0: μNew − μStd ≤ 0    (μNew ≤ μStd)

• Alternative hypothesis - New drug is better than standard trt

  HA: μNew − μStd > 0

• Experimental (Sample) data:

  New drug:   ȳNew, sNew, nNew
  Standard:   ȳStd, sStd, nStd
Sampling Distribution of Difference in Means

• In large samples, the difference in two sample means is
  approximately normally distributed:

  Ȳ1 − Ȳ2 ~ N( μ1 − μ2 , σ1²/n1 + σ2²/n2 )

• Under the null hypothesis, μ1 − μ2 = 0 and:

  Z = (Ȳ1 − Ȳ2) / √(σ1²/n1 + σ2²/n2) ~ N(0, 1)

• σ1² and σ2² are unknown and are estimated by the sample
  variances s1² and s2² in large samples
Example - Efficacy Test for New drug

• Type I error - Concluding that the new drug is better than the
  standard (HA) when in fact it is no better (H0). Ineffective drug
  is deemed better.
  – Traditionally α = P(Type I error) = 0.05

• Type II error - Failing to conclude that the new drug is better
  (HA) when in fact it is. Effective drug is deemed to be no better.
  – Traditionally a clinically important difference (Δ) is
    assigned and sample sizes chosen so that:

    β = P(Type II error | μ1 − μ2 = Δ) ≤ 0.20
Elements of a Hypothesis Test
• Test Statistic - Difference between the Sample means,
scaled to number of standard deviations (standard errors)
from the null difference of 0 for the Population means:

y1  y 2
T .S . : zobs 
s12 s22

n1 n2
• Rejection Region - Set of values of the test
statistic that are consistent with HA, such that
the probability it falls in this region when H0
is true is a (we will always set a=0.05)
R.R. : zobs  z   0.05  z  1.645
P-value (aka Observed Significance Level)

• P-value - Measure of the strength of evidence the sample data provides against
the null hypothesis:

  P(Evidence this strong or stronger against H0 | H0 is true)

  P-val:  p = P(Z ≥ z_obs)

Large-Sample Test H0: μ1 − μ2 = 0 vs HA: μ1 − μ2 > 0

• H0: μ1 − μ2 = 0 (No difference in population means)
• HA: μ1 − μ2 > 0 (Population Mean 1 > Pop Mean 2)

  T.S.:  z_obs = (ȳ1 − ȳ2) / √(s1²/n1 + s2²/n2)

  R.R.:  z_obs ≥ z_α

  P-value:  P(Z ≥ z_obs)

• Conclusion - Reject H0 if the test statistic falls in the
  rejection region, or equivalently if the P-value is less than
  or equal to α
2-Sided Tests

• Many studies don’t assume a direction with respect to the
  difference μ1 − μ2
• H0: μ1 − μ2 = 0     HA: μ1 − μ2 ≠ 0
• Test statistic is the same as before
• Decision Rule:
  – Conclude μ1 − μ2 > 0 if z_obs ≥ z_{α/2}   (α = 0.05 ⇒ z_{α/2} = 1.96)
  – Conclude μ1 − μ2 < 0 if z_obs ≤ −z_{α/2}  (α = 0.05 ⇒ −z_{α/2} = −1.96)
  – Do not reject μ1 − μ2 = 0 if −z_{α/2} ≤ z_obs ≤ z_{α/2}
• P-value: 2·P(Z ≥ |z_obs|)
Power of a Test

• Power - Probability a test rejects H0 (depends on μ1 − μ2)
  – H0 True:  Power = P(Type I error) = α
  – H0 False: Power = 1 − P(Type II error) = 1 − β

• Example:
  • H0: μ1 − μ2 = 0     HA: μ1 − μ2 > 0
  • σ1² = σ2² = 25      n1 = n2 = 25

• Decision Rule: Reject H0 (at α = 0.05 significance level) if:

  z_obs = (ȳ1 − ȳ2) / √(σ1²/n1 + σ2²/n2) ≥ 1.645   ⇔   ȳ1 − ȳ2 ≥ 2.326
Power of a Test

• Now suppose in reality that μ1 − μ2 = 3.0 (HA is true)

• Power now refers to the probability we (correctly)
  reject the null hypothesis. Note that the sampling
  distribution of the difference in sample means is
  approximately normal, with mean 3.0 and standard
  deviation (standard error) 1.414.
• Decision Rule (from last slide): Conclude population
  means differ if the sample mean for group 1 is at
  least 2.326 higher than the sample mean for group 2
• Power for this case can be computed as:

  P(Ȳ1 − Ȳ2 ≥ 2.326)   where  Ȳ1 − Ȳ2 ~ N(3, 2.0)   (sd = 1.414)

Power of a Test

2.326 3
Power P(Y1 Y 2  2.326)  P(Z   0.48)  .6844
1.41
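This power calculation is easy to check numerically; the exact answer differs slightly from .6844 because the slide rounds the cutoff to 2.326 and the z-score to −0.48:

```python
from math import sqrt
from statistics import NormalDist

sigma_sq, n = 25, 25
se = sqrt(sigma_sq / n + sigma_sq / n)       # standard error = 1.414
cutoff = NormalDist().inv_cdf(0.95) * se     # reject when ybar1 - ybar2 >= 2.326
# Under HA the difference in sample means is N(mean=3, sd=1.414):
power = 1 - NormalDist(mu=3.0, sigma=se).cdf(cutoff)
```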

• All else being equal:


• As sample sizes increase, power increases
• As population variances decrease, power increases
• As the true mean difference increases, power
increases
Power of a Test
[Figure: sampling distributions of the test statistic under H0 and under HA, with the rejection region in the upper tail]
Power of a Test

[Figure: power curves for group sample sizes of 25, 50, 75, 100 and varying true values of μ1 − μ2, with σ1 = σ2 = 5]
Sample Size Calculations for Fixed Power

• Goal - Choose sample sizes to have a favorable chance of
  detecting a clinically meaningful difference
• Step 1 - Define an important difference in means:
  – Case 1: σ approximated from prior experience or pilot
    study - difference Δ = μ1 − μ2 can be stated in units of
    the data
  – Case 2: σ unknown - difference must be stated in units of
    standard deviations of the data
• Step 2 - Choose the desired power to detect the clinically
  meaningful difference (1 − β, typically at least .80). For a
  2-sided test:

  n1 = n2 = 2σ²(z_{α/2} + z_β)² / Δ²
Example - Rosiglitazone for HIV-1 Lipoatrophy

• Trts - Rosiglitazone vs Placebo
• Response - Change in limb fat mass
• Clinically Meaningful Difference - 0.5 (std dev’s)
• Desired Power - 1 − β = 0.80
• Significance Level - α = 0.05

  z_{α/2} = 1.96      z_β = z_.20 = 0.84

  n1 = n2 = 2(1.96 + 0.84)² / (0.5)² ≈ 63

Source: Carr, et al (2004)
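The n1 = n2 = 63 figure can be reproduced directly from the formula, rounding the required n up to the next whole subject (σ = 1 because the difference is stated in standard-deviation units):

```python
from math import ceil
from statistics import NormalDist

alpha, beta = 0.05, 0.20
delta = 0.5                                   # clinically meaningful difference, in sd units
z_half = NormalDist().inv_cdf(1 - alpha / 2)  # z_{alpha/2} = 1.96
z_beta = NormalDist().inv_cdf(1 - beta)       # z_beta = 0.84
# sigma = 1 since delta is already expressed in standard deviations
n_per_group = ceil(2 * (z_half + z_beta)**2 / delta**2)
```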
Confidence Intervals

• Normally distributed data - approximately 95%
  of individual measurements lie within 2
  standard deviations of the mean
• Difference between 2 sample means is
  approximately normally distributed in large
  samples (regardless of shape of distribution of
  individual measurements):

  Ȳ1 − Ȳ2 ~ N( μ1 − μ2 , σ1²/n1 + σ2²/n2 )

• Thus, we can expect (with 95% confidence) that the
  difference in sample means lies within 2 standard
  errors of μ1 − μ2
(1 − α)100% Confidence Interval for μ1 − μ2

• Large-sample confidence interval for μ1 − μ2:

  ȳ1 − ȳ2 ± z_{α/2} √(s1²/n1 + s2²/n2)

• Standard level of confidence is 95% (z_.025 = 1.96 ≈ 2)
• (1 − α)100% CI’s and 2-sided tests reach the same conclusions
Example - Viagra for ED

• Comparison of Viagra (Group 1) and Placebo (Group 2) for ED
• Data pooled from 6 double-blind trials
• Subjects - White males
• Response - Percent of successful intercourse attempts in
  past 4 weeks (each subject reports his own percentage)

  ȳ1 = 63.2    s1 = 41.3    n1 = 264
  ȳ2 = 23.5    s2 = 42.3    n2 = 240

95% CI for μ1 − μ2:

  (63.2 − 23.5) ± 1.96 √( (41.3)²/264 + (42.3)²/240 ) = 39.7 ± 7.3 = (32.4, 47.0)

Source: Carson, et al (2002)
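The interval can be verified from the study’s summary numbers:

```python
from math import sqrt

y1, s1, n1 = 63.2, 41.3, 264   # Viagra group
y2, s2, n2 = 23.5, 42.3, 240   # Placebo group

# 95% CI: (ybar1 - ybar2) +/- 1.96 * SE of the difference
margin = 1.96 * sqrt(s1**2 / n1 + s2**2 / n2)
lo, hi = (y1 - y2) - margin, (y1 - y2) + margin
```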
ANOVA
ANOVA: Comparing Several Means
• The statistical methodology for comparing
several means is called analysis of variance, or
ANOVA.
• In this case one variable is categorical.
– This variable forms the groups to be compared.
• The response variable is numeric.
• This methodology is the extension of
comparing two means.
ANOVA: Comparing Several Means

• Examples:
– “An investigator is interested in studying the average number of days rats
live when fed diets that contain different amounts of fat. Three
populations were studied, where rats in population 1 were fed a high-fat
diet, rats in population 2 were fed a medium-fat diet, and rats in
population 3 were fed a low-fat diet. The variable of interest is ‘Days
lived.’” (from Graybill, Iyer and Burdick, Applied Statistics, 1998).
– “A state regulatory agency is studying the effects of secondhand smoke in
the workplace. All companies in the state that employ more than 15
workers must file a report with the agency that describes the company’s
smoking policy. In particular, each company must report whether (1)
smoking is allowed (no restrictions), (2) smoking is allowed only in
restricted areas, or (3) smoking is banned. In order to determine the
effect of secondhand smoke, the state agency needs to measure the
nicotine level at the work site. It is not possible to measure the nicotine
level for every company that reports to the agency, and so a simple
random sample of 25 companies is selected from each category of
smoking policy.” (from Graybill, Iyer and Burdick, Applied Statistics, 1998).
Assumptions for ANOVA

1. Each of the I population or group distributions is normal. -


check with a Normal Quantile Plot (or boxplot) of each group
2. These distributions have identical variances (standard
   deviations). - check whether the largest sd is more than 2
   times the smallest sd
3. Each of the I samples is a random sample.
4. Each of the I samples is selected independently of one another.
ANOVA: Comparing Several Means

H0 : 1   2     I

where I is the number of


populations to be compared
The alternative hypothesis (step 2) is
The null hypothesis (step 1) for comparing several means is

H a : not all of the i are equal


(at least one of the means
is different from the others)
ANOVA: Comparing Several Means

• Step 3: State the significance level
• Step 4: Calculate the F-statistic:

  F = Mean Squares Group / Mean Squares Error = MSG / MSE

This compares the variation between groups
(group mean to group mean) to the variation
within groups (individual values to group
means). This is what gives it the name “Analysis of
Variance.”
ANOVA: Comparing Several Means
Pr( Fdf1 ,df 2  Fcalculated )
where df1 = I – 1 (number of
groups minus 1) and
df2 = N – I (total sample size
minus number of groups).
• Step 5: Find the P-value
– The P-value for an ANOVA F-test is always one-sided.
– The P-value is
P-value
F-
distribution
ANOVA: Comparing Several Means

• Step 6. Reject or fail to reject H0 based on the P-value.


– If the P-value is less than or equal to α, reject H0.
– If the P-value is greater than α, fail to reject H0.
• Step 7. State your conclusion.
– If H0 is rejected, “There is significant statistical
evidence that at least one of the population means
is different from another.”
– If H0 is not rejected, “There is not significant
statistical evidence that at least one of the
population means is different from another.”
ANOVA Table
Source            df      Sum of Squares           Mean Square            F            p-value
Group (between)   I − 1   SSG = Σ nᵢ(x̄ᵢ − x̄)²      MSG = SSG / dfG        F = MSG/MSE  Pr(F ≥ Fcalc)
Error (within)    N − I   SSE = Σ (nᵢ − 1)sᵢ²      MSE = SSE / dfE
Total             N − 1   SSTot = Σ (xᵢⱼ − x̄)²     MSTot = SSTot / dfTot

Note: MSE is the pooled sample variance and SSG + SSE = SSTot

  R² = SSG / SSTot  is the proportion of the total variation explained by the
                    difference in means
ANOVA: Comparing Several Means

• Example: “An experimenter is interested in the effect


of sleep deprivation on manual dexterity. Thirty-two
(N) subjects are selected and randomly divided into
four (I) groups of size 8 (ni). After differing amount of
sleep deprivation, all subjects are given a series of
tasks to perform, each of which requires a high
amount of manual dexterity. A score from 0 (poor
performance) to 10 (excellent performance) is
obtained for each subject. Test at the a = 0.05 level
the hypothesis that the degree of sleep deprivation
has no effect on manual dexterity.” (from Milton,
McTeer, and Corbet, Introduction to Statistics, 1997)
ANOVA: Comparing Several Means

• Information Given

  Sample size: N = 32 (ni = 8 per group)

  Group I     Group II    Group III   Group IV
  16 hours    20 hours    24 hours    28 hours
  8.95        7.7         5.99        3.78
  8.04        5.81        6.79        3.35
  7.72        6.61        6.43        2.45
  6.21        6.07        5.85        4.27
  6.48        8.04        5.78        4.87
  7.81        5.96        7.6         3.14
  7.5         7.3         5.78        3.98
  6.9         7.46        6           2.47

  Stddev1 = 0.89316   Stddev2 = 0.86603   Stddev3 = 0.64507   Stddev4 = 0.85206

  (Variation within groups: spread of values around each group mean;
  variation between groups: differences among the four group means.)

Side by Side Boxplots

[Figure: side-by-side boxplots of dexterity scores (roughly 2.00 to 9.00) for Groups I-IV, with a clear drop for Group IV]


Normal Quantile Plots

[Figure: normal quantile plots (expected normal score vs observed value) for each of the four groups]
ANOVA: Comparing Several Means

• Information Given

[Figure: dot plot of group means with ±1 SE error bars - means 7.45, 6.87, 6.28, and 3.54 for 16, 20, 24, and 28 hours deprived; the spread of scores around each mean is the average within-group variation (MSE)]
ANOVA: Comparing Several Means

• Information Given

[Figure: the same group means, highlighting the spread among the means themselves - the average between-group variation (MSG)]
ANOVA: Comparing Several Means

H 0 : 1   2   3   4

Step 2: The alternative hypothesis is


Ha : not all of the i are equal

Step 3: The significance level is a =


0.05
Step 1: The null hypothesis is
ANOVA: Comparing Several Means

Mean Square Group MSG 23.976  35.73


F or 
Mean Square Error MSE 0.671
MSG and MSE are found in the ANOVA table
when the analysis is run on the computer:
• Step 4: Calculate the F-statistic:

M M
ANOVA

DEXTER
Sum of
SG SE
Squares df Mean Square F Sig.
Between Groups 71.928 3 23.976 35.730 .000
Within Groups 18.789 28 .671
Total 90.716 31
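As a check, the F ratio can be recomputed from the raw dexterity scores (grouped here as read from the data slide) by building the sums of squares by hand:

```python
# Dexterity scores by hours of sleep deprivation (from the data slide)
groups = {
    "16 hours": [8.95, 8.04, 7.72, 6.21, 6.48, 7.81, 7.5, 6.9],
    "20 hours": [7.7, 5.81, 6.61, 6.07, 8.04, 5.96, 7.3, 7.46],
    "24 hours": [5.99, 6.79, 6.43, 5.85, 5.78, 7.6, 5.78, 6.0],
    "28 hours": [3.78, 3.35, 2.45, 4.27, 4.87, 3.14, 3.98, 2.47],
}
values = [v for g in groups.values() for v in g]
grand_mean = sum(values) / len(values)

# Between-group (SSG) and within-group (SSE) sums of squares
ssg = sum(len(g) * (sum(g) / len(g) - grand_mean)**2 for g in groups.values())
sse = sum((v - sum(g) / len(g))**2 for g in groups.values() for v in g)

df1, df2 = len(groups) - 1, len(values) - len(groups)   # 3 and 28
F = (ssg / df1) / (sse / df2)                           # MSG / MSE
```

This reproduces the table: SSG ≈ 71.928, SSE ≈ 18.789, F ≈ 35.73.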
ANOVA: Comparing Several Means

Pr( Fdf1 ,df 2  Fcalculated )  Pr( Fdf1 ,df2  35.73)


 .0001

where df1 = I – 1 (number of groups


minus 1) = 4 – 1 = 3 and df2 = N – I
(total sample size minus I) = 32 – 4 = 28
• Step 5:ANOVA
Find the P-value
DEXTER

Sum of The P-value is
Squares df Mean Square F Sig.
Between Groups 71.928 3 23.976 35.730 .000
Within Groups 18.789 28 .671
Total 90.716 31

35.7
3
ANOVA: Comparing Several Means

• Step 6. Reject or fail to reject H0 based on the P-value.

– Because the P-value is less than α = 0.05, reject H0.
• Step 7. State your conclusion.

– “There is significant statistical


evidence that at least one of the
population means is different from
another.”
An additional test will tell us which
means are different from the others.
Non-Parametric Tests

Level of      One-sample      Two-sample case                          K-sample case
measurement   test            Related samples    Independent samples   Related samples   Independent samples
-----------   -------------   ----------------   -------------------   ---------------   -------------------
Nominal       Binomial        McNemar for        Fisher exact          Cochran Q         Chi-square
                              significance       probability
                              of changes         (dichotomous);
                                                 Chi-square
Ordinal       Kolmogorov-     Sign;              Mann-Whitney U;       Friedman          Kruskal-Wallis
              Smirnov;        Wilcoxon           Kolmogorov-Smirnov;   two-way           one-way
              Runs            matched-pairs      Wald-Wolfowitz runs;  analysis of       analysis of
                              signed-ranks       Moses test of         variance;         variance
                                                 extreme reactions     Kendall’s W
Interval                      Walsh              Randomization
• Chi-square – tests whether the observed distribution is the
  same as a certain hypothesized distribution. The default null
  hypothesis is an even (uniform) distribution.
• Kolmogorov-Smirnov – compares the distribution of a variable
  with a uniform, normal, Poisson, or exponential distribution.
• Null hypothesis: the observed values were sampled from a
  distribution of that type.
Runs
• A run is defined as a sequence of cases on the
same side of the cut point. (An uninterrupted
course of some state or condition, for e.g. a
run of good luck).
• You should use the Runs Test procedure when you want to test
  the hypothesis that the values of a variable are ordered
  randomly with respect to a cut point of your choosing
  (default cut point: the median).
• E.g. suppose you ask 20 students how well they understand a
  lecture on a scale from 1 to 5 (and the median in the class
  is 3). If the first 10 students give a value higher than 3
  and the second 10 give a value lower than 3, there are only
  2 runs:  5445444545 2222112211
• In a random situation there should be more runs (but not
  close to 20, which would mean the values alternate exactly:
  a value below 3 followed by one above it, and vice versa):
  2,4,1,5,1,4,2,5,1,4,2,4
• The Runs Test is often used as a precursor to running tests
that compare the means of two or more groups, including:
– The Independent-Samples T Test procedure.
– The One-Way ANOVA procedure.
– The Two-Independent-Samples Tests procedure.
– The Tests for Several Independent Samples procedure.
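The run-counting step is easy to sketch; this stdlib-only helper reproduces the two examples above (2 runs for the clustered ratings, 12 for the perfectly alternating sequence). Dropping values equal to the cut point is one common convention:

```python
def count_runs(values, cut):
    """Count runs of values strictly above/below the cut point.
    Values equal to the cut are dropped before counting."""
    sides = [v > cut for v in values if v != cut]
    if not sides:
        return 0
    # a new run starts at every change of side
    return 1 + sum(a != b for a, b in zip(sides, sides[1:]))

clustered = [5, 4, 4, 5, 4, 4, 4, 5, 4, 5, 2, 2, 2, 2, 1, 1, 2, 2, 1, 1]
alternating = [2, 4, 1, 5, 1, 4, 2, 5, 1, 4, 2, 4]
```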
Sample cases (Related Samples)

• McNemar – tests whether the changes in proportions are the
  same for pairs of dichotomous variables. McNemar’s test is
  computed like the usual chi-square test, but only the two
  cells in which the classifications don’t match are used.
• Null hypothesis: people are equally likely to fall into the
  two contradictory classification categories.
• Sign test – tests whether the numbers of differences (+ve or
  –ve) between two samples are approximately the same. Each
  pair of scores (before and after) is compared.
• When “after” > “before”, a + sign is recorded; when smaller,
  a – sign. When both are the same, it is a tie.
• The sign test does not use all the information available (the
  size of the difference), but it requires fewer assumptions
  about the sample and can avoid the influence of outliers.
• Example: to test the association between the following two
  perceptions: “Social workers help the disadvantaged” and
  “Social workers bring hope to those in adverse situations”
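A sketch of the two-sided sign-test p-value: under H0 the sign counts are Binomial(n, 0.5), so the p-value is just a binomial tail sum. The 8-vs-2 split below is a made-up example, not data from the slides:

```python
from math import comb

def sign_test_p(n_plus, n_minus):
    """Two-sided sign-test p-value; ties are assumed dropped already.
    Under H0, the number of + signs is Binomial(n, 0.5)."""
    n = n_plus + n_minus
    k = max(n_plus, n_minus)
    one_tail = sum(comb(n, i) for i in range(k, n + 1)) / 2**n
    return min(1.0, 2 * one_tail)

p = sign_test_p(8, 2)   # hypothetical: 8 pluses, 2 minuses out of 10 pairs
```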
• Wilcoxon matched-pairs signed-ranks test – similar to the sign
  test, but takes into consideration the ranking of the magnitude
  of the differences among the pairs of values. (The sign test
  only considers the direction of the difference, not the
  magnitude.)
• The test requires that the differences (of the true values) be
  a sample from a symmetric distribution (but does not require
  normality). It is a good idea to run a stem-and-leaf plot of
  the differences.
Two-sample case (independent samples)

• Mann-Whitney U – similar to the Wilcoxon matched-pairs
  signed-ranks test except that the samples are independent, not
  paired. It is the most commonly used alternative to the
  independent-samples t test.
• Null hypothesis: the population means are the same for the
two groups.
• The actual computation of the Mann-Whitney test is simple.
You rank the combined data values for the two groups. Then
you find the average rank in each group.
• Requirement: the population variances for the two groups
must be the same, but the shape of the distribution does not
matter.
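The ranking computation described above can be sketched directly; this version assigns average ranks to tied values, sums the ranks of the first group, and returns the smaller of the two U statistics (a common reporting convention):

```python
def mann_whitney_u(x, y):
    """Mann-Whitney U from the combined ranks of two independent samples."""
    vals = x + y
    n = len(vals)
    order = sorted(range(n), key=lambda i: vals[i])
    ranks = [0.0] * n
    i = 0
    while i < n:                        # assign average ranks to ties
        k = i
        while k + 1 < n and vals[order[k + 1]] == vals[order[i]]:
            k += 1
        avg_rank = (i + k) / 2 + 1      # ranks are 1-based
        for t in range(i, k + 1):
            ranks[order[t]] = avg_rank
        i = k + 1
    r1 = sum(ranks[:len(x)])            # rank sum of the first group
    u1 = r1 - len(x) * (len(x) + 1) / 2
    u2 = len(x) * len(y) - u1
    return min(u1, u2)

u = mann_whitney_u([1, 2, 3], [4, 5, 6])   # complete separation -> U = 0
```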
• Kolmogorov-Smirnov Z – tests whether two distributions are
  different. It is used when there are only a few values
  available on the ordinal scale. The K-S test is more powerful
  than the M-W U test if the two distributions differ in
  dispersion rather than central tendency.
K-sample case
(Independent samples)

• Kruskal-Wallis one-way ANOVA – more powerful than the
  chi-square test when an ordinal scale can be assumed. It is
  computed exactly like the Mann-Whitney test, except that there
  are more groups. The data must be independent samples from
  populations with the same shape (but not necessarily normal).
