Genome-wide association study (GWAS) data
Jim Stankovich
Menzies Research Institute
University of Tasmania
J.Stankovich@utas.edu.au
Outline
• Introduction
• Confounding variables and linkage disequilibrium
• Statistical methods to test for association in case-control GWA
studies
– Allele counting chi-square test
– Logistic regression
• Multiple testing and power
• Example: GWAS for multiple sclerosis (MS)
– Data cleaning / quality control
– Results
GWA studies have been very successful since 2007
• Prior to the advent of GWA studies, there was very little success in
identifying genetic risk factors for complex multifactorial diseases
• GWA studies have identified over 200 separate associations with
various complex diseases in the past two years
• “Human Genetic Variation” hailed as “Breakthrough of the Year” by
Science magazine in 2007
[Figure: timeline with axis 2000 to 2009]
[Figure label: Northern European ancestry]
            GG   GT   TT   Total
Cases       r0   r1   r2   R
Controls    s0   s1   s2   S
Total       n0   n1   n2   N
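Allele counts: a = 2r0 + r1 (G alleles in cases), b = r1 + 2r2 (T alleles in cases),
c = 2s0 + s1 (G alleles in controls), d = s1 + 2s2 (T alleles in controls)
Consider all the G alleles in the sample, and pick one at random.
The odds that the G allele occurs in a case: a/c
Consider all the T alleles in the sample, and pick one at random.
The odds that the T allele occurs in a case: b/d
Odds ratio: OR = (a/c) / (b/d) = ad/bc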
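The allele counting test can be sketched as follows. This is a minimal illustration, assuming the usual 2 x 2 table of allele counts built from the genotype counts r0, r1, r2 (cases) and s0, s1, s2 (controls); the function name and the example counts are hypothetical and not taken from the MS study.

```python
import math

def allele_counting_test(r0, r1, r2, s0, s1, s2):
    """Allelic chi-square test (1 df) from genotype counts for GG, GT, TT."""
    a = 2 * r0 + r1   # G alleles in cases
    b = r1 + 2 * r2   # T alleles in cases
    c = 2 * s0 + s1   # G alleles in controls
    d = s1 + 2 * s2   # T alleles in controls
    n = a + b + c + d
    odds_ratio = (a * d) / (b * c)
    # Pearson chi-square statistic for a 2 x 2 table
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Upper tail of a chi-square distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return odds_ratio, chi2, p_value

# Hypothetical genotype counts, for illustration only
odds_ratio, chi2, p_value = allele_counting_test(
    r0=60, r1=360, r2=580, s0=40, s1=320, s2=640
)
print(f"OR = {odds_ratio:.3f}, chi2 = {chi2:.2f}, p = {p_value:.4g}")
```

The test treats the two alleles carried by each person as independent observations, which is the assumption behind counting alleles rather than genotypes.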
If the disease is rare (e.g. ~0.1% for MS), the odds ratio is roughly equal to
the genotype relative risk (GRR):
the increase in risk of disease conferred by each additional G allele
e.g. if OR = 1.2:
Pr(MS | TT) = 0.1%, Pr(MS | GT) = 0.12%, Pr(MS | GG) = 0.144%
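A quick check of the rare-disease approximation, as a sketch: the exact risks are obtained by multiplying the baseline odds by the odds ratio once per G allele, and compared with simply multiplying the baseline risk. The function name is hypothetical; the baseline risk of 0.1% and OR of 1.2 come from the example above.

```python
def risk_from_odds_ratio(baseline_risk, odds_ratio, n_alleles):
    """Exact risk after multiplying the baseline odds by odds_ratio per allele."""
    odds = baseline_risk / (1 - baseline_risk) * odds_ratio ** n_alleles
    return odds / (1 + odds)

baseline = 0.001  # Pr(MS | TT) = 0.1%
for genotype, n_g in [("TT", 0), ("GT", 1), ("GG", 2)]:
    exact = risk_from_odds_ratio(baseline, 1.2, n_g)
    approx = baseline * 1.2 ** n_g  # treats the odds ratio as a relative risk
    print(f"{genotype}: exact = {exact:.5%}, approximation = {approx:.5%}")
```

For a risk this small the two columns agree to within about one part in a thousand, which is why the odds ratio can be read as a relative risk.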
Logistic regression: more flexible analysis for GWA studies
• Similar to linear regression, used for binary outcomes instead of
continuous outcomes
• Let Yi be the phenotype for individual i
Yi = 0 for controls
Yi = 1 for cases
• Let Xi be the genotype of individual i at a particular SNP
TT Xi = 0
GT Xi = 1
GG Xi = 2
• Basic logistic regression model
Let pi = E(Yi | Xi), the expected value of the phenotype given the genotype
Define logit(pi) = ln[pi / (1 - pi)]
logit(pi) ~ β0 + β1 Xi
Test whether β1 differs significantly from zero:
roughly equivalent to the allele counting chi-square test
(exp(β1) is the odds ratio per additional G allele)
PLINK: --logistic
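The additive model above can be fitted by hand for a single SNP. The following is a minimal sketch using Newton-Raphson on grouped genotype counts, followed by a Wald test of β1; it is not how PLINK is implemented, and the function name and example counts are hypothetical.

```python
import math

def fit_additive_logistic(case_counts, control_counts, n_iter=25):
    """Fit logit(p) = b0 + b1 * X, where X = 0, 1, 2 copies of the G allele.

    case_counts and control_counts hold counts for X = 0, 1, 2 (TT, GT, GG).
    Returns (b0, b1, standard error of b1).
    """
    xs = (0, 1, 2)
    totals = [r + s for r, s in zip(case_counts, control_counts)]
    b0 = math.log(sum(case_counts) / sum(control_counts))
    b1 = 0.0
    for _ in range(n_iter):
        g0 = g1 = i00 = i01 = i11 = 0.0  # score vector and information matrix
        for x, r, n in zip(xs, case_counts, totals):
            p = 1 / (1 + math.exp(-(b0 + b1 * x)))
            w = n * p * (1 - p)
            g0 += r - n * p
            g1 += x * (r - n * p)
            i00 += w
            i01 += w * x
            i11 += w * x * x
        det = i00 * i11 - i01 * i01
        # Newton step: add (information matrix)^-1 times the score
        b0 += (i11 * g0 - i01 * g1) / det
        b1 += (i00 * g1 - i01 * g0) / det
    # Information from the last iteration; the fit has converged by then
    se_b1 = math.sqrt(i00 / det)
    return b0, b1, se_b1

# Hypothetical counts in the order TT, GT, GG, for illustration only
b0, b1, se_b1 = fit_additive_logistic((580, 360, 60), (640, 320, 40))
z = b1 / se_b1
p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided Wald test of b1 = 0
print(f"OR per G allele = {math.exp(b1):.3f}, z = {z:.2f}, p = {p_value:.4g}")
```

The flexibility of logistic regression comes from the linear predictor: covariates such as sex or ancestry components could be added as further terms, which the allele counting test cannot accommodate.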
Multiple testing
• Suppose you test 500,000 SNPs for association with disease
• Expect around 500,000 x 0.05 = 25,000 to have p-value less than 0.05
• More appropriate significance threshold
p = 0.05 / 500,000 = 10^-7
(genome-wide significance)
• In our MS GWAS we considered SNPs for follow-up if they had p-values less than 0.001
• Detecting associations at a smaller p-value threshold requires a larger study
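The arithmetic behind the threshold is a Bonferroni correction, sketched below with the numbers from this slide (the variable names are illustrative).

```python
n_snps = 500_000   # number of SNPs tested
alpha = 0.05       # nominal significance level for a single test

# Expected number of SNPs with p < alpha if no SNP is truly associated
expected_false_positives = n_snps * alpha

# Bonferroni-corrected per-SNP threshold
genome_wide_threshold = alpha / n_snps

print(f"Expected false positives at p < {alpha}: {expected_false_positives:,.0f}")
print(f"Genome-wide significance threshold: {genome_wide_threshold:.0e}")
```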
The power to detect an association
• Suppose the G allele of a SNP has frequency 0.2. If each additional
G allele increases the odds of disease by a factor of 1.2, and 1618 cases
and 3413 controls are genotyped, what is the power (chance) of detecting an
association at significance level p < 0.001?
[Figure: power at significance threshold p = 0.001]
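One way to approximate the answer is a normal approximation to a two-sided comparison of allele frequencies between cases and controls. This is only a sketch under stated assumptions: the frequency 0.2 is taken to be the G allele frequency in controls, alleles are treated as independent, and the function name is hypothetical. Dedicated power calculators may give slightly different values.

```python
import math
from statistics import NormalDist

def allelic_power(freq_controls, odds_ratio, n_cases, n_controls, alpha):
    """Approximate power of a two-sided allelic test (normal approximation)."""
    # Allele frequency in cases implied by the per-allele odds ratio
    odds_cases = odds_ratio * freq_controls / (1 - freq_controls)
    freq_cases = odds_cases / (1 + odds_cases)
    # Each person contributes two alleles
    alleles_cases, alleles_controls = 2 * n_cases, 2 * n_controls
    se = math.sqrt(
        freq_cases * (1 - freq_cases) / alleles_cases
        + freq_controls * (1 - freq_controls) / alleles_controls
    )
    z_effect = (freq_cases - freq_controls) / se
    normal = NormalDist()
    z_crit = normal.inv_cdf(1 - alpha / 2)
    # Probability that the test statistic falls beyond either critical value
    return normal.cdf(z_effect - z_crit) + normal.cdf(-z_effect - z_crit)

power = allelic_power(
    freq_controls=0.2, odds_ratio=1.2, n_cases=1618, n_controls=3413, alpha=0.001
)
print(f"Approximate power at p < 0.001: {power:.2f}")
```

Rerunning the function with a smaller alpha or a smaller odds ratio shows how quickly power falls, which is why stricter thresholds call for larger studies.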
www.msif.org/en/about_ms/demyelination.html
Multiple sclerosis
• neurodegenerative disease
• autoimmune attack on myelin sheaths around nerve cells
• more females affected than males (3:1)
• average age-at-onset ~30
• ~16,000 people with MS in Australia ($2 billion p.a.)
• no cure
Risk factors
• Epstein-Barr virus
• Exposure to infant siblings (Ponsonby et al, JAMA, 2005)
• Latitude gradient, childhood sun exposure
(van der Mei et al, Lancet, 2003)
• Only genetic risk factor known before 2007 (first GWAS):
HLA-DRB1*1501, discovered in 1972
(present in 60% of MS cases and 30% of controls)
[Figure: timeline, 2000 to 2009, labelled with IL7R, IL2RA, CLEC16A, CD58, EVI5/RPL5, CD226, KIF1B, TYK2]
Principal components analysis: EIGENSTRAT
Price et al (2006). Nat Genet 38: 904
Use an independent set of ~77,000 SNPs (PLINK --indep-pairwise)
[Figure: principal components plot with groups labelled N European, African, Japanese, Chinese]
• A closer look at SNPs with missing call rates between 5% and 10% suggested
that they were unreliable
GWAS - results
[Figure: genome-wide association results with thresholds P = 10^-7 and P = 0.001; labels "# 50" and "# 500"]
Extra QC for associated SNPs: cluster plots
[Figure: association plot with reference lines at P = 4.1 x 10^-6, P = 0.00001 and P = 0.0001]
rs703842
GWAS: P = 4.1 x 10^-6
Replication: P = 1.4 x 10^-6
GWAS + replication: P = 5.4 x 10^-11
Allele frequency: 0.33
[Figure: vitamin D pathway, labelled UVB, 7-dehydrocholesterol, Diet, Vit D, Liver, 25(OH)D3, HLA-DRB1*1501]
[Figure: association plot with reference line at P = 0.0001]
rs6074022
GWAS: P = 2.5 x 10^-5
Replication: P = 4.6 x 10^-4
GWAS + replication: P = 1.3 x 10^-7
Allele frequency: 0.25
www.hapmap.org
Another use of logistic regression:
test for gene-gene interaction
MS risk alleles
Chr 12 = rs703842A
Chr 20 = rs6074022G
[Figure: odds ratios by Chr12 risk and Chr20 risk allele status]
Modest evidence that each risk allele has a bigger effect in the presence of
the other risk allele (p = 0.03)
Summary