Timing, Rates and Spectra of Human Germline Mutation: Articles
Germline mutations are a driving force behind genome evolution and genetic disease. We investigated genome-wide mutation rates and spectra in multi-sibling families. The mutation rate increased with paternal age in all families, but the number of additional mutations per year differed by more than twofold between families. Meta-analysis of 6,570 mutations showed that germline methylation influences mutation rates. In contrast to somatic mutations, we found remarkable consistency in germline mutation spectra between the sexes and at different paternal ages. In parental germ line, 3.8% of mutations were mosaic, resulting in 1.3% of mutations being shared by siblings. The number of these shared mutations varied significantly between families. Our data suggest that the mutation rate per cell division is higher during both early embryogenesis and differentiation of primordial germ cells but is reduced substantially during post-pubertal spermatogenesis. These findings have important consequences for the recurrence risks of disorders caused by de novo mutations.
© 2015 Nature America, Inc. All rights reserved.
Mutations have manifold consequences, from driving evolution to causing disease. DNA damage can have exogenous causes, such as ionizing radiation and mutagenic chemicals, or endogenous causes, such as oxidative respiration and errors in DNA replication1,2. Both endogenous and exogenous damage are restored by DNA repair pathways, which are highly conserved in mammals2. However, damage repair pathways are not perfect, and de novo mutations (DNMs) occur in every generation.
Knowledge of the rates and mechanisms by which germline mutations arise has diverse applications, from empowering the discovery of the genetic causes of rare disorders3 to dating critical periods in human evolution4. On the basis of whole-genome sequencing studies of family trios, the average generational mutation rate for single-base substitutions in humans has been estimated to be ~1–1.5 × 10⁻⁸ (refs. 5–9).
In 1947, J.B.S. Haldane noted that the mutation rate of the hemophilia-associated gene is significantly higher in men than in women10. Recent genome sequencing studies have confirmed Haldane's observation that the male germ line is more mutagenic5–8,11. On average, each additional year in father's age at conception results in ~2 additional DNMs in the child6. Correspondingly, the risk of dominant genetic disorders in the child increases with increasing paternal age12,13. The most likely cause of the paternal age effect is the increasing number of cell divisions in the male germ line14. Whereas oocytes are produced early in a woman's life and have a fixed number of genome replications, spermatogenic stem cells undergo continuous genome replication throughout a man's life. It has been estimated that the male germ line has experienced 160 genome replications in a 20-year-old male, with the number rising to 610 genome replications in a 40-year-old male15.
Mutation rate depends on local nucleotide context. Moreover, studies of somatic mutations in cancer have shown that observed mutation spectra can be decomposed into different mutational signatures that reflect particular cellular contexts of exogenous and endogenous mutagen exposure and the efficiency of different DNA repair pathways16. The germ line comprises a lineage of different cellular contexts, from the zygote to the gamete17 (Supplementary Fig. 1). Postzygotic mutations can potentially lead to germline mosaicism. Observing apparent DNMs shared by siblings, predominantly in studies of dominant disorders, has provided direct evidence for germline mosaicism18. Although recent studies have determined the average germline mutation rate and estimated the average effect of paternal age, a deeper understanding of germline mutational rates and spectra and the underlying mutational processes remains elusive. For example, it is not known whether mutation spectra differ between paternal and maternal germ lines, whether mutation rates and spectra vary significantly between families, or whether different stages of the cellular lineage from the zygote to the gamete differ in their mutation rates and spectra.
Here we investigated human germline mutations within and in comparisons of multi-sibling families. This approach allowed us to compare mutation rates and spectra between families and to detect instances of postzygotic mosaicism. We also investigated mutational processes and spectra more broadly by combining our data with previously published data sets.
1Wellcome Trust Sanger Institute, Hinxton, UK. 2Department of Human Genetics, Genentech, Inc., South San Francisco, California, USA. 3Department of
Bioinformatics and Computational Biology, Genentech, Inc., South San Francisco, California, USA. 4Institute of Cardiovascular and Medical Sciences, College of
Medical, Veterinary and Life Sciences, University of Glasgow, Glasgow, UK. 5Medical Research Institute, University of Dundee, Dundee, UK. 6Institute of Genetics
and Molecular Medicine, University of Edinburgh, Edinburgh, UK. 7A list of members and affiliations appears in the Supplementary Note. 8These authors contributed
equally to this work. Correspondence should be addressed to M.E.H. (meh@sanger.ac.uk).
Figure 1 Pedigrees of the sequenced families. Identifiers and relationship between individuals are shown for the three families in this study. Individuals who were sequenced are represented by circles or squares with solid outlines; other individuals are represented by circles or squares with dotted outlines. The ages of the mother and father at conception of each child and phasing information are summarized in the table. SFHS5165321 was only used for the part of the analysis related to mosaicism.
[Pedigree panels: Family 244, Family 569, Family 603. Sample identifiers shown: SFHS5165332, SFHS5165333, SFHS5165325, SFHS5165322, SFHS5165328, SFHS5165321, SFHS5165326, SFHS5165314, SFHS5165323, SFHS5165324, SFHS5165320, SFHS5165330, SFHS5165317.]
RESULTS
Family-specific paternal age effects
We sequenced the genomes of three multi-sibling families (Fig. 1). We discovered and validated 768 DNMs across the three families, with an average of 64 DNMs per child (range
[Table from Figure 1, values listed per child]
Age (years), mother: 23 25 27 35 24 27 31 34 37 26 28 34 38
Age (years), father: 25 27 29 37 24 27 31 34 37 23 25 31 35
De novo SNVs: 59 62 65 74 45 63 81 84 43 49 68 75
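As a quick consistency check on the Figure 1 table, the sketch below recomputes the total and mean number of de novo SNVs and fits an ordinary least-squares slope of DNM count on paternal age for two of the families. The grouping of columns into families is an assumption: it is inferred from the constant mother-father age gap within each block of columns and from the left-to-right panel order (244, 569, 603). The middle block lists five children but only four counts, so it is left out of the slope fit rather than guessing the alignment. This is a plain linear fit for illustration, not the model used in the study.

```python
# Values transcribed from the Figure 1 table. The split into families is an
# assumption inferred from the constant parental age gap within each block.
father_age = {
    "family_244": [25, 27, 29, 37],
    "family_603": [23, 25, 31, 35],
}
dnm_count = {
    "family_244": [59, 62, 65, 74],
    "family_569": [45, 63, 81, 84],  # five children but four counts: not aligned to ages
    "family_603": [43, 49, 68, 75],
}


def ols_slope(x, y):
    """Ordinary least-squares slope of y on x."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx


all_counts = [c for counts in dnm_count.values() for c in counts]
total_dnms = sum(all_counts)
mean_dnms = total_dnms / len(all_counts)

# Additional DNMs per year of paternal age, for the two unambiguous families.
slopes = {fam: ols_slope(father_age[fam], dnm_count[fam]) for fam in father_age}

print(total_dnms, mean_dnms)
for fam, slope in slopes.items():
    print(fam, round(slope, 2))
```

Under these assumptions the total is 768 and the mean is 64, matching the text, and the two fitted slopes (about 1.2 and 2.7 additional DNMs per year of paternal age) differ by a little more than twofold.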
[Figure plots. Axes: detection power versus level of mosaicism in the parents' blood (%); site categories DNM, M, P, S and SM/SP.]
[Figure caption fragment] excess of alternative reads in the mother's blood; P, mosaic sites with a significant excess of alternative reads in the father's blood; S, sites that are shared by the siblings but for which an excess of alternative reads could not be detected in blood from either parent; SM and SP, mosaic sites shared by the siblings for which a significant excess of alternative reads was detected in the mother's blood (SM; pink dots) or father's blood (SP; dark blue dot).
that we have ~80% power to detect a mosaic variant present in 1% of parental blood cells and ~90% power to detect a variant present in 2% of parental blood cells.
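A power estimate of this kind can be sketched with a simple binomial read-count model: find the smallest number of alternative reads that would be significant against sequencing error alone, then ask how often a true mosaic variant reaches that count. Everything numeric here is an assumption made for illustration: the depth of 500 reads, the error rate of 10⁻³, the Bonferroni factor of two tests per DNM, and the choice to treat a variant present in a fraction f of cells as contributing f/2 of reads (heterozygous in the mosaic cells). The sketch is not expected to reproduce the ~80% and ~90% figures, which depend on the study's own site-specific error rates and coverage.

```python
from math import comb


def upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), computed as 1 - P(X <= k - 1)."""
    lower = sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k))
    return max(0.0, 1.0 - lower)


def min_significant_reads(depth, error_rate, alpha):
    """Smallest alternative-read count that is significant against error alone."""
    for k in range(depth + 2):
        if upper_tail(k, depth, error_rate) < alpha:
            return k
    raise ValueError("no read count reaches significance; alpha is too small")


def detection_power(cell_fraction, depth, error_rate, alpha):
    """Probability that a variant in `cell_fraction` of cells reaches significance."""
    alt_read_fraction = cell_fraction / 2  # assumption: heterozygous in mosaic cells
    k = min_significant_reads(depth, error_rate, alpha)
    return upper_tail(k, depth, alt_read_fraction)


# Illustrative settings (assumptions, not the study's values).
DEPTH = 500
ERROR_RATE = 1e-3
ALPHA = 0.05 / (2 * 768)  # Bonferroni over two parental tests per validated DNM

power_1pct = detection_power(0.01, DEPTH, ERROR_RATE, ALPHA)
power_2pct = detection_power(0.02, DEPTH, ERROR_RATE, ALPHA)
print(power_1pct, power_2pct)
```

The useful property of the sketch is qualitative: power rises with the level of mosaicism and falls as the error rate or the multiple-testing burden increases.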
Six of the ten DNMs shared by siblings also exhibited parental somatic mosaicism, which is a significant enrichment in comparison to mutations observed in a single sibling (P = 4.6 × 10⁻⁷, Fisher's exact test). Four DNMs were shared by siblings without excess alternative reads in parental blood. Hence, these mutations either occurred after the separation of the germ line and soma or correspond to parental somatic mosaicism below detectable levels. In total, 29 of the validated DNMs had evidence of parental germline mosaicism (Table 1). Correcting for our incomplete power to detect mosaic mutations (Fig. 3a) suggests that 4.2% of germline mutations may be mosaic in >1% of parental blood cells (Online Methods). Of the parental mosaic DNMs, 64% (16/25) were maternal in origin. This is compatible with a 1:1 ratio of paternal and maternal somatic mosaicism but represents a significantly different ratio of parental origin than the paternal bias observed for all 768 DNMs (P = 7.7 × 10⁻⁶, binomial test). This ratio is not likely to be due
[Figure 4 plots, panels a to c. y axis: proportion of DNMs; x axis: mutation type.]
Figure 4 Mutational spectra. (a) Frequency of all mutation types in the catalog of 6,570 high-confidence DNMs. (b) Difference in the frequency of
maternal and paternal mutations for the subset of DNMs with phasing information (n = 556). (c) Difference in the frequency of mutations of children
from fathers younger and older than 30 years (n = 680). Error bars, 95% confidence intervals.
to differential sequencing coverage for mothers and fathers (Supplementary Fig. 3).

Germline mutational spectra
We compiled a catalog of 6,570 high-confidence DNMs from 109 trios based on six different sources, including the families we sequenced for this project (Supplementary Table 3). All DNMs were called from whole-genome sequencing data. For 10% of the mutations, data on parental origin were available.
We used this catalog to evaluate evidence for distinct germline mutational processes. Low-resolution mutational spectra, which we define as the relative frequencies of the six possible point mutations, confirmed the expected preponderance of transitions over transversions (Fig. 4a). There was no significant difference between the spectra of maternal and paternal mutations (P = 0.19, χ² test; Fig. 4b). Even though there was a significant difference in the magnitude of the paternal age effect between the three families, there was no significant difference between their mutational spectra (P = 0.925, χ² test) nor between the spectra of DNMs for children born to younger and older fathers (P = 0.83, χ² test; Fig. 4c).
As an independent assessment of potential differences in maternal and paternal mutation spectra, we contrasted variants identified on the X and Y chromosomes in a genome-wide sequencing data set based on 2,453 individuals from the UK10K project. All variation on the Y chromosome arose in the male germ
the mutation spectra observed for DNMs19, as the ratio of C:G>T:A and T:A>C:G transitions decreased dramatically with increasing derived allele frequency, most likely because of biased gene conversion20 (Supplementary Fig. 4). We did not observe any statistically significant difference (P = 0.10, χ² test) in the X-chromosome and Y-chromosome mutation spectra (number of variants = 3,217) after accounting for differences in base composition between the chromosomes (Online Methods and Supplementary Fig. 5). This confirms our observation above that, despite differences in mutation rates, numbers of genome divisions and cellular contexts, the mutation spectra in the maternal and paternal germ lines are very similar.
To investigate the contribution to germline mutation of 30 previously identified and validated mutational signatures operative in somatic lineages leading to cancer16, we characterized higher-resolution mutational spectra. For this analysis, we calculated the relative frequency of mutations at the 96 triplets defined by the mutated base and the bases flanking it on each side (Fig. 5a). The spectrum observed for germline mutations clearly recapitulated the known higher mutability of CpG dinucleotides.
We evaluated whether any combination of the 30 previously identified signatures16 was sufficient to explain the observed pattern of
[Figure 5 plots. Panel a: frequency by mutation type (C>A, C>G, C>T, T>A, T>C, T>G). Panel b: correlation of individual mutational signatures, labeled 1 to 30. Panel c: correlation of pairs of signatures.]
[Figure 5 caption fragment] the 3′ and 5′ nucleotides flanking the mutation. We note that C:G>T:A and T:A>C:G transitions are more common. Within these categories, CpG site mutations are particularly frequent. (b) Correlation of mutational signatures with observed mutations in the mutational catalog. Correlation is shown for each of the 30 signatures, with signatures 1 and 5 highlighted in orange. (c) Combination of all possible pairs of signatures; the combination of signatures 1 and 5 is indicated with an arrow.
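The 96 mutation classes used for the higher-resolution spectrum can be enumerated directly: six substitution types times four possible 5′ bases and four possible 3′ bases. The sketch below builds those labels and assigns a substitution to its class from the reference trinucleotide and the alternative base, following the common convention of reporting each substitution from the pyrimidine strand. The label format (for example A[C>T]G) and the function names are illustrative choices, not taken from the study.

```python
from collections import Counter
from itertools import product

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
# Six substitution types, written from the pyrimidine (C or T) strand.
SUBSTITUTIONS = ["C>A", "C>G", "C>T", "T>A", "T>C", "T>G"]


def reverse_complement(seq):
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def all_channels():
    """Return the 96 class labels, e.g. 'A[C>T]G'."""
    return [
        f"{five}[{sub}]{three}"
        for sub in SUBSTITUTIONS
        for five, three in product("ACGT", repeat=2)
    ]


def channel(trinucleotide, alt):
    """Classify one substitution from its reference trinucleotide and alt base."""
    tri, alt = trinucleotide.upper(), alt.upper()
    if len(tri) != 3 or len(alt) != 1 or any(b not in COMPLEMENT for b in tri + alt):
        raise ValueError("expected a 3-base context and a single alt base (A/C/G/T)")
    if tri[1] == alt:
        raise ValueError("alt base equals the reference base")
    if tri[1] in "AG":
        # Purine reference base: report the change on the opposite strand.
        tri, alt = reverse_complement(tri), COMPLEMENT[alt]
    return f"{tri[0]}[{tri[1]}>{alt}]{tri[2]}"


def spectrum(mutations):
    """Relative frequency of each of the 96 classes for (trinucleotide, alt) pairs."""
    counts = Counter(channel(tri, alt) for tri, alt in mutations)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no mutations supplied")
    return {label: counts[label] / total for label in all_channels()}
```

For instance, `channel("ACG", "T")` and `channel("CGT", "A")` describe the same C:G>T:A change at a CpG site seen from opposite strands, and both map to `A[C>T]G`.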
Figure 6 Mutation rate model during gametogenesis. Comparison of mutation rates between spermatogenesis (blue box) and oogenesis (red box). p and m are the
[Figure 6 panel labels, as listed: pre-PGC, 10 cell divisions; post-PGC, 24 cell divisions; post-puberty, 23 cell divisions; pre-PGC, 10 cell divisions; post-PGC, 20 cell divisions.]
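The per-division rates in this model follow from simple ratios of mutation counts to cell divisions. The sketch below redoes that arithmetic using the ranges quoted in the Discussion: about 2 to 6 pre-PGC mutations over about 10 divisions, about 2 to 4 paternal mutations per year over 23 divisions per year after puberty, about 10 to 14 maternal mutations over about 20 post-PGC divisions, and about 19 paternal mutations at puberty with about 24 divisions in the intermediate phase. The function and variable names are illustrative, and the inputs are the rounded values from the text rather than the underlying data.

```python
def rate_range(mutations_low, mutations_high, divisions):
    """Lower and upper mutation rate per cell division (haploid genome)."""
    return (mutations_low / divisions, mutations_high / divisions)


# Early embryonic divisions before PGC specification (both sexes).
pre_pgc = rate_range(2, 6, 10)

# Post-pubertal spermatogenesis: paternal mutations gained per year,
# spread over the divisions that occur in one year.
post_puberty_male = rate_range(2, 4, 23)

# Oogenesis after PGC specification.
post_pgc_female = rate_range(10, 14, 20)

# Intermediate paternal phase: mutations present at puberty minus the
# pre-PGC mutations, spread over the post-PGC prenatal divisions.
paternal_at_puberty = 19
intermediate_mutations = (paternal_at_puberty - 6, paternal_at_puberty - 2)
post_pgc_male = rate_range(*intermediate_mutations, 24)

for name, (low, high) in [
    ("pre-PGC", pre_pgc),
    ("post-PGC, male", post_pgc_male),
    ("post-puberty, male", post_puberty_male),
    ("post-PGC, female", post_pgc_female),
]:
    print(f"{name}: {low:.2f} to {high:.2f} mutations per cell division")
```

The ratios come out at roughly 0.2 to 0.6 (pre-PGC), 0.5 to 0.7 (post-PGC in both sexes) and 0.09 to 0.17 (post-pubertal spermatogenesis), which are the ranges stated in the text.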
germline mutations (Fig. 5b). Two of the mutational signatures, previously termed signature 1 (25% of DNMs) and signature 5 (75% of DNMs), explained the majority of the observed mutational patterns (Pearson correlation = 0.98; Fig. 5c). Including any additional mutational signatures did not significantly improve this correlation. Signature 1 is characterized by C:G>T:A mutations at CpG dinucleotides, whereas signature 5 is predominately characterized by T:A>C:G mutations (Supplementary Fig. 6). These signatures are responsible for generation of the majority of spontaneous preneoplastic somatic mutations16, indicating that the mutational processes underlying these signatures in somatic cells are also operative in the germ line.
Methylated CpG sites spontaneously deaminate, leading to TpG sites and increasing the number of C:G>T:A mutations21. To test whether methylation status in the germ line has a detectable impact on mutations, we obtained cell line methylation data for three cell types that had been generated by reduced-representation bisulfite sequencing as part of the Encyclopedia of DNA Elements (ENCODE) Project22. In the testis cell line, 25.3% of CpG sites had more than 50% of reads methylated (Supplementary Table 4). Thirteen of these sites overlapped with DNMs from our catalog, of which 12 had more than 50% of reads methylated. This means that, in the testis cell line, methylated CpG sites are significantly more likely to mutate than unmethylated ones (P = 1.71 × 10⁻⁸, binomial test). All 12 of the DNMs that were methylated in the testis cell line were CpG>TpG mutations (Supplementary Table 5). For B-lymphocyte and embryonic stem cell lines, the association between methylation status and mutation was less significant (P = 0.04 and 2.39 × 10⁻⁶, respectively).

DISCUSSION
We sequenced the genomes of three multi-sibling families, identified candidate DNMs and validated 768 of them by targeted resequencing. Both the average genome-wide mutation rate of 1.28 × 10⁻⁸ mutations per nucleotide per generation and the ratio of paternal to maternal mutations (3.5) are slightly higher than but compatible with previous estimates6. On average, the number of mutations in the child increased approximately linearly by 2.9 mutations with each additional year in the parents' ages. The magnitude of this effect differed by a factor of greater than two between families. Although our observations corroborate a previous study6 that proposed that the major factor influencing the number of mutations in a child is paternal age, our multi-sibling study design allows detection of more subtle differences between families. Given that the increase in the number of mutations with parental age is driven by paternal mutations, we suggest that this observation could result from variation among males either in the rate of turnover of spermatogenic stem cells or in the mutation rate per cell division. A recent review noted that the strength of the paternal age effect differs between studies23. Although this variation could be due to study design or analysis choices, our results highlight a more interesting possibility, namely that the paternal age effect actually differed between the studies because of the families included, with most of the studies having a limited sample size.
We observed no difference in the mutation spectra for the maternal and paternal germ lines or with younger and older fathers. The lack of large differences in mutation spectra between the sexes is perhaps counterintuitive given the different cellular contexts in the maternal and paternal germ lines, including the marked difference in the number of cell divisions and thus the increased potential for replication-associated mutations in the paternal germ line. Larger catalogs of paternal and maternal mutations will be required to identify any subtler differences in germline mutation spectra.
We have shown that a combination of two previously identified mutational signatures operative in somatic cell lineages is sufficient to explain the observed mutational spectrum of germline mutations. These two mutational signatures were originally extracted from somatic mutations derived from diverse cancer genomes and thus likely reflect mutation processes operative across somatic tissues16. This high concordance between the germ line and the soma suggests that the mutation processes underlying these two signatures are associated with maintenance and replication of DNA in all cells. The generality of these two signatures and their underlying mutation processes across diverse cellular contexts likely explains our observation of an absence of appreciable age- or sex-dependent variation in mutation spectrum. Nonetheless, despite this genome-wide concordance across different cellular lineages, our observation of increased mutation rate at sites known to be methylated in a testis-derived cell line demonstrates that DNA methylation and perhaps other cell type-specific factors have a finer-grained role in influencing the precise location of mutations in specific cell types.
With regard to the timing of mutations in the cellular lineage of the germ line, we have shown that at least 3.8% of DNMs are mosaic in at least 1% of parental blood cells. This estimate represents a lower bound on the true proportion of DNMs that are mosaic in parental somatic tissues, as we only sampled a single somatic tissue and cannot exclude the possibility of mosaicism at very low levels (<1%)
in that tissue. This proportion is compatible with a recent estimate for parental somatic mosaicism of copy number variants24. We infer that DNMs that are mosaic in parental soma must have arisen early on during embryonic development of the parent (within the first 8–12 cell divisions25,26), before the specification of primordial germ cells (PGCs) and the concomitant separation of the germ line from the soma. Whereas all DNMs showed a 3.5:1 ratio of paternal to maternal mutations, these early mutations were compatible with a 1:1 ratio of paternal and maternal origin, as might be expected given the occurrence of these mutations before sexual differentiation of the embryo.
We note that our observations seem incompatible with monophyletic origins for the blood and germ line; instead, each tissue is likely to be founded by multiple cells with polyphyletic ancestry. A logical consequence is that some mesoderm founder cells are more closely related to PGCs within the cellular genealogy of the early embryo than they are to other mesoderm founder cells and vice versa.
One limitation of our study is not having complete ascertainment of all pre-PGC mutations. Mutations that arose in very early postzygotic divisions may well be present at such high frequencies within parental tissues that our analytical workflow for identifying candidate DNMs fails to identify them on the basis that such sites are much more likely to be inherited variants with a biased sampling of alleles. Moreover, pre-PGC mutations that arose in later cell divisions, only just before PGC specification, may be mosaic in parental somatic tissues at such low levels that our deep resequencing was unable to identify them. Nonetheless, the 20-fold difference in the levels of somatic mosaicism that we could detect suggests that we were able to detect pre-PGC mutations across at least four rounds of early embryonic cell division (2⁴ < 20).
Using the data we have generated on the paternal age effect and the prevalence of parental somatic mosaicism, we can interrogate the mutagenicity of different phases of gametogenesis. By assigning mutations to early embryonic cell divisions before PGC specification, we can estimate a credible range for the mutation rate in early cell divisions in parental germ lines. On the basis of sharing of pre-PGC mutations by gametes from the same parent, we can define a maximum and minimum number of pre-PGC cell divisions within which the observed pre-PGC mutations must have occurred, and from these estimates, an upper and lower bound on the mutation rate per cell division. Our data suggest that the pre-PGC mutation rate per cell division is in a range of ~0.2 to 0.6 (for a haploid genome) in both parental germ lines. The paternal age effect that we observed implies that a lower mutation rate per cell division, ranging from ~0.09 to ~0.17 (~2–4 paternal mutations per year derived from 23 cell divisions), operates during post-pubertal spermatogenesis. By contrast, oogenesis appears to be considerably more mutagenic than post-pubertal spermatogenesis, with a mutation rate per cell division of ~0.5 to ~0.7 (with ~10–14 maternal mutations arising during ~20 post-PGC cell divisions27). In the paternal germ line, we also need to consider an intermediate phase of cell division, during the proliferation and differentiation of PGCs to form prespermatogonia during prenatal development. This phase of spermatogenesis is contemporaneous with oogenesis in females. By extrapolating the paternal age effect, we can estimate the total number of paternal mutations at puberty (averaging across pedigrees and assuming no maternal age effect) to be ~19, and, by subtracting the number of pre-PGC mutations (~2–6 from ~10 divisions), we can estimate the number of paternal mutations that arise during this intermediate phase to be ~13–17. It has been estimated that there are ~24 cell divisions during this phase27, giving a mutation rate range per cell division of ~0.5–0.7, very similar to that observed during maternal PGC proliferation and differentiation to oogonia.
From these observations, we derive a tentative model of germline mutation rate during gametogenesis (Fig. 6), with two phases of oogenesis and three phases of spermatogenesis, wherein the mutation rate per cell division is higher during early embryogenesis and during PGC proliferation and differentiation during later embryogenesis and is reduced by ~3-fold during post-pubertal spermatogenesis. This model is consistent with prior inferences that the average mutation rate per cell division must be higher in the female germ line given the relative number of cell divisions and the ratio of paternal and maternal mutations, and this could be due to a lower error rate per cell division after puberty in males23. It has previously been suggested that the earliest embryonic divisions exhibit elevated mutagenicity with respect to structural variation28. Our data suggest that, for single-nucleotide variants (SNVs), the main step change in mutation rate per cell division may be between the embryonic and post-pubertal phases of gametogenesis in males, and a similar observation has been reported in mouse spermatogenesis29. If the model that we have proposed above proves to be correct, then it suggests that evolutionary selection may have acted to lower the mutation rate per cell division during post-pubertal spermatogenesis, perhaps achieving a selective balance between producing sufficient numbers of sperm to maintain fertility and minimizing the deleterious mutation rate.
It is important to note that the estimated ranges for the mutation rates per cell division presented above represent a combination of mutations that arise during genome replication and any spontaneous mutations occurring between cell divisions. The time interval between cell divisions differs markedly throughout the different phases of gametogenesis, and these mutation rate estimates therefore do not necessarily reflect the mutagenicity of genome replication in isolation.
We infer that germline DNMs that are mosaic in parental soma will also be mosaic in the germ line; indeed, we observed that the six parental somatic mosaic DNMs that were present in more than one child had significantly higher levels of somatic mosaicism, on average, than the other parental somatic mosaic DNMs that were not present in more than one child (P = 0.009, Mann-Whitney test). This suggests that the extent of somatic mosaicism correlates with the extent of germline mosaicism and, hence, the probability that a DNM will be observed recurrently among children.
We identified four DNMs that were shared by siblings and thus are highly likely to be mosaic in the parental germ line, although we observed no evidence for accompanying somatic mosaicism in parental blood. We infer that these mutations may have arisen in early cell divisions post-PGC specification and thus mosaicism is restricted to the germ line.
Previous studies of the germline mosaicism of sequence variants have been largely limited to case studies of sibling recurrence of pathogenic DNMs30–34. Our estimate of 1.3% for the average recurrence probability is compatible with those empirical studies, but they are not compatible with recent lower estimates of recurrence risks derived from theoretical modeling of the cellular genealogy of the germ line35. We note that these recurrent DNMs shared by siblings were not randomly distributed across families but were significantly (P < 0.01) enriched in one pedigree. This suggests that there may also be significant variation across families in patterns of germline mosaicism of DNMs.
These results on germline mosaicism have implications for genetic counseling on recurrence risks for families with children with genetic disorders caused by DNMs17. Although the currently used recurrence risk of ~1% is supported by our findings, our data suggest that this
represents an average across DNMs with very different recurrence risks. Whereas only 1.3% of all DNMs were observed recurrently among siblings, this proportion increased to 24% for DNMs that were mosaic in >1% of parental blood cells and 50% for DNMs that were mosaic in >6% of parental blood cells. Our data suggest that deep sequencing of parental blood for pathogenic DNMs seen in children should enable meaningful stratification of families into a substantial majority with <1% recurrence risks and a small minority with recurrence risks that could be at least an order of magnitude higher. Considerably more data will be required to enable more precise quantitative estimates of recurrence risks given an observed extent of parental somatic mosaicism.
Our data also show that, in the absence of deep sequencing of parental somatic tissue(s), knowing the parental origin of a DNM alters the recurrence risk, with maternal mutations likely having a ~3- to 4-fold higher recurrence risk, on average, than paternal mutations. As noted previously24, the higher probability of germline mosaicism for maternally derived DNMs results in a higher recurrence risk, on average, for DNMs causing X-linked recessive disorders than for autosomal dominant disorders.

http://www.uk10k.org. Funding for UK10K was provided by the Wellcome Trust under award WT091310. Data can be accessed at the European Genome-phenome Archive (EGA) under accessions EGAS00001000108 and EGAS00001000090.

AUTHOR CONTRIBUTIONS
R.R., A.W. and M.E.H. developed analytical methods and/or analyzed sequencing data. R.R. performed mutation rate estimation, family comparison, analysis of germline mosaicism and validation. A.W. performed meta-analysis of the DNMs for mutational spectrum and methylation status. S.J.L. and R.J.H. contributed toward phasing and the detection and validation of DNMs. L.B.A. performed mutational signature analysis. S.A.T. contributed to whole-genome data analysis. A.D., A.M., D.P. and B.S. provided blood samples for the Scottish Family Health Study. M.R.S. advised on mutational processes. The UK10K Consortium contributed sequences for meta-data analysis. R.R., A.W. and M.E.H. wrote the manuscript. M.E.H. supervised the project.

COMPETING FINANCIAL INTERESTS
The authors declare no competing financial interests.

Reprints and permissions information is available online at http://www.nature.com/reprints/index.html.

1. Lindahl, T. & Wood, R.D. Quality control by DNA repair. Science 286, 1897–1905 (1999).
29. Walter, C.A., Intano, G.W., McCarrey, J.R., McMahan, C.A. & Walter, R.B. Mutation frequency declines during spermatogenesis in young mice but increases in old mice. Proc. Natl. Acad. Sci. USA 95, 10015–10019 (1998).
30. Liu, G. et al. Maternal germline mosaicism of kinesin family member 21A (KIF21A) mutation causes complex phenotypes in a Chinese family with congenital fibrosis of the extraocular muscles. Mol. Vis. 20, 15–23 (2014).
31. Anazi, S., Al-Sabban, E. & Alkuraya, F.S. Gonadal mosaicism as a rare cause of autosomal recessive inheritance. Clin. Genet. 85, 278–281 (2014).
32. Dhamija, R. et al. Novel de novo heterozygous FGFR1 mutation in two siblings with Hartsfield syndrome: a case of gonadal mosaicism. Am. J. Med. Genet. A. 164A, 2356–2359 (2014).
33. Tajir, M. et al. Germline mosaicism in Rubinstein-Taybi syndrome. Gene 518, 476–478 (2013).
34. Bachetti, T. et al. Recurrence of CCHS associated PHOX2B poly-alanine expansion mutation due to maternal mosaicism. Pediatr. Pulmonol. 49, E45–E47 (2014).
35. Campbell, I.M. et al. Parent of origin, mosaicism, and recurrence risk: probabilistic modeling explains the broken symmetry of transmission genetics. Am. J. Hum. Genet. 95, 345–359 (2014).
36. Wang, J., Fan, H.C., Behr, B. & Quake, S.R. Genome-wide single-cell analysis of recombination activity and de novo mutation rates in human sperm. Cell 150, 402–412 (2012).
passed filtering and resequenced the resulting pulldown library using Illumina sequencing to 139× coverage on average (range of 88–191). We designed baits to cover a 200-bp window around each candidate site. The bait design succeeded for 4,141 sites. To analyze the validation data, we classified each putative DNM into one of three categories (germline DNM, inherited variant or false positive) and evaluated the likelihood of the data under each model. The three models are defined below. In addition, 37 of the DNMs were removed after manual inspection in the Integrative Genomics Viewer (IGV) genome browser.

Model 1: germline DNM. We defined the likelihood of the data under the DNM model as

$$\mathrm{LL.DNM} = \mathrm{Pois}(m_m, m_T e) + \mathrm{Pois}(d_m, d_T e) + \mathrm{Bin}(c_m, c_T, 0.5)$$

where $m_m$, $d_m$ and $c_m$ are the number of reads supporting the mutant allele (mostly the alternative allele) in the mother, father and child, respectively. $m_T$, $d_T$ and $c_T$ are the total number of reads in the mother, father and child, respectively, and $e$ is the sequencing error rate.

Model 2: inherited variant. The likelihood that the variant is inherited is defined as

$$\mathrm{LL.I} = \max(\mathrm{LL.IFM}, \mathrm{LL.IFD}, \mathrm{LL.IFMD})$$

where LL.IFM, LL.IFD and LL.IFMD refer to the likelihood that the variant is maternally inherited, paternally inherited or inherited from both parents.

$$\mathrm{LL.IFM} = \mathrm{Bin}(m_m, m_T, 0.5) + \mathrm{Pois}(d_m, d_T e) + \mathrm{Bin}(c_m, c_T, 0.5)$$

$$\mathrm{LL.IFD} = \mathrm{Pois}(m_m, m_T e) + \mathrm{Bin}(d_m, d_T, 0.5) + \mathrm{Bin}(c_m, c_T, 0.5)$$

$$\mathrm{LL.IFMD} = \mathrm{Bin}(m_m, m_T, 0.5) + \mathrm{Bin}(d_m, d_T, 0.5) + \mathrm{Bin}(c_m, c_T, 0.5)$$

Model 3: false positive. Model 3 is written as

$$\mathrm{LL.FP} = \mathrm{Pois}(m_m, m_T e) + \mathrm{Pois}(d_m, d_T e) + \mathrm{Pois}(c_m, c_T e)$$

Correction of the mutation rate. The correction accounts for the part of the genome that we could not interrogate because of insufficient depth in low-complexity regions, filtering procedures to exclude false positives and failed validation. To take into account the different karyotypes of the male and female genomes, the precise form of the correction depends on the sex of the proband

Girls: (1 − noCvg) (1 − filtered) (1 − noVal ppAdjust)
2 valDNM/genome length

Boys: (1 − noCvg) (1 − filtered) (1 − noVal ppAdjust)
2 valDNM + valDNMX/genome length

than one offspring from the same family.

Method 2: identification by excess of alternative reads in a parent. Potential parental germline mosaic events were further investigated for the 768 validated DNMs by identifying instances of a significant excess of reads supporting the alternative allele in one of the parents. To improve our power to detect candidate germline mosaic sites, we performed an additional MiSeq run of the custom pulldown library we previously used for validation, which resulted in an average coverage of 500× for validated DNMs (n = 768). The site-specific error rate for each DNM was estimated by dividing the total number of reads supporting the alternative allele by the total number of reads in all non-related individuals from the two families in which the DNM was not discovered. Hence, the probability that the observed number of parental alternative allele reads resulted from sequencing error was calculated as follows

$$p_{\mathrm{maternal}} = \mathrm{Bin}(m_{\mathrm{alt}}, m_{\mathrm{alt+ref}}, e)$$

$$p_{\mathrm{paternal}} = \mathrm{Bin}(f_{\mathrm{alt}}, f_{\mathrm{alt+ref}}, e)$$

where $m$ and $f$ are the number of reads in the mother and father, respectively, alt and ref are the alternative and reference alleles, respectively, and $e$ is the site-specific error rate. Both maternal and paternal P values for each DNM were adjusted for multiple testing using Bonferroni correction. Sites that were significant at adjusted P < 0.05 were considered to be mosaic. In total, 24 mosaic sites were validated using this method. Six of these were also discovered by the sibling recurrence method described above.

Estimation of recurrence risk. The probability of an apparent DNM being present in more than one sibling in the same family was calculated as the number of instances of a mutation being shared by two siblings divided by the number of pairwise comparisons between two siblings in all three families (Supplementary Table 2).

Validation of DNMs mosaic in parents. We carried out further independent validation of 40 candidate parental mosaic DNMs (Supplementary Data Set 1) using PacBio amplicon sequencing. These 40 candidate mosaic DNMs were selected as follows: ten DNMs that were shared by siblings (for six of these shared DNMs, we had previously identified a significant parental excess of reads for alternative alleles, as described above) and 30 candidate mosaic sites that had an excess of reads for alternative alleles in a parent's blood, with nominal P value < 0.05. Note that the set of 30 candidates was based on nominal significance rather than Bonferroni-corrected significance and so represents a less stringent set of candidate mosaic DNMs.

Primers were designed using Primer3 (ref. 39) to generate amplicons with an average length of 250 bp, with the candidate mosaic site in the middle of the amplicon. For each candidate mosaic site, amplicons were prepared for the
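The three validation models (germline DNM, inherited variant, false positive) can be illustrated with a minimal Python sketch. It rests on several assumptions that the text does not spell out: that the "+" in each LL expression sums log-probabilities, that Pois(k, λ) and Bin(k, n, p) are the usual Poisson and binomial probability mass functions, and that the child term under the DNM model is Bin(c_m, c_T, 0.5), matching the heterozygous-child term of the inherited models. The function names and read counts are hypothetical.

```python
from math import lgamma, log

def log_pois(k, lam):
    """Log Poisson probability of k events with mean lam (requires lam > 0)."""
    return k * log(lam) - lam - lgamma(k + 1)

def log_bin(k, n, p):
    """Log binomial probability of k successes in n trials (requires 0 < p < 1)."""
    return (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
            + k * log(p) + (n - k) * log(1 - p))

def classify_site(mm, mT, dm, dT, cm, cT, e):
    """Return the best-supported model and all log-likelihoods for one trio.

    mm, dm, cm: mutant-allele reads in mother, father, child.
    mT, dT, cT: total reads in mother, father, child.
    e: sequencing error rate (must be > 0 in this sketch).
    """
    ll_dnm = log_pois(mm, mT * e) + log_pois(dm, dT * e) + log_bin(cm, cT, 0.5)
    ll_ifm = log_bin(mm, mT, 0.5) + log_pois(dm, dT * e) + log_bin(cm, cT, 0.5)
    ll_ifd = log_pois(mm, mT * e) + log_bin(dm, dT, 0.5) + log_bin(cm, cT, 0.5)
    ll_ifmd = log_bin(mm, mT, 0.5) + log_bin(dm, dT, 0.5) + log_bin(cm, cT, 0.5)
    ll_fp = log_pois(mm, mT * e) + log_pois(dm, dT * e) + log_pois(cm, cT * e)
    scores = {
        "DNM": ll_dnm,
        "inherited": max(ll_ifm, ll_ifd, ll_ifmd),  # LL.I
        "false_positive": ll_fp,
    }
    return max(scores, key=scores.get), scores

# Hypothetical read counts: (mm, mT, dm, dT, cm, cT, e)
print(classify_site(0, 100, 0, 120, 60, 130, 0.005)[0])   # DNM
print(classify_site(50, 100, 0, 120, 60, 130, 0.005)[0])  # inherited
print(classify_site(0, 100, 0, 120, 1, 130, 0.005)[0])    # false_positive
```

Working in log space keeps the comparison stable at the coverage used for validation, where the raw probabilities would underflow.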
(iii) not validated, comprising sites where we had >90% power to detect the alternative alleles in the mosaic parents but did not detect them.

We classified 29 of the set of 40 candidate sites as parentally mosaic. Four mosaic DNMs were shared by siblings from the same family, but we could not observe alternative alleles in either parent in either validation data set (MiSeq or PacBio). Sixteen sites were validated as mosaic, with the mosaic parent confirmed on both platforms (all of these sites had a significant P value for the MiSeq data after Bonferroni correction). One additional site with a nominally significant P value but that was not significant when considering the adjusted P value for the MiSeq data was confirmed to be mosaic in PacBio data. Two sites were confirmed to be mosaic on the basis of significant adjusted P values from the MiSeq data only, as they failed the PacBio experiment. Six sites were confirmed to be mosaic on the basis of MiSeq data only (with significant adjusted P values), as their mosaicism level was below detection power in PacBio analysis. For the remaining 11 sites, despite their having nominally significant P values, the adjusted MiSeq P values were not significant and the PacBio data were inconclusive (Table 1 and Supplementary Data Set 1).

In summary, we attempted further experimental validation of 40 candidate mosaic sites by conducting deep amplicon sequencing (158× mean coverage per individual) in blood from the child, mother and father using the PacBio platform. This validation experiment confirmed the presence of reads for the alternative allele in parental blood-derived DNA at 100% of the DNMs (n = 9) where the PacBio data had >90% power to detect the level of mosaicism observed in the MiSeq data. Furthermore, we observed 100% concordance (n = 14) between the parental origin determined by a significant excess of reads for the alternative allele in maternal or paternal blood and that determined by phasing the DNM onto a parental haplotype.

Correction for mosaic power detection. To estimate the number of mosaic sites that we failed to detect because of power limitations, we ran 1,000 simulations across our 768 validated DNMs with their given coverage (from MiSeq sequencing) for a range of mosaicism levels. We calculated the number of sites with >2% mosaicism that we failed to identify. For this calculation, we defined two bins for the mosaic level (2–4% and >4.0%). The average number of undetectable mosaic sites was calculated as a product of the number of mosaic sites and the average detection power for each bin. Hence, the proportion of germline mosaic sites after power adjustment is ~4% (31/768) of the validated DNMs.

Parent of origin. To study the effect of parental age and sex on germline mutations, we determined the parental origin of each validated germline DNM using three approaches.

First, we used DeNovoGear's read-pair algorithm1 to obtain parental phasing information. In short, this algorithm determines the parent of origin if haplotype-informative sites are present in phase with the mutation in the child

wells with single-molecule amplification, then the raw genotype data from the 48 amplified single molecules were analyzed in two ways. First, haplotype inference was obtained from examining peak height correlations between the genotype calls for the putative DNM and the adjacent informative SNP, and the clustering of calls was observed using an in-house script. Second, genotype calls (or peak heights pertaining to genotype calls) from the same well were counted for each locus, and the haplotype was derived from a likelihood-ratio test as detailed by Konfortov et al.41.

Mutational catalog. We generated a catalog of human DNMs on the basis of previously published high-confidence mutations obtained by whole-genome sequencing (Supplementary Table 3). Only single-nucleotide DNMs were included. Where necessary, we used the liftOver tool to convert coordinates from NCBI Build 36 to Build 37.

Mutational spectra and signatures. Mutational spectra were derived directly from the reference and alternative (or ancestral and derived) alleles at each variant site. The resulting spectra are composed of the relative frequencies of the six distinguishable point mutations (C:G>T:A, T:A>C:G, C:G>A:T, C:G>G:C, T:A>A:T and T:A>G:C). The significance of the differences between mutational spectra was assessed by comparing the number of mutations for the six mutation types in the two spectra by means of a χ² test (5 degrees of freedom).

Mutational signatures were detected by refitting previously identified consensus signatures of mutational processes16. All possible combinations of at most seven mutational signatures were evaluated by minimizing the constrained linear function

$$\min_{\mathrm{Exposure}_i \ge 0} \left\lVert \mathrm{DNMs} - \sum_{i=1}^{N} \left( \mathrm{Signature}_i \times \mathrm{Exposure}_i \right) \right\rVert$$

Here DNMs and Signature$_i$ represent vectors with 96 components corresponding to the six types of SNVs and their immediate sequence context, and Exposure$_i$ is a non-negative scalar reflecting the number of mutations contributed by this signature. N reflects the number of signatures being refitted, and all possible combinations of consensus mutational signatures for N values between 1 and 7 were examined, resulting in 2,804,011 solutions. A model selection framework based on the Akaike information criterion was applied to these solutions to select the optimal decomposition of mutational signatures.

Diversity and divergence data. Diversity data were based on 2,453 individuals who were whole-genome sequenced to 6–8× depth as part of the ALSPAC and TwinsUK cohorts within the UK10K project. Ancestral alleles were defined by a maximum-parsimony approach as those that appeared in the majority
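As a quick arithmetic check on the number of signature-refitting solutions reported in the mutational signature analysis, the sketch below counts every way of choosing between 1 and 7 signatures from a fixed catalog. The catalog size of 30 consensus signatures is an assumption: it is not stated in the text and is used here only because it reproduces the reported 2,804,011 solutions. The function name is hypothetical.

```python
from math import comb

def n_signature_subsets(n_signatures, max_size):
    """Number of ways to choose between 1 and max_size signatures."""
    return sum(comb(n_signatures, k) for k in range(1, max_size + 1))

# Assumed catalog of 30 consensus signatures, N between 1 and 7
total = n_signature_subsets(30, 7)
print(total)  # 2804011
```

Each of these subsets corresponds to one constrained fit, which is why a model selection step is needed to choose among them.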
mosomal nucleotide composition was performed by counting the number of each of the four nucleotides in the interrogated regions of each chromosome.

For each variant, we determined the ancestral and derived alleles. For each variant type, we then divided the number of variants by the number of nucleotides that matched the ancestral allele (Supplementary Fig. 5).

40. Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).
41. Konfortov, B.A., Bankier, A.T. & Dear, P.H. An efficient method for multi-locus molecular haplotyping. Nucleic Acids Res. 35, e6 (2007).
42. Wilson Sayres, M.A., Venditti, C., Pagel, M. & Makova, K.D. Do variations in substitution rates and male mutation bias correlate with life-history traits? A study of 32 mammalian genomes. Evolution 65, 2800–2815 (2011).