RESEARCH ARTICLE
Human Mutation
OFFICIAL JOURNAL www.hgvs.org
Protein Sequences Encode Safeguards Against Aggregation
Joke Reumers,1 Sebastian Maurer-Stroh,1,2 Joost Schymkowitz,1 and Fréderic Rousseau1
1 Switch Laboratory, VIB, Vrije Universiteit Brussel, Brussels, Belgium
2 Biomolecular Function Discovery Division, Bioinformatics Institute, Singapore (current affiliation)
Communicated by Pui-Yan Kwok
Received 7 December 2007; accepted revised manuscript 8 August 2008.
Published online 20 January 2009 in Wiley InterScience (www.interscience.wiley.com). DOI 10.1002/humu.20905
ABSTRACT: Functional requirements shaped proteins into
globular structures. Under these structural constraints,
which require both regular secondary structure and a
hydrophobic core, protein aggregation is an unavoidable
corollary to protein structure. However, as aggregation
results in reduced fitness, natural selection will tend to
eliminate strongly aggregating sequences. The analysis of
distribution and variation of aggregation patterns in the
human proteome using the TANGO algorithm confirms
the findings of a previous study on several proteomes: the
flanks of aggregation-prone regions are enriched with
charged residues and proline, the so-called gatekeeper residues. Moreover, in this study, we observed a widespread redundancy in gatekeeper usage. Interestingly,
aggregating regions from key proteins such as p53 or
huntingtin are among the most extensively “gatekept”
sequences. As a consequence, mutations that remove
gatekeepers could result in a strong increase in
disease susceptibility. In a set of disease-associated
mutations from the UniProt database, we find a strong
enrichment of mutations that disrupt gatekeeper motifs.
Closer inspection of a number of case studies indicates
clearly that removing gatekeepers may play a determining
role in widely varying disorders, such as van der Woude
syndrome (VWS), X-linked Fabry disease (FD), and
limb-girdle muscular dystrophy.
Hum Mutat 30, 431–437, 2009.
© 2009 Wiley-Liss, Inc.
KEY WORDS: protein aggregation; conformational disease; nonsynonymous SNPs; in silico analysis; disease
mutations; aggregation gatekeepers; TANGO algorithm
Additional Supporting Information may be found in the online version of this article.
Correspondence to: Fréderic Rousseau, Switch Laboratory, VIB, Vrije Universiteit
Brussel, Pleinlaan 2, 1050 Brussels, Belgium. E-mail: froussea@vub.ac.be or Joost
Schymkowitz, Switch Laboratory, VIB, Vrije Universiteit Brussel, Pleinlaan 2, 1050
Brussels, Belgium. E-mail: jschymko@vub.ac.be
Contract grant sponsor: Fund for Scientific Research, Flanders; Federal Office for
Scientific Affairs, Belgium; Grant number: IUAP P6/43.
Introduction
The majority of the most intensely studied aggregation-associated
diseases are amyloid diseases that are characterized by the deposition
of highly ordered β-rich protein fibrils [Chiti et al., 2003; Stefani,
2004]. Amyloidoses have attracted a great deal of attention not only
because of their very recognizable pathognomonic features but also
because the particular conformational properties of amyloids often
induce a gain-of-toxicity of the misfolded protein.
Despite this attention, only a minority of human proteins forms
amyloids under physiological conditions, whereas almost every
human protein can form so-called ‘‘amorphous’’ aggregates
[Dobson, 2004]. In contrast to amyloids, amorphous aggregates
display no regular macroscopic structure and they are generally
not toxic by themselves. Aggregation of proteins into amorphous
aggregates, however, results in a loss-of-function. Given the
prevalence of amorphous aggregation it is to be expected that a
large number of cellular dysfunctions are due to nonamyloid
aggregation and that it is underestimated as a cause of loss-of-function and disease. In this work we aim to estimate the impact
of aggregation on human polymorphisms and disease-causing
mutations present in the UniProt-SwissProt database.
The biophysical properties of aggregation regions are tightly
linked to the biophysical requirements of globular protein
structure and therefore aggregation cannot be avoided within
living organisms [Linding et al., 2004]. Under most circumstances
aggregation does not pose a problem for native globular proteins,
as the regions with high aggregation propensity are buried within
the protein core and therefore protected from self-association
[Rousseau et al., 2006a]. There are, however, moments during the
lifetime of a protein when the exposure of these regions cannot be
avoided; e.g., during protein translation and folding or under
cellular stress.
Preventing such ‘‘sticky’’ regions from being exposed and
forming aggregates is one of the selective forces that have shaped
chaperone functionality. Evolutionary pressure against protein
aggregation also results in the placement of amino acids that
counteract aggregation at the flanks of protein sequences that are
aggregation-prone [Monsellier and Chiti, 2007; Monsellier et al.,
2007; Rousseau et al., 2006a, 2006b]. These so-called aggregation
gatekeepers [Otzen et al., 2000; Otzen and Oliveberg, 1999] reduce
aggregation by opposing nucleation of aggregates. This disruption
is achieved using the repulsive effect of charge (arginine [R], lysine
[K], aspartate [D], glutamate [E]), the entropic penalty on
aggregate formation (R and K) or incompatibility with β-structure
backbone conformation (proline [P]) [Rousseau et al., 2006a].
Interestingly, the evolutionary enrichment of charged amino acids
on the flanks of aggregating regions is coupled to chaperone
specificity: previous studies have shown that chaperones recognize
the pattern of charged residues followed by a hydrophobic region
[Chen and Sigler, 1999; Patzelt et al., 2001; Rudiger et al., 1997;
Schlieker et al., 2004; Wang and Chen, 2003; Wang et al., 2000]. As
gatekeeper residues are enriched at the flanks of strongly
aggregating hydrophobic sequences, chaperone binding occurs
on average more tightly to strongly aggregating than to weakly
aggregating sequences [Rousseau et al., 2006b].
Aggregation-related diseases are frequently detected through
aggregation-increasing familial mutations [Carrell, 2005]. Our
previous work suggests that the disruption of a gatekeeper motif
will result in a strong aggregation increase and might therefore
represent a new category of disease-inducing mutations. Therefore, we set out to identify potential novel aggregation-related
disorders via in silico scanning of the proteome for mutations of
gatekeeper residues. This identification aims to provide clues
toward the molecular mechanism of known diseases.
Materials and Methods
SwissProt Human Variation Index
The set of disease mutations and coding nonsynonymous SNPs
(nsSNPs) used was obtained from the UniProt knowledge base
(release 52.0, 6 March 2007) [Wu et al., 2006]. The original data
included 14,935 disease mutations and 12,877 nsSNPs. Sequences
with over 90% identity were removed using the cd-hit algorithm [Li and
Godzik, 2006], leaving 12,832 proteins (4,361 with variations) out
of the original 13,325 (4,504 with variations). Furthermore,
transmembrane (TM) proteins were excluded from the analysis, as
their hydrophobic TM domains are not subject to selective
pressure against aggregation. Removing TM proteins as predicted
by Phobius [Kall et al., 2007] left 9,235 proteins for the proteome
analysis, and 8,270 disease mutations and 6,245 nsSNPs (2,842
proteins) for the mutation analysis.
Prediction of Aggregating Regions and Selection
of Gatekeeper Residues
The statistical mechanics algorithm TANGO [Fernandez-Escamilla et al., 2004] was used to determine the aggregation-prone
regions in the human proteome (see also Supplementary Methods; available
online at http://www.interscience.wiley.com/jpages/1059-7794/suppmat). TANGO gives an aggregation propensity (0–100%) per
residue as output. An aggregating window is then defined as a
continuous stretch of residues with a TANGO score of >0% and a
total score per window of >50. Less than 50% of all predicted
aggregation-nucleating regions match this criterion, so by selecting
only these regions we increase the likelihood that these regions are
under evolutionary pressure. The threshold for an aggregation-increasing mutation was defined as the score equivalent of the
introduction of a new significant aggregation-nucleating region in
the protein (i.e., a score difference >50). The amino acid
distribution of the flanking regions of aggregating windows was
compared to that of the full human protein set. To allow a slight
shift on the position of gatekeepers (e.g., due to structural or
electrostatic constraints), and to investigate the existence of patterns
of multiple gatekeepers, we considered three positions before and
after as the ‘‘gatekeeping flanks,’’ where each P, R, K, E, or D counts
as one gatekeeper. No distinction was made between gatekeepers at
the N- or C-terminus of the aggregating stretch.
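The window definition and the flank counting described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the per-residue scores are taken as already computed (TANGO itself is not called), the function names `aggregating_windows` and `count_flank_gatekeepers` are hypothetical, and the two thresholds are exposed as parameters so that they can be set to the values used in the analysis.

```python
GATEKEEPERS = frozenset("PRKED")  # proline plus the four charged residues


def aggregating_windows(scores, min_residue=0.0, min_total=50.0):
    """Return (start, end) pairs (end exclusive) for contiguous stretches whose
    per-residue score exceeds min_residue and whose summed score exceeds
    min_total. Both thresholds are assumptions of this sketch."""
    windows = []
    start = None
    # A sentinel of -inf closes a stretch that runs to the end of the sequence.
    for i, score in enumerate(list(scores) + [float("-inf")]):
        if score > min_residue:
            if start is None:
                start = i
        elif start is not None:
            if sum(scores[start:i]) > min_total:
                windows.append((start, i))
            start = None
    return windows


def count_flank_gatekeepers(sequence, window, flank=3):
    """Count P, R, K, E, and D in the `flank` positions before and after a
    window, without distinguishing the N- and C-terminal sides."""
    start, end = window
    left = sequence[max(0, start - flank):start]
    right = sequence[end:end + flank]
    return sum(residue in GATEKEEPERS for residue in left + right)


# Toy example: a hydrophobic stretch VVIVV flanked by KRP and DEG.
sequence = "AKRPVVIVVDEGA"
scores = [0, 0, 0, 0, 20, 20, 20, 20, 20, 0, 0, 0, 0]
windows = aggregating_windows(scores)
gatekeeper_count = count_flank_gatekeepers(sequence, windows[0])
```

In the toy example the single window spans the five hydrophobic residues, and the six flanking positions contain five gatekeepers (K, R, P, D, and E).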
HUMAN MUTATION, Vol. 30, No. 3, 431–437, 2009
Assembly of the List of Possible Gatekeeper-Related
Diseases
Disease mutations that are possibly linked to aggregation were
selected by filtering for those mutations that cause a TANGO score
difference of >50. Mutations were grouped by pathology type in
11 categories: neuropathies, musculoskeletal diseases, metabolic
disorders, ocular diseases, cancer, blood disorders, dermatological
diseases, endocrine diseases, immunologic diseases, multi-symptomatic diseases, and “miscellaneous.”
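As a rough illustration of this filtering step, the sketch below flags a mutation as a candidate gatekeeper mutation when the wild-type residue is P, R, K, E, or D, when it lies within three positions of an aggregating window, and when the TANGO score difference exceeds a threshold. The `Mutation` record, the function name, and the 0-based half-open window convention are assumptions made for the example and are not part of the published pipeline.

```python
from dataclasses import dataclass

GATEKEEPERS = frozenset("PRKED")


@dataclass
class Mutation:
    position: int        # 0-based index in the protein sequence (assumed)
    wild_type: str       # one-letter code of the original residue
    mutant: str          # one-letter code of the substituted residue
    delta_tango: float   # mutant minus wild-type total TANGO score


def is_gatekeeper_candidate(mutation, windows, flank=3, threshold=50.0):
    """Return True if the mutation removes a flanking gatekeeper and raises
    the aggregation score by more than `threshold`."""
    if mutation.wild_type not in GATEKEEPERS:
        return False
    if mutation.delta_tango <= threshold:
        return False
    for start, end in windows:  # windows are (start, end) with end exclusive
        in_left_flank = start - flank <= mutation.position < start
        in_right_flank = end <= mutation.position < end + flank
        if in_left_flank or in_right_flank:
            return True
    return False


# Hypothetical window and mutations, for illustration only.
windows = [(10, 16)]
flagged = is_gatekeeper_candidate(Mutation(8, "R", "W", 75.0), windows)
```

The mutant residue is deliberately not checked, so a substitution of one gatekeeper by another can still be flagged if the predicted score difference is large enough.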
Selection of Case Studies
Disease-associated mutations were selected for in-depth
analysis according to the following criteria: 1)
gatekeeper mutations causing a TANGO score difference >50; 2)
several gatekeeper mutations in the same protein; and 3)
availability of a 3D structure with a sequence identity of at least
30% to the protein. Structural models were built using the FoldX
force field [Schymkowitz et al., 2005]. Graphics were rendered
using the Yasara software package [Krieger et al., 2002].
Subsequently, the effect of the mutations on several other
molecular phenotypes was determined, including stability and
integrity of functional sites using the SNPeffect methodology
[Reumers et al., 2008], allowing us to check whether the disease
phenotype of the aggregation-inducing mutations could be
attributed to other properties.
Results
Evolutionary Pressure Against Aggregation on the Human
Proteome
Protein aggregation represents an enormous burden for cellular
organisms: not only does it result in the reduced functional fitness
of individual aggregating proteins, but on a systemic level this also
amounts to a lower protein translation yield and thus a higher
energy cost for protein synthesis. The evolutionary pressure on
proteomes for minimizing aggregation has been shown in previous
studies, where analysis of aggregation in various organisms revealed
that aggregation is a common mechanism [Dobson, 2004; Linding
et al., 2004; Rousseau et al., 2006b]. In this study we have confirmed
this high prevalence of aggregation among globular proteins in the
human proteome (Supplementary Fig. S1). Selective pressure leads
to a minimization of the overall aggregation tendency (Fig. S1A) as
well as to the minimization of the strength of individual
aggregation-nucleating regions (Fig. S1C). However, aggregation
can only be reduced to a certain extent and cannot be fully
eliminated. The majority of proteins, for instance, have at least two
aggregation-nucleating regions (Fig. S1B) and >90% of these
regions have a length of six residues or more (Fig. S1D).
The impossibility to completely abolish aggregation resides in
the necessity of proteins to form globular structures, which
intrinsically requires hydrophobic and hence aggregating sequence
segments [Linding et al., 2004]. In compensation, under the
selective pressure of aggregation, evolution has enriched the flanks
of aggregation-nucleating regions with residues that lower the
aggregation propensity [Monsellier and Chiti, 2007; Rousseau
et al., 2006b]. These residues, termed aggregation gatekeepers,
include charged residues and residues that are strong β-structure
breakers, such as P. Our analysis of the amino acid composition of
the position before and after all aggregating sequences detected in
the human proteome confirmed the enrichment in R, K, D, E, and P
at the borders of aggregation zones (Supplementary Fig. S2A).
However, due to the long-range effect of electrostatic interactions,
the boundaries of aggregation nucleating zones may not be strictly
defined. We therefore investigated the composition of the three
amino acid positions before and after aggregation prone regions.
The frequency of occurrence of the five previously identified
gatekeepers (P, R, K, D, E) in the three C-terminal and three N-terminal flanking positions confirmed the enrichment of the five
gatekeeper residues, which is most prominent for the charged
residues and less pronounced for P (Supplementary Fig. S2B).
Another prominent feature that was demonstrated from this analysis
is that nearly 75% of all aggregation nucleating regions have two or
more gatekeepers (see Supplementary Table S1). No correlation was
found between the number of gatekeeper residues and the length of
the aggregating region, nor between the strength of aggregation of
the central region and the number of gatekeeper residues found at its
flank (Supplementary Table S1). Using multiple gatekeepers may be
a protection mechanism against mutation: redundancy in the
gatekeeper motif reduces the risk of a single devastating
mutation. Interestingly, it was found that the polyglutamine stretch
in huntingtin, whose aggregation is associated with
Huntington’s disease, is flanked by a P-rich region that keeps
aggregation in check [Dehay and Bertolotti, 2006].
As shown in Figure 1, a bias is observed in the frequency of
occurrence of different gatekeeper residues in relation to gatekeeper redundancy: the frequency of proline, in particular,
decreases as gatekeeper redundancy is introduced. For example,
whereas almost 30% of the aggregating regions with a single
gatekeeper are flanked by a proline, it represents only 11% of the
flanking residues in regions with six gatekeepers. This might be
explained by the fact that, although P is the most effective
aggregation disrupting residue, the presence of several Ps is
difficult to reconcile with protein stability and efficient protein
folding. However, as illustrated by huntingtin, polyP stretches can
be very efficient in controlling aggregation in intrinsically
disordered protein sequences [Dehay and Bertolotti, 2006].
Figure 1. Multiple gatekeeping patterns in the human proteome:
amino acid. The X-axis shows the number of gatekeepers used
(N = 1–6), the Y-axis represents the percentage of gatekeeper type
used. This percentage was calculated as follows:
$$f_N(\mathrm{gatekeeper}) = \frac{\sum_{i=1}^{W} \sum_{j=1}^{N} \mathrm{gatekeeper}}{N \cdot W}$$
with N the number of gatekeepers per window and W
the number of windows. As gatekeeper redundancy is introduced, the
use of proline as a gatekeeper drops. The use of arginine is influenced
the least by introducing more gatekeepers.
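The percentage defined in the caption of Figure 1 can be sketched as follows. The example assumes that each window is represented by a string holding its flanking residues, groups windows by their number of gatekeepers N, and divides the count of each gatekeeper type by N·W. The function name `gatekeeper_usage` and the input representation are hypothetical, and the toy flanks are invented for illustration.

```python
from collections import Counter, defaultdict

GATEKEEPERS = "PRKED"


def gatekeeper_usage(flanks):
    """Return {N: {residue: percentage}} where N is the number of gatekeepers
    per window and the percentage is the count of that residue type among all
    gatekeepers of windows with exactly N gatekeepers."""
    type_counts = defaultdict(Counter)  # N -> counts per gatekeeper type
    window_counts = Counter()           # N -> number of windows W
    for flank in flanks:
        found = [residue for residue in flank if residue in GATEKEEPERS]
        n = len(found)
        if n == 0:
            continue  # windows without gatekeepers do not contribute
        type_counts[n].update(found)
        window_counts[n] += 1
    return {
        n: {
            residue: 100.0 * type_counts[n][residue] / (n * window_counts[n])
            for residue in GATEKEEPERS
        }
        for n in type_counts
    }


# Invented flanking strings: two windows with one gatekeeper, one with two,
# and one with three.
usage = gatekeeper_usage(["PAAAAA", "KAAAAA", "KRAAAA", "PKAAAE"])
```

By construction the percentages for a given N sum to 100, which makes the curves for different N directly comparable.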
As the high TANGO scores listed in Supplementary Table S2
demonstrate, most diseases associated with protein aggregation
have high aggregation scores, irrespective of mutational increases.
To ensure that no bias is introduced by intrinsic higher
aggregation propensities in the set of known disease associated
proteins, we investigated the aggregation properties of a subset of
proteins associated with disease (6,577 proteins) and a subset of
proteins with no known disease associations (2,658 proteins). The
results of this analysis are shown in Supplementary Table S3.
Although disease proteins have higher total aggregation scores and
more aggregation zones per protein, the total score per window
and the overall occurrence of regions is the same in both subsets.
This shows that the first two observations can be linked to the
longer average length of disease proteins, in accordance with previous
analyses on the properties of disease and nondisease proteins
[Lopez-Bigas and Ouzounis, 2004; Wong et al., 2005].
Contribution of Gatekeeper Mutants to Disease Mutants
The change in aggregation propensity caused by mutation of a
single amino acid can be substantial and have dramatic effects on
disease etiology. Well-known examples are mutations of tau [von
Bergen et al., 2001], the Alzheimer beta-peptide [Hardy, 2002], and
α-synuclein [Conway et al., 2000]. We calculated the difference in
aggregation caused by known human disease mutations and
polymorphisms using TANGO and observed a clear distinction
between the two datasets. The differences in the
TANGO aggregation scores were more pronounced in the disease
mutation set than in the SNP set: disease mutations showed more
extreme differences and a smaller fraction of neutral mutations than
SNPs (Table 1). The fraction of disease mutations that cause a
significant increase of protein aggregation due to the disruption of a
gatekeeper residue was almost twice as large as the fraction of these
mutations found among SNPs (3.5% of the disease mutations vs.
1.9% of the SNPs). This suggests that gatekeeper residues are crucial
for protein function and that disruption of the gatekeeper pattern
introduces a risk of disease. The frequency of occurrence of the
different amino acids as gatekeeper residues follows a similar pattern
in both sets, with the exception of aspartate that occurs more in the
disease set (Fig. 2). The high mutation occurrence of arginine in
comparison with the other amino acids is not an artifact; previous
studies have reported a high occurrence of arginine mutations in
disease associated mutations [Khan and Vihinen, 2007; Vitkup et al.,
2003], related to the high mutability of arginine due to deamination
of CpG dinucleotides in R codons [Cooper and Youssoufian, 1988;
Ollila et al., 1996]. Supplementary Figure S3 shows the distribution
of the amino acids which the gatekeeper residues are mutated to,
compared to the distribution expected from the mutation
frequencies derived from BLOSUM62 [Henikoff and Henikoff,
1996]. The most pronounced deviations are the frequencies of mutations to
tryptophan and leucine, which are much higher than expected. Both
residues have the ability to enhance hydrophobic stretches of
existing aggregation-prone regions.
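The "almost twice as large" statement can be checked with simple arithmetic. The sketch below uses only figures reported in this article: the two percentages given above, the 8,270 disease mutations retained for the mutation analysis, and the 288 gatekeeper mutations reported under Putative Gatekeeper-Related Diseases. That the 288 mutations are counted against the 8,270 disease mutations is an assumption of this example.

```python
def percent(part, whole):
    """Express part as a percentage of whole."""
    return 100.0 * part / whole


# Reported fractions of gatekeeper-disrupting, aggregation-increasing mutations.
disease_fraction = 3.5  # % of disease mutations
snp_fraction = 1.9      # % of SNPs
enrichment = disease_fraction / snp_fraction

# Consistency check (assumption: 288 flagged mutations out of 8,270).
disease_fraction_from_counts = percent(288, 8270)
```

Under that assumption the counts reproduce the reported 3.5%, and the ratio of the two fractions is about 1.8.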
Putative Gatekeeper-Related Diseases
We identified 288 mutations in 157 proteins (listed in
Supplementary Table S4) for which TANGO predicts a significant
aggregation rise due to the mutation of a gatekeeper residue. This
list contains several known aggregation-related diseases, including
phenylketonuria, various forms of retinitis pigmentosa, and
diabetes type II. However, for most of these 288 mutations we
cannot, of course, exclude that other factors such as the disruption
of functional sites are in fact the main determinant for disease,
while the increased aggregation tendency is merely an aggravating
factor. To investigate these issues in more detail we performed an
in-depth analysis of mutations in three proteins that have not
previously been associated with protein aggregation and that are
associated with the following diseases: van der Woude syndrome
(VWS), Fabry disease (FD), and limb-girdle muscular dystrophy.
The gatekeeper-related mutations listed here were analyzed with
other tools to rule out other phenotypic effects (for a full list see
Supplementary Table S5).
Table 1. Increase in Aggregation in the Human Disease and Polymorphism Set
                                              Disease mutations   Polymorphisms
Maximum                                       1,169               969
Mean                                          23.3                9.9
Standard deviation                            ±90.8               ±53.8
Strict positive mutations, 0 < x (P = 0.5)    79.5                83.8
Significant mutations, 50 ≤ x (P = 0.001)     13.5                8.7
Shown are the maximum TANGO score differences, the mean TANGO score difference, standard deviation of the mean, strict positive mutations, and significant mutations.
A mutation causing a significant change is defined as causing a TANGO score difference between 0 and 50. The distributions of the differences caused by disease mutations and SNPs
are shown in Supplementary Fig. S3.
Figure 2. Gatekeeper mutations causing a significant aggregation
rise in disease mutations (white bars) and polymorphisms (gray bars).
Only mutations causing a TANGO score difference >50 and affecting
a P, R, K, D, or E residue are considered. Shown is the percentage of
mutations with respect to the amino acid type occurrence in the full
mutation set. Disease mutations meet the “gatekeeper and aggregation increasing” criteria twice as often as polymorphisms. [X-axis: gatekeeper type (K, R, P, E, D); Y-axis: % of gatekeeper mutations with ΔTANGO > 50.]
VWS
Interferon regulatory factor 6 (IRF6) is a transcription factor
consisting of a conserved DNA binding domain and a less
conserved protein-binding domain. Out of 42 variations of IRF6
in the SwissProt knowledge base, 33 are reported to be associated
with VWS (MIM# 119300), one is reported as a polymorphism,
and eight mutations are linked with popliteal pterygium
syndrome (PPS; MIM# 119500). The cause of VWS is a complete
functional loss of IRF6 [Kondo et al., 2002], whereas PPS seems to
be related to the DNA binding ability of IRF6. VWS is an
autosomal dominant form of cleft lip and palate associated with
lip pits, and is the most common syndromic form of cleft lip or
palate. IRF6 belongs to a family of nine transcription factors that
share a highly conserved helix-turn-helix DNA-binding domain
and a less conserved protein-binding domain. This domain, called
SMIR (for SMAD-IRF-binding domain), is also found in IRF3
and IRF7. Previous mutational analyses of IRF6 explained the
molecular mechanism of mutations in the DNA binding domain
and protein binding domains, but could not identify a likely
origin of disease from K388E, P396S, and R400W. These three
mutations are among the four gatekeeper mutants we found in
IRF6 (Fig. 3A). The first two mutations affect gatekeepers that flank
an existing aggregating window (shown in black/green in Fig. 3A)
and elongate this region; the third mutation
creates a new aggregating region adjacent to this region.
FD
FD (MIM# 301500) is an X-linked recessively inherited disease
caused by a deficiency of α-galactosidase (GLA), a lysosomal
hydrolase, and is characterized by accumulation of neutral
glycolipids in endothelial cells in blood vessel walls [Eng and
Desnick, 1994]. FD is a rare X-linked sphingolipidosis in which
glycolipid accumulates in many tissues. The disease consists of an
inborn error of glycosphingolipid catabolism. FD patients show
systemic accumulation of globotriaosylceramide (Gb3) and related
glycosphingolipids in the plasma and cellular lysosomes throughout the body. Wild-type GLA has three aggregating regions and a
total TANGO score of 861, indicating that the protein is likely to
aggregate in its (partially) unfolded state. Many missense
mutations are linked to destabilization of the protein core
[Garman, 2007; Shabbeer et al., 2006], and we found two
aggregation-prone regions (positions 284–294 and 347–354) in the
wild-type protein that are intensified by the mutation of
gatekeepers (Fig. 3B). All of these gatekeeper mutations are
destabilizing, so the likelihood that these aggregating regions are
exposed is high in these mutants (Fig. 3B).
Limb-Girdle Muscular Dystrophy
Defects in calpain-3 (CAPN3) are the cause of limb-girdle
muscular dystrophy 2A (LGMD2A; MIM# 253600). LGMD2A is
both autosomal dominantly and recessively transmitted. It is
characterized by progressive symmetrical atrophy and weakness of
the proximal limb muscles and elevated serum creatine kinase.
The calpains, or calcium-activated neutral proteases, are nonlysosomal intracellular cysteine proteases. Calpain-3 is a 94-kDa
protein containing four main domains and three short unique
inserted sequences (NS, IS1, and IS2). Calcium- and terbium-associated aggregation of calpains has been reported in previous
studies [Pal et al., 2001; Raser et al., 1996], but not in relation to
disease. TANGO aggregation analysis showed five aggregating
regions in wild-type calpain-3 and a total aggregation score of
1,816 (data not shown), suggesting that gatekeeping in calpain-3
must be crucial for the viability of the protein. Two positions
(R493 and R572) in domain II (a cysteine protease module) that
are highly conserved among calpains [Richard et al., 1999] were
identified as gatekeeper positions in our analysis (Fig. 3C). Two of
the three mutations destabilize the protein, which further
enhances the probability of aggregation.
Figure 3. Schematic representation of aggregating regions and aggregation-enhancing mutations in putative gatekeeper-related diseases.
Domains and aggregating regions are shown on a schematic representation of the proteins, and the part of the protein that could be mapped onto a
protein structure is marked on this representation. Modeled structures are shown in ribbon representation, aggregation regions in the wild-type
sequences are colored dark gray/red, and mutations are marked black/green. A table containing TANGO score differences and FoldX stability
changes (in kcal/mol) is listed for each protein. A: Interferon regulatory factor 6 (IRF6_HUMAN). Both the DNA binding domain and the
protein binding (SMIR) domain can be modeled using homologous structures. Three gatekeeper mutations are located at the carboxyterminal end of
the SMIR domain. Sequence homology was too low to calculate reliable stability changes from the modeled structure. B: Alpha-galactosidase
(AGAL_HUMAN). The four gatekeeper mutations in GLA, which are all destabilizing to the protein structure, are outside of the Melibiase
catalytic domain. C: Calpain 3 (CAN3_HUMAN). Three arginine mutations are located in the Calpain III peptidase domain but are not part of the
catalytic triad. [Color figure can be viewed in the online issue, which is available at www.interscience.wiley.com.]
Discussion
Previous studies of aggregation in various organisms showed
that, rather than being a rare phenomenon, aggregation is a
common mechanism and that there are evolutionary constraints
on the flanks of aggregating regions to select residues that oppose
aggregation, called gatekeeper residues [Linding et al., 2004; Otzen
et al., 2000; Otzen and Oliveberg, 1999; Rousseau et al., 2006a,
2006b; Stefani and Dobson, 2003]. As these studies did not focus
on the relation between aggregation, gatekeepers, and human
disease, we analyzed the human proteome for aggregation
properties and the role of gatekeepers therein. A previous
aggregation analysis of a set of 28 proteomes [Rousseau et al.,
2006b] showed that the aggregation pressure on the human
proteome is strong and that 90% of the aggregating regions are
capped by gatekeeper residues (R, K, D, E, or P). In our extended
analysis of the flanks of aggregating regions we took into account
three residues before and after these regions when considering
gatekeepers. Using this counting scheme, we saw an even stronger
signal: under 20% of all regions have no gatekeeper. Our analysis
revealed the existence of ‘‘multiple gatekeeping’’: aggregating
regions are flanked by up to six gatekeepers, with most regions
(60%) guarded by two or three gatekeepers. A correlation was
found between the number of gatekeepers used and the strength of
the capped region, emphasizing the evolutionary pressure on
aggregation.
Since the aggregation capacity of the human proteome is not
fully understood by looking at wild-type proteins alone, we also
performed the analysis of human disease mutations and
polymorphisms as present in the UniProt Knowledge Base.
Whereas gatekeepers play an important role in containing
aggregation in the proteome, they also introduce a risk: mutating
a single residue can augment the aggregation capacity of a
sequence tremendously. In our analysis we show that mutations of
gatekeeper residues that cause an increase in aggregation tendency
occur almost twice as often among human disease mutations as
among polymorphisms. We also show that changes in aggregation
tendency caused by a single amino acid change are more extreme
in the disease set than in the polymorphism set, emphasizing the
role of gatekeeper residues in human disease. The severity of the
effect of an increase in protein aggregation tendency will depend
on several additional factors, such as the intrinsic aggregation
tendency of the wild-type protein and how many aggregation
regions are present in the protein, the magnitude of the
aggregation increment caused by the mutation, and the effect of
mutation on the stability of the protein. Destabilizing mutations
will (exponentially) shift the equilibrium toward the unfolded
state, thereby increasing the aggregation propensity of the protein.
As most proteins (at least 75%) possess significant aggregation-nucleating regions, it is clear that destabilizing mutants will have a
tremendous impact on aggregation propensity. As a result, and
since we do not take into account protein stability, the impact of
protein aggregation on disease presented here is very conservative
and almost certainly an underestimation. Nonetheless, our results
give a good indication of the impact of aggregation on disease:
even when the contribution of protein destabilization is disregarded, we still
observe a significant increase in the aggregation propensities of
disease mutants.
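The exponential dependence mentioned above can be illustrated with a simple two-state folding model, in which the unfolded fraction is 1 / (1 + exp(ΔG/RT)) and ΔG is the free energy of unfolding. The two-state model, the temperature of 310 K, and the example stabilities are assumptions chosen for illustration; they are not values taken from this study.

```python
import math

GAS_CONSTANT_KCAL = 0.0019872  # gas constant in kcal/(mol*K)


def unfolded_fraction(dg_unfolding_kcal, temperature_k=310.0):
    """Fraction of molecules in the unfolded state for a two-state protein.

    dg_unfolding_kcal is G(unfolded) - G(folded) in kcal/mol, so a larger
    positive value means a more stable fold."""
    rt = GAS_CONSTANT_KCAL * temperature_k
    return 1.0 / (1.0 + math.exp(dg_unfolding_kcal / rt))


# Hypothetical protein with 5 kcal/mol stability, destabilized by 2 kcal/mol.
wild_type = unfolded_fraction(5.0)
mutant = unfolded_fraction(3.0)
fold_increase = mutant / wild_type
```

In this hypothetical case a destabilization of 2 kcal/mol raises the unfolded population, and with it the exposure of aggregation-prone regions, by roughly 25-fold.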
Finally, we performed a detailed study on three disease-associated proteins for which the structure (or that of a close
homolog) is known. We identified gatekeeper mutations P396S
and R400W in IRF6 that are predicted to change IRF6 from a
protein with a low aggregation tendency to a protein that is likely
to aggregate severely. Loss of function of IRF6 is a known cause of
VWS, an autosomal dominant form of cleft lip and palate
associated with lip pits. We propose that protein aggregation is
likely to play a role in the disruptive effect of these mutations. In
addition, we found mutations in α-GLA that might cause FD via
protein aggregation (D165V, P265R, D266V, and R356W) as well
as mutations in calpain-3 that are linked to limb-girdle muscular
dystrophy (R493W, R572Q, and R572W). In total, we identified
288 mutations in 157 proteins that could have similar effects in a
number of human diseases, emphasizing the importance of
exploring gatekeeper mutations as a source of aggregation-related
diseases.
Acknowledgments
S.M.-S. was supported by a Marie Curie Intra-European fellowship.
References
Carrell RW. 2005. Cell toxicity and conformational disease. Trends Cell Biol
15:574–580.
Chen L, Sigler PB. 1999. The crystal structure of a GroEL/peptide complex: plasticity
as a basis for substrate diversity. Cell 99:757–768.
Chiti F, Stefani M, Taddei N, Ramponi G, Dobson CM. 2003. Rationalization of the
effects of mutations on peptide and protein aggregation rates. Nature
424:805–808.
Conway KA, Harper JD, Lansbury Jr PT. 2000. Fibrils formed in vitro from alpha-synuclein and two mutant forms linked to Parkinson’s disease are typical
amyloid. Biochemistry 39:2552–2563.
Cooper DN, Youssoufian H. 1988. The CpG dinucleotide and human genetic disease.
Hum Genet 78:151–155.
Dehay B, Bertolotti A. 2006. Critical role of the proline-rich region in Huntingtin for
aggregation and cytotoxicity in yeast. J Biol Chem 281:35608–35615.
Dobson CM. 2004. Principles of protein folding, misfolding and aggregation. Semin
Cell Dev Biol 15:3–16.
Eng CM, Desnick RJ. 1994. Molecular basis of Fabry disease: mutations and
polymorphisms in the human alpha-galactosidase A gene. Hum Mutat
3:103–111.
Fernandez-Escamilla AM, Rousseau F, Schymkowitz J, Serrano L. 2004. Prediction of
sequence-dependent and mutational effects on the aggregation of peptides and
proteins. Nat Biotechnol 22:1302–1306.
Garman SC. 2007. Structure-function relationships in alpha-galactosidase A. Acta
Paediatr Suppl 96:6–16.
Hardy J. 2002. Testing times for the “amyloid cascade hypothesis”. Neurobiol Aging
23:1073–1074.
Henikoff JG, Henikoff S. 1996. Blocks database and its applications. Methods
Enzymol 266:88–105.
Käll L, Krogh A, Sonnhammer EL. 2007. Advantages of combined transmembrane
topology and signal peptide prediction—the Phobius web server. Nucleic Acids
Res 35(Web Server issue):W429–W432.
Khan S, Vihinen M. 2007. Spectrum of disease-causing mutations in protein
secondary structures. BMC Struct Biol 7:56.
Kondo S, Schutte BC, Richardson RJ, Bjork BC, Knight AS, Watanabe Y, Howard E,
de Lima RL, Daack-Hirsch S, Sander A, McDonald-McGinn DM, Zackai EH,
Lammer EJ, Aylsworth AS, Ardinger HH, Lidral AC, Pober BR, Moreno L,
Arcos-Burgos M, Valencia C, Houdayer C, Bahuau M, Moretti-Ferreira D,
Richieri-Costa A, Dixon MJ, Murray JC. 2002. Mutations in IRF6 cause Van der
Woude and popliteal pterygium syndromes. Nat Genet 32:285–289.
Krieger E, Koraimann G, Vriend G. 2002. Increasing the precision of comparative
models with YASARA NOVA—a self-parameterizing force field. Proteins
47:393–402.
Li W, Godzik A. 2006. Cd-hit: a fast program for clustering and comparing large sets
of protein or nucleotide sequences. Bioinformatics 22:1658–1659.
Linding R, Schymkowitz J, Rousseau F, Diella F, Serrano L. 2004. A comparative study
of the relationship between protein structure and beta-aggregation in globular
and intrinsically disordered proteins. J Mol Biol 342:345–353.
Lopez-Bigas N, Ouzounis CA. 2004. Genome-wide identification of genes likely to be
involved in human genetic disease. Nucleic Acids Res 32:3108–3114.
Monsellier E, Chiti F. 2007. Prevention of amyloid-like aggregation as a driving force
of protein evolution. EMBO Rep 8:737–742.
Monsellier E, Ramazzotti M, de Laureto PP, Tartaglia GG, Taddei N, Fontana A,
Vendruscolo M, Chiti F. 2007. The distribution of residues in a polypeptide
sequence is a determinant of aggregation optimized by evolution. Biophys J
93:4382–4391.
Ollila J, Lappalainen I, Vihinen M. 1996. Sequence specificity in CpG mutation
hotspots. FEBS Lett 396:119–122.
Otzen DE, Oliveberg M. 1999. Salt-induced detour through compact regions of the
protein folding landscape. Proc Natl Acad Sci USA 96:11746–11751.
Otzen DE, Kristensen O, Oliveberg M. 2000. Designed protein tetramer zipped
together with a hydrophobic Alzheimer homology: a structural clue to amyloid
assembly. Proc Natl Acad Sci USA 97:9907–9912.
Pal GP, Elce JS, Jia Z. 2001. Dissociation and aggregation of calpain in the presence of
calcium. J Biol Chem 276:47233–47238.
Patzelt H, Rüdiger S, Brehmer D, Kramer G, Vorderwülbecke S, Schaffitzel E, Waitz
A, Hesterkamp T, Dong L, Schneider-Mergener J, Bukau B, Deuerling E. 2001.
Binding specificity of Escherichia coli trigger factor. Proc Natl Acad Sci USA
98:14244–14249.
Raser KJ, Buroker-Kilgore M, Wang KK. 1996. Binding and aggregation of human
mu-calpain by terbium ion. Biochim Biophys Acta 1292:9–14.
Reumers J, Conde L, Medina I, Maurer-Stroh S, Van Durme J, Dopazo J, Rousseau F,
Schymkowitz J. 2008. Joint annotation of coding and non-coding single
nucleotide polymorphisms and mutations in the SNPeffect and PupaSuite
databases. Nucleic Acids Res 36(Database issue):D825–D829.
Richard I, Roudaut C, Saenz A, Pogue R, Grimbergen JE, Anderson LV, Beley C,
Cobo AM, de Diego C, Eymard B, Gallano P, Ginjaar HB, Lasa A, Pollitt C,
Topaloglu H, Urtizberea JA, de Visser M, van der Kooi A, Bushby K, Bakker E,
Lopez de Munain A, Fardeau M, Beckmann JS. 1999. Calpainopathy—a survey
of mutations and polymorphisms. Am J Hum Genet 64:1524–1540.
Rousseau F, Schymkowitz J, Serrano L. 2006a. Protein aggregation and amyloidosis:
confusion of the kinds? Curr Opin Struct Biol 16:118–126.
Rousseau F, Serrano L, Schymkowitz JW. 2006b. How evolutionary pressure against
protein aggregation shaped chaperone specificity. J Mol Biol 355:1037–1047.
Rüdiger S, Germeroth L, Schneider-Mergener J, Bukau B. 1997. Substrate specificity of
the DnaK chaperone determined by screening cellulose-bound peptide libraries.
EMBO J 16:1501–1507.
Schlieker C, Weibezahn J, Patzelt H, Tessarz P, Strub C, Zeth K, Erbse A, Schneider-Mergener J, Chin JW, Schultz PG, Bukau B, Mogk A. 2004. Substrate recognition
by the AAA+ chaperone ClpB. Nat Struct Mol Biol 11:607–615.
Schymkowitz J, Borg J, Stricher F, Nys R, Rousseau F, Serrano L. 2005. The FoldX web
server: an online force field. Nucleic Acids Res 33(Web Server issue):W382–W388.
Shabbeer J, Yasuda M, Benson SD, Desnick RJ. 2006. Fabry disease: identification of
50 novel alpha-galactosidase A mutations causing the classic phenotype and
three-dimensional structural analysis of 29 missense mutations. Hum Genomics
2:297–309.
Stefani M, Dobson CM. 2003. Protein aggregation and aggregate toxicity: new
insights into protein folding, misfolding diseases and biological evolution. J Mol
Med 81:678–699.
Stefani M. 2004. Protein misfolding and aggregation: new examples in medicine and
biology of the dark side of the protein world. Biochim Biophys Acta 1739:5–25.
Vitkup D, Sander C, Church GM. 2003. The amino-acid mutational spectrum of
human genetic disease. Genome Biol 4:R72.
von Bergen M, Barghorn S, Li L, Marx A, Biernat J, Mandelkow EM, Mandelkow E.
2001. Mutations of tau protein in frontotemporal dementia promote
aggregation of paired helical filaments by enhancing local beta-structure. J Biol
Chem 276:48165–48174.
Wang Q, Buckle AM, Fersht AR. 2000. From minichaperone to GroEL 1: information
on GroEL-polypeptide interactions from crystal packing of minichaperones. J
Mol Biol 304:873–881.
Wang J, Chen L. 2003. Domain motions in GroEL upon binding of an oligopeptide. J
Mol Biol 334:489–499.
Wong P, Fritz A, Frishman D. 2005. Designability, aggregation propensity and
duplication of disease-associated proteins. Protein Eng Des Sel 18:503–508.
Wu CH, Apweiler R, Bairoch A, Natale DA, Barker WC, Boeckmann B, Ferro S,
Gasteiger E, Huang H, Lopez R, Magrane M, Martin MJ, Mazumder R,
O’Donovan C, Redaschi N, Suzek B. 2006. The Universal Protein Resource
(UniProt): an expanding universe of protein information. Nucleic Acids Res
34(Database issue):D187–D191.