Abstract
Copy number variants (CNVs) are major contributors to genetic diversity and disease. While standardized methods, such as the genome analysis toolkit (GATK), exist for detecting short variants, technical challenges have confounded uniform large-scale CNV analyses from whole-exome sequencing (WES) data. Given the profound impact of rare and de novo coding CNVs on genome organization and human disease, we developed GATK-gCNV, a flexible algorithm to discover rare CNVs from sequencing read-depth information, complete with open-source distribution via GATK. We benchmarked GATK-gCNV in 7,962 exomes from individuals in quartet families with matched genome sequencing and microarray data, finding up to 95% recall of rare coding CNVs at a resolution of more than two exons. We used GATK-gCNV to generate a reference catalog of rare coding CNVs in WES data from 197,306 individuals in the UK Biobank, and observed strong correlations between per-gene CNV rates and measures of mutational constraint, as well as rare CNV associations with multiple traits. In summary, GATK-gCNV is a tunable approach for sensitive and specific CNV discovery in WES data, with broad applications.
This is a preview of subscription content, access via your institution
Access options
Access Nature and 54 other Nature Portfolio journals
Get Nature+, our best-value online-access subscription
$29.99 / 30 days
cancel any time
Subscribe to this journal
Receive 12 print issues and online access
$209.00 per year
only $17.42 per issue
Buy this article
- Purchase on SpringerLink
- Instant access to full article PDF
Prices may be subject to local taxes which are calculated during checkout
Similar content being viewed by others
Data availability
The SSC benchmarking raw sequencing data can be accessed through NHGRI AnVIL; accession ID: phs000298; databank URL: https://anvilproject.org/data. SSC CNVs can be accessed through SFARIBase (base.sfari.org), accession IDs: SFARI_DS340921 (CNVs). Approval by the Simons Foundation for Autism Research Initiative (SFARI) is required.
Access to the UKBB raw sequencing data and the CNV data generated here will be provided by the UKBB (https://www.ukbiobank.ac.uk).
GENCODE V33 annotation can be found at https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_33/gencode.v33.annotation.gtf.gz
Code availability
GATK-gCNV is distributed as part of the GATK jar release. For an example workspace on Terra, with recommended parameters, please see https://app.terra.bio/#workspaces/help-gatk/Germline-CNVs-GATK4.
GATK-gCNV evaluation and benchmarking code is available at
https://github.com/broadinstitute/GATK-gCNV-publication/tree/master/evaluation_code.
CMA-CNV Validation code consists of
https://github.com/talkowski-lab/cnv-validation.
GenomeSTRiP version 2.00.1982
http://software.broadinstitute.org/software/genomestrip/.
MoChA version 2022-01-14 WDL https://software.broadinstitute.org/software/mocha/mocha.20220114.wdl.
Change history
23 January 2024
A Correction to this paper has been published: https://doi.org/10.1038/s41588-024-01663-4
References
Marshall, C. R. et al. Structural variation of chromosomes in autism spectrum disorder. Am. J. Hum. Genet. 82, 477–488 (2008).
Egolf, L. E. et al. Germline 16p11.2 microdeletion predisposes to neuroblastoma. Am. J. Hum. Genet. 105, 658–668 (2019).
Ebert, P. et al. Haplotype-resolved diverse human genomes and integrated analysis of structural variation. Science 372, eabf7117 (2021).
Ruderfer, D. M. et al. Patterns of genic intolerance of rare copy number variation in 59,898 human exomes. Nat. Genet. 48, 1107–1111 (2016).
Miller, D. T. et al. Consensus statement: chromosomal microarray is a first-tier clinical diagnostic test for individuals with developmental disabilities or congenital anomalies. Am. J. Hum. Genet. 86, 749–764 (2010).
Srivastava, S. et al. Meta-analysis and multidisciplinary consensus statement: exome sequencing is a first-tier clinical diagnostic test for individuals with neurodevelopmental disorders. Genet. Med. 21, 2413–2421 (2019).
Gnirke, A. et al. Solution hybrid selection with ultra-long oligonucleotides for massively parallel targeted sequencing. Nat. Biotechnol. 27, 182–189 (2009).
Ng, S. B. et al. Targeted capture and massively parallel sequencing of 12 human exomes. Nature 461, 272–276 (2009).
Lelieveld, S. H., Spielmann, M., Mundlos, S., Veltman, J. A. & Gilissen, C. Comparison of exome and genome sequencing technologies for the complete capture of protein-coding regions. Hum. Mutat. 36, 815–822 (2015).
Benjamini, Y. & Speed, T. P. Summarizing and correcting the GC content bias in high-throughput sequencing. Nucleic Acids Res. 40, e72 (2012).
Fromer, M. et al. Discovery and statistical genotyping of copy-number variation from whole-exome sequencing depth. Am. J. Hum. Genet. 91, 597–607 (2012).
Jiang, Y., Oldridge, D. A., Diskin, S. J. & Zhang, N. R. CODEX: a normalization and copy number variation detection method for whole exome sequencing. Nucleic Acids Res. 43, e39 (2015).
Handsaker, R. E. et al. Large multiallelic copy number variations in humans. Nat. Genet. 47, 296–303 (2015).
Packer, J. S. et al. CLAMMS: a scalable algorithm for calling common and rare copy number variants from exome sequencing data. Bioinformatics 32, 133–135 (2016).
Klambauer, G. et al. cn.MOPS: mixture of Poissons for discovering copy number variations in next-generation sequencing data with a low false discovery rate. Nucleic Acids Res. 40, e69 (2012).
Olshen, A. B., Venkatraman, E. S., Lucito, R. & Wigler, M. Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics 5, 557–572 (2004).
Backman, J. D. et al. Exome sequencing and analysis of 454,787 UK Biobank participants. Nature 599, 628–634 (2021).
Lek, M. et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature 536, 285–291 (2016).
Fu, J. M. et al. Rare coding variation provides insight into the genetic architecture and phenotypic context of autism. Nat. Genet. 54, 1320–1331 (2022).
Singh, T. et al. Rare coding variants in ten genes confer substantial risk for schizophrenia. Nature 604, 509–516 (2022).
Flannick, J. et al. Exome sequencing of 20,791 cases of type 2 diabetes and 24,440 controls. Nature 570, 71–76 (2019).
McKenna, A. et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 20, 1297–1303 (2010).
Byrska-Bishop, M. et al. High-coverage whole-genome sequencing of the expanded 1000 Genomes Project cohort including 602 trios. Cell 185, 3426–3440 (2022).
De Rubeis, S. et al. Synaptic, transcriptional and chromatin genes disrupted in autism. Nature 515, 209–215 (2014).
Werling, D. M. et al. An analytical framework for whole-genome sequence association studies and its implications for autism spectrum disorder. Nat. Genet. 50, 727–736 (2018).
Sanders, S. J. et al. Insights into autism spectrum disorder genomic architecture and biology from 71 risk loci. Neuron 87, 1215–1233 (2015).
Belyeu, J. R. et al. De novo structural mutation rates and gamete-of-origin biases revealed through genome sequencing of 2,396 families. Am. J. Hum. Genet. 108, 597–607 (2021).
Collins, R. L. et al. A structural variation reference for medical and population genetics. Nature 581, 444–451 (2020).
Frankish, A. et al. GENCODE 2021. Nucleic Acids Res. 49, D916–D923 (2021).
Fromer, M. & Purcell, S. M. Using XHMM software to detect copy number variation in whole-exome sequencing data. Curr. Protoc. Hum. Genet. 81, 7.23.1–7.23.21 (2014).
Krumm, N. et al. Copy number variation detection and genotyping from exome sequence data. Genome Res. 22, 1525–1532 (2012).
Plagnol, V. et al. A robust model for read count data in exome sequencing experiments and implications for copy number variant calling. Bioinformatics 28, 2747–2754 (2012).
Sudlow, C. et al. UK Biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Med. 12, e1001779 (2015).
Canela-Xandri, O., Rawlik, K. & Tenesa, A. An atlas of genetic associations in UK Biobank. Nat. Genet. 50, 1593–1599 (2018).
Owen, D. et al. Effects of pathogenic CNVs on physical traits in participants of the UK Biobank. BMC Genomics 19, 867 (2018).
Karczewski, K. J. et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581, 434–443 (2020).
Collins, R. L. et al. A cross-disorder dosage sensitivity map of the human genome. Cell 185, 3041–3055 (2022).
Pan-UK Biobank. Pan-ancestry genetic analysis of the UK Biobank. https://pan.ukbb.broadinstitute.org (2022).
Wu, M. C. et al. Rare-variant association testing for sequencing data with the sequence kernel association test. Am. J. Hum. Genet. 89, 82–93 (2011).
Auwerx, C. et al. The individual and global impact of copy-number variants on complex human traits. Am. J. Hum. Genet. 109, 647–668 (2022).
Adam, M. P. et al. Alpha-thalassemia. In GeneReviews (Adam, M. P. et. al. eds) (University of Washington, 2005); https://www.ncbi.nlm.nih.gov/books/NBK1435/
Sabath, D. E. et al. Characterization of deletions of the HBA and HBB loci by array comparative genomic hybridization. J. Mol. Diagn. 18, 92–99 (2016).
Anzai, N. et al. The multivalent PDZ domain-containing protein PDZK1 regulates transport activity of renal urate-anion exchanger URAT1 via its C terminus. J. Biol. Chem. 279, 45942–45950 (2004).
Sinnott-Armstrong, N. et al. Genetics of 35 blood and urine biomarkers in the UK Biobank. Nat. Genet. 53, 185–194 (2021).
Fitzgerald, T. & Birney, E. CNest: a novel copy number association discovery method uncovers 862 new associations from 200,629 whole-exome sequence datasets in the UK Biobank. Cell Genom. 2, 100167 (2022).
Laver, T. W. et al. SavvyCNV: genome-wide CNV calling from off-target reads. PLoS Comput. Biol. 18, e1009940 (2022).
Martin, A. R. et al. Low-coverage sequencing cost-effectively detects known and novel variation in underrepresented populations. Am. J. Hum. Genet. https://doi.org/10.1016/j.ajhg.2021.03.012 (2021).
Salvatier, J., Wiecki, T. V. & Fonnesbeck, C. Probabilistic programming in Python using MyMC3. PeerJ Comput. Sci. 2, e55 (2016).
Acknowledgements
We thank L. Lichtenstein, Y. Farjoun, B. Neale and N. Lennon for insightful discussions at various stages of this project, and S. Zaheri for carefully reviewing and providing feedback on the manuscript. This work was supported by grants from the Simons Foundation for Autism Research Initiative (573206), the SPARK project and SPARK analysis projects (606362 and 608540) and the National Institutes of Health (MH115957, HD081256, HG008895 and HG011450). J.M.F. was supported by an Autism Speaks Postdoctoral Fellowship and R.L.C. was supported by NSF GRFP 2017240332.
Author information
Authors and Affiliations
Contributions
M.B., E.B. and M.E.T. designed these studies and analyses. M.B., D.I.B. and S.K.L. developed and implemented the GATK-gCNV model and the inference algorithm. A.N.S. contributed model enhancements and developed sample-clustering and batch-processing workflows. X.Z., A.N.S. and J.M.F. conducted benchmarking studies of GATK-gCNV performance. A.N.S., M.B. and S.K.L. developed WDL workflows for Terra integration and scalable analysis. M.E.T., J.M.F., E.B., H.B., S.K.L., M.W. and L.D.G. supervised aspects of this project at various stages of development. J.M.F., R.L.C., H.B. and K.J.K. contributed to association analyses. I.W. and J.M.F. generated the CNV callsets. J.M.F., I.W., R.L.C., A.S.J. and H.B. conducted quality control on generated callsets. M.B., J.M.F., R.L.C., H.B. and M.E.T. wrote the manuscript, which was edited by all authors.
Corresponding authors
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Genetics thanks Birte Kehr, Christian Marshall and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Supplementary Information
Supplementary Note, Supplementary Figs. 1–10 and Supplementary Tables 2 and 3.
Supplementary Table 1
CNV-phenotype association analysis in UK Biobank.
Supplementary Data 1
Seven thousand nine hundred eighty-one target regions for PCA batching.
Supplementary Data 2
Standardized set of intervals based on Gencode V33 annotation.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Babadi, M., Fu, J.M., Lee, S.K. et al. GATK-gCNV enables the discovery of rare copy number variants from exome sequencing data. Nat Genet 55, 1589–1597 (2023). https://doi.org/10.1038/s41588-023-01449-0
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1038/s41588-023-01449-0
This article is cited by
-
Diversity and consequences of structural variation in the human genome
Nature Reviews Genetics (2025)
-
Diagnostic yield of exome sequencing-based copy number variation analysis in Mendelian disorders: a clinical application
BMC Medical Genomics (2024)
-
Whole exome sequencing analysis of 167 men with primary infertility
BMC Medical Genomics (2024)
-
A high-performance computational workflow to accelerate GATK SNP detection across a 25-genome dataset
BMC Biology (2024)
-
Rare copy-number variants as modulators of common disease susceptibility
Genome Medicine (2024)