Abstract
De novo mutations occur at substantially different rates depending on genomic location, sequence context and DNA strand. The success of methods to estimate selection intensity, infer demographic history and map rare disease genes depends strongly on assumptions about the local mutation rate. Here we present Roulette, a genome-wide mutation rate model at basepair resolution that incorporates known determinants of local mutation rate. Roulette is shown to be more accurate than existing models. We use Roulette to refine the estimates of population growth within Europe by incorporating the full range of human mutation rates. The analysis of significant deviations from the model predictions revealed a tenfold increase in mutation rate in nearly all genes transcribed by polymerase III (Pol III), suggesting a new mutagenic mechanism. We also detected an elevated mutation rate within transcription factor binding sites, restricted to sites actively used in testis and residing in promoters.
Data availability
Polymorphism data used in the study are freely available at https://gnomad.broadinstitute.org/.
De novo mutations have been aggregated from the supplementary materials of refs. 13,18.
Mutation rate estimates for autosomes are available at http://genetics.bwh.harvard.edu/downloads/Vova/Roulette/.
Shet values, which measure gene constraint, recalculated with the help of Roulette can be found at http://genetics.bwh.harvard.edu/genescores/selection.html.
Code availability
All the code used to perform the analysis is available at https://github.com/vseplyarskiy/Roulette.
References
Hodgkinson, A. & Eyre-Walker, A. Variation in the mutation rate across mammalian genomes. Nat. Rev. Genet. 12, 756–766 (2011).
Terekhanova, N. V., Seplyarskiy, V. B., Soldatov, R. A. & Bazykin, G. A. Evolution of local mutation rate and its determinants. Mol. Biol. Evol. 34, 1100–1109 (2017).
Seplyarskiy, V. B. & Sunyaev, S. The origin of human mutation in light of genomic data. Nat. Rev. Genet. 22, 672–686 (2021).
Agarwal, I. & Przeworski, M. Signatures of replication timing, recombination, and sex in the spectrum of rare variants on the human X chromosome and autosomes. Proc. Natl Acad. Sci. USA 116, 17916–17924 (2019).
Seplyarskiy, V. B. et al. Population sequencing data reveal a compendium of mutational processes in the human germ line. Science 373, 1030–1035 (2021).
Alexandrov, L. B. et al. Signatures of mutational processes in human cancer. Nature 500, 415–421 (2013).
Karczewski, K. J. et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581, 434–443 (2020).
Ehrlich, M. et al. DNA cytosine methylation and heat-induced deamination. Biosci. Rep. 6, 387–393 (1986).
Aggarwala, V. & Voight, B. F. An expanded sequence context model broadly explains variability in polymorphism levels across the human genome. Nat. Genet. 48, 349–355 (2016).
Carlson, J. et al. Extremely rare variants reveal patterns of germline mutation rate heterogeneity in humans. Nat. Commun. 9, 3753 (2018).
Bethune, J., Kleppe, A. & Besenbacher, S. A method to build extended sequence context models of point mutations and indels. Nat. Commun. 13, 7884 (2022).
Fang, Y., Deng, S. & Li, C. A generalizable deep learning framework for inferring fine-scale germline mutation rate maps. Nat. Mach. Intell. 4, 1209–1223 (2022).
Halldorsson, B. V. et al. Characterizing mutagenic effects of recombination through a sequence-level genetic map. Science 363, eaau1043 (2019).
Goldmann, J. M. et al. Germline de novo mutation clusters arise during oocyte aging in genomic regions with high double-strand-break incidence. Nat. Genet. 50, 487–492 (2018).
Marteijn, J. A., Lans, H., Vermeulen, W. & Hoeijmakers, J. H. J. Understanding nucleotide excision repair and its roles in cancer and ageing. Nat. Rev. Mol. Cell Biol. 15, 465–481 (2014).
Seplyarskiy, V. B. et al. Error-prone bypass of DNA lesions during lagging-strand replication is a common source of germline and cancer mutations. Nat. Genet. 51, 36 (2019).
Kaplanis, J. et al. Evidence for 28 genetic disorders discovered by combining healthcare and research data. Nature 586, 757–762 (2020).
An, J.-Y. et al. Genome-wide de novo risk score implicates promoter variation in autism spectrum disorder. Science 362, eaat6576 (2018).
Satterstrom, F. K. et al. Large-scale exome sequencing study implicates both developmental and functional changes in the neurobiology of autism. Cell 180, 568–584 (2020).
Weghorn, D. et al. Applicability of the mutation-selection balance model to population genetics of heterozygous protein-truncating variants in humans. Mol. Biol. Evol. 36, 1701–1710 (2019).
Dukler, N. et al. Extreme purifying selection against point mutations in the human genome. Nat. Commun. 13, 4312 (2022).
Lee, S. Y. et al. The shaping of cancer genomes with the regional impact of mutation processes. Exp. Mol. Med. 54, 1049–1060 (2022).
Xia, B. et al. Widespread transcriptional scanning in the testis modulates gene evolution rates. Cell 180, 248–262 (2020).
Mao, P. et al. ETS transcription factors induce a unique UV damage signature that drives recurrent mutagenesis in melanoma. Nat. Commun. 9, 2626 (2018).
Perera, D. et al. Differential DNA repair underlies mutation hotspots at active promoters in cancer genomes. Nature 532, 259–263 (2016).
Sabarinathan, R. et al. Nucleotide excision repair is impaired by binding of transcription factors to DNA. Nature 532, 264–267 (2016).
Wakeley, J., Fan, W. L., Koch, E. & Sunyaev, S. Recurrent mutation in the ancestry of a rare variant. Genetics 224, iyad049 (2023).
Hodgkinson, A., Ladoukakis, E. & Eyre-Walker, A. Cryptic variation in the human mutation rate. PLoS Biol. 7, e1000027 (2009).
Seplyarskiy, V. B., Kharchenko, P., Kondrashov, A. S. & Bazykin, G. A. Heterogeneity of the transition/transversion ratio in Drosophila and Hominidae genomes. Mol. Biol. Evol. 29, 1943–1955 (2012).
Johnson, P. L. F. & Hellmann, I. Mutation rate distribution inferred from coincident SNPs and coincident substitutions. Genome Biol. Evol. 3, 842–850 (2011).
Nagelkerke, N. J. D. A note on a general definition of the coefficient of determination. Biometrika 78, 691–692 (1991).
Cassa, C. A. et al. Estimating the selective effects of heterozygous protein-truncating variants from human exome data. Nat. Genet. 49, 806–810 (2017).
Gao, F. & Keinan, A. Explosive genetic evidence for explosive human population growth. Curr. Opin. Genet. Dev. 41, 130–139 (2016).
Gutenkunst, R. N., Hernandez, R. D., Williamson, S. H. & Bustamante, C. D. Inferring the joint demographic history of multiple populations from multidimensional SNP frequency data. PLoS Genet. 5, e1000695 (2009).
Crow, J. F. & Kimura, M. An Introduction to Population Genetics Theory (The Blackburn Press, 2009).
Lek, M. et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature 536, 285–291 (2016).
Harpak, A., Bhaskar, A. & Pritchard, J. K. Mutation rate variation is a primary determinant of the distribution of allele frequencies in humans. PLoS Genet. 12, e1006489 (2016).
Agarwal, I. & Przeworski, M. Mutation saturation for fitness effects at human CpG sites. eLife 10, e71513 (2021).
Thornlow, B. P. et al. Transfer RNA genes experience exceptionally elevated mutation rates. Proc. Natl Acad. Sci. USA 115, 8996–9001 (2018).
Zhang, X.-O., Gingeras, T. R. & Weng, Z. Genome-wide analysis of polymerase III–transcribed Alu elements suggests cell-type-specific enhancer function. Genome Res. 29, 1402–1414 (2019).
Jinks-Robertson, S. & Bhagwat, A. S. Transcription-associated mutagenesis. Annu. Rev. Genet. 48, 341–359 (2014).
Abascal-Palacios, G. et al. Structural basis of RNA polymerase III transcription initiation. Nature 553, 301–306 (2018).
Reijns, M. A. M. et al. Lagging strand replication shapes the mutational landscape of the genome. Nature 518, 502–506 (2015).
Vierstra, J. et al. Global reference mapping of human transcription factor footprints. Nature 583, 729–736 (2020).
Sasani, T. A. et al. A natural mutator allele shapes mutation spectrum variation in mice. Nature 605, 497–502 (2022).
Jónsson, H. et al. Parental influence on human germline de novo mutations in 1,548 trios from Iceland. Nature 549, 519–522 (2017).
Chen, Y.-H. et al. Transcription shapes DNA replication initiation and termination in human cells. Nat. Struct. Mol. Biol. 26, 67–77 (2019).
GTEx Consortium. Genetic effects on gene expression across human tissues. Nature 550, 204–213 (2017).
Acknowledgements
We thank J. Wakeley and L. Fan for helpful suggestions on population genetics theory. We thank D. J. Balick for providing a forward Wright–Fisher simulator. This research was supported by the National Institutes of Health under grants R35-GM127131, R01-MH101244, U01-HG012009 and R01-HG010372, along with funding from NGM Biopharmaceuticals. D.J.L. was supported by NLM T15LM007092.
Author information
Contributions
V.S., E.M.K. and D.J.L. analyzed the data. V.S., E.M.K., D.J.L. and S.S. wrote the paper. All authors designed the study and read and corrected the paper. V.S., E.M.K. and D.J.L. have agreed to alternate the order of their names for respective individual citations.
Ethics declarations
Competing interests
J.S.L. and H.H.L. are employed by NGM Biopharmaceuticals Inc. V.S., E.M.K. and S.R.S. are partially funded by NGM Biopharmaceuticals Inc. D.J.L. declares no competing interests.
Peer review
Peer review information
Nature Genetics thanks Martin Taylor and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 Effect of replication fork direction on the rate of rare synonymous SNVs.
Four contexts with strongest replication asymmetry. Mutation rate calculated for the regions with the strongest replication fork polarity (top quartile). Mutation rate is relative to the least mutable strand. Error bars show 95% confidence intervals for the ratio of two Poisson variables.
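As an illustration of how an interval for the ratio of two Poisson rates can be obtained, the sketch below uses the log-scale normal (delta-method) approximation. This is one common choice and is not necessarily the procedure used for the figure; the function name, argument names and example counts are hypothetical.

```python
import math


def poisson_rate_ratio_ci(count_a, exposure_a, count_b, exposure_b, z=1.959964):
    """Approximate confidence interval for the ratio of two Poisson rates.

    Rates are count / exposure (for example, mutations per site).
    The interval is built on the log scale, where the standard error of
    log(rate_a / rate_b) is approximately sqrt(1/count_a + 1/count_b).
    z = 1.959964 gives a two-sided 95% interval.
    """
    if count_a <= 0 or count_b <= 0:
        raise ValueError("both counts must be positive for this approximation")
    ratio = (count_a / exposure_a) / (count_b / exposure_b)
    se_log = math.sqrt(1.0 / count_a + 1.0 / count_b)
    lower = ratio * math.exp(-z * se_log)
    upper = ratio * math.exp(z * se_log)
    return ratio, lower, upper


# Hypothetical counts: 200 mutations on one strand and 100 on the other,
# each observed over the same number of sites.
ratio, lower, upper = poisson_rate_ratio_ci(200, 1000, 100, 1000)
print(f"ratio = {ratio:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```

The approximation degrades for small counts, where an exact interval conditional on the total count would be a better choice.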
Extended Data Fig. 2 Roulette captures mutation rate variation associated with epigenetic features.
Ten pairs of mutation type and epigenetic feature with the strongest effects on mutation rate. To generate bins, we subdivided the genome into five equal-size bins by the value of each genomic feature and then calculated observed and expected mutation rates for each trinucleotide context among synonymous sites. This test was performed on synonymous SNVs, and mutation rates were normalized to the rate observed in the first epigenetic bin. RT stands for replication timing. Overall, we analyzed the effect of replication timing, H3K27me3, H3K27me1 and recombination.
Extended Data Fig. 3 Roulette captures accelerated mutation rate in ‘maternal’ regions.
De novo mutation rate inside and outside of maternal regions. Maternal regions are defined as in ref. 5.
Extended Data Fig. 4 Roulette predicts the rate of triallelic SNVs.
Multiple derived alleles can co-occur at the same genomic site. Using Roulette, we predicted the probability of a site containing two derived variants simultaneously (a triallelic site) by multiplying the probabilities of each derived allele (this is the correct procedure if derived alleles accumulate independently). In contrast to early studies of multiallelic variants, we do not find a deviation from independence.
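The independence calculation described in this caption can be sketched as follows. The per-allele probabilities are hypothetical inputs: how Roulette rates are converted into the probability of observing a given derived allele in a sample is not specified here, and the function names are illustrative only.

```python
def pair_cooccurrence_probability(p_allele_a, p_allele_b):
    """Probability that two derived alleles are both observed at one site,
    assuming the two alleles arise and segregate independently."""
    return p_allele_a * p_allele_b


def expected_triallelic_sites(sites):
    """Expected number of triallelic sites for a given pair of derived alleles.

    `sites` is an iterable of (p_allele_a, p_allele_b) tuples, one per site.
    Under independence the expectation is the sum of per-site products.
    """
    return sum(pair_cooccurrence_probability(pa, pb) for pa, pb in sites)


# Hypothetical per-site probabilities of observing each derived allele.
example_sites = [(0.10, 0.20), (0.50, 0.50), (0.01, 0.03)]
print(expected_triallelic_sites(example_sites))
```

Comparing this expected count with the observed number of triallelic sites is one way to check for a departure from independence.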
Extended Data Fig. 5 Pseudo-R2 for noncoding regions.
Pseudo-R2 is calculated for noncoding regions for two datasets: gnomAD v3 and UK Biobank. Since Roulette was trained on noncoding variants from gnomAD v3, it is expected that Roulette performs better for noncoding variants than for synonymous variants. The de novo sequencing and UK Biobank population sequencing datasets are independent of the training data.
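For readers unfamiliar with pseudo-R2, the sketch below computes Nagelkerke's version for a binary outcome (a site is polymorphic or not) given model-predicted probabilities. The reference list includes Nagelkerke's paper, but the exact formulation used for this figure is an assumption here, and the function names and example values are hypothetical.

```python
import math


def bernoulli_log_likelihood(outcomes, probs):
    """Log-likelihood of 0/1 outcomes under per-site probabilities in (0, 1)."""
    return sum(math.log(p) if y else math.log(1.0 - p)
               for y, p in zip(outcomes, probs))


def nagelkerke_r2(outcomes, model_probs):
    """Nagelkerke pseudo-R2 of a model against an intercept-only null.

    The null model assigns every site the overall fraction of positive
    outcomes, so `outcomes` must contain both 0s and 1s.
    """
    n = len(outcomes)
    base_rate = sum(outcomes) / n
    ll_null = bernoulli_log_likelihood(outcomes, [base_rate] * n)
    ll_model = bernoulli_log_likelihood(outcomes, model_probs)
    # Cox-Snell R2, then rescaled by its maximum attainable value.
    cox_snell = 1.0 - math.exp(2.0 * (ll_null - ll_model) / n)
    max_r2 = 1.0 - math.exp(2.0 * ll_null / n)
    return cox_snell / max_r2


# Hypothetical data: four sites, two of them polymorphic.
observed = [1, 0, 1, 0]
predicted = [0.9, 0.1, 0.9, 0.1]
print(round(nagelkerke_r2(observed, predicted), 3))
```

A model that predicts the same probability for every site scores 0, and a model that separates polymorphic from monomorphic sites perfectly approaches 1.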
Extended Data Fig. 6 An elevated number of de novo mutations at sites with observed SNVs.
Sites were divided into mutation rate bins for the three different models. De novo mutation rates were calculated from whole-genome family sequencing data. Horizontal bars represent 95% Poisson confidence intervals for the de novo mutation rate within each bin. Vertical bars represent 95% confidence intervals for the ratio of Poisson rates between SNV and non-SNV sites within each bin.
Extended Data Fig. 7 Recurrence affects site frequency spectra (SFS).
Proportion of sites in five different classes: monomorphic sites, singletons, doubletons, tripletons and other SNVs with higher allele counts. X-axis shows the per-generation mutation rate, as estimated by Roulette. The dotted line is the expected trend under the infinite sites model.
Extended Data Fig. 8 Roulette performance at different DNA regions, as annotated by ENCODE.
Observed-to-expected ratio of rare SNVs at different ENCODE annotations. PLS stands for promoters and ELS for enhancers; pPLS and pELS are proximal promoters/enhancers (less than 2 kb from a transcription start site); dPLS and dELS are distal promoters/enhancers (more than 2 kb from a transcription start site); DNase-H3K4me3 are sites that are both hypersensitive to DNase and have H3K4me3 signal; CTCF stands for CTCF binding sites. Multiple labels correspond to overlapping annotations.
Extended Data Fig. 9 Mutation rate around RNU genes.
The shaded area shows the 95% Poisson confidence interval.
Extended Data Fig. 10
Deviation from Roulette’s predictions for three hypermutable classes of genes (RNU, tRNA and immunoglobulins) and for other sites in the genome (remaining genome).
Supplementary information
Supplementary Information
Supplementary methods and Supplementary Figs. 1–12.
Supplementary Tables
Supplementary Tables 1–3.
Supplementary Data
Data to draw figures.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Seplyarskiy, V., Koch, E.M., Lee, D.J. et al. A mutation rate model at the basepair resolution identifies the mutagenic effect of polymerase III transcription. Nat Genet 55, 2235–2242 (2023). https://doi.org/10.1038/s41588-023-01562-0
DOI: https://doi.org/10.1038/s41588-023-01562-0