Main

Although the genomic locations of many human hot spots have been identified, an understanding of the relationship between DNA sequence and hot-spot location remains elusive. Previously, using a genome-wide map of recombination hot spots estimated from genetic variation data, we carried out an exhaustive search of short (5- to 9-mer) motifs for enrichment in hot spots and identified two, CCTCCCT and CCCCACCCC, which were strongly overrepresented in a small fraction (10%) of human hot spots, with the first motif particularly over-represented (five- to sixfold) in THE1A/B retrotransposons within hot spots1. Direct evidence for a role of these motifs in hot-spot activity came from studies of polymorphic hot spots, where single nucleotide variants reducing crossover activity in cis at hot spots DNA2 (ref. 2) and NID1 (ref. 3) disrupt motifs CCTCCCT and CCCCACCCC, respectively.

Nevertheless, the presence of either short motif is, by itself, a poor predictor of hot spot location and fails to explain most human hot spots. There are three possible explanations. First, there may be other, completely different and perhaps longer motifs that we failed to identify. Second, the identified motifs may be specific examples of an extended degenerate family of motifs. Third, there may be no cis-acting sequence determinants for the majority of hot spots. To distinguish between these hypotheses, we have used an alternative approach to identifying hot spot–associated motifs that looks for sequence similarity between hot spot–associated regions both in repeats and in unique DNA. Furthermore, we have gained additional power by using recombination hot spots identified from the Phase II HapMap (ref. 4; 22,699 autosomal and 608 chromosome X hot spots mapped to within 5 kb).

The rationale for the approach taken is that any generic hot spot–promoting motif should operate on diverse genetic backgrounds (such as in different repeat families). We first identified classes of repeat elements that are over-represented in hot spots (see Methods) and subsequently searched for motifs that are independently associated with enhanced hot-spot presence on multiple repeat-family backgrounds. This approach revealed the presence of a common 13-bp degenerate motif CCNCCNTNNCCNC, which is related to the previously identified motif CCTCCCT (Table 1, Supplementary Tables 1, 2, 3 and Supplementary Note online). Notably, although repeats carrying the motif showed a narrow peak in average recombination rate centered at the motif, repeats with the motif-lacking consensus showed no such peak (Fig. 1). Consequently, the presence of the motif fully accounts for the enrichment of these repeat elements within hot spots.

Table 1 Repeat elements enriched in hot spots and their hot spot–associated motifs
Figure 1: A common hot-spot motif acts across different repeat families.

(a) Average recombination rates around repeat elements THE1, L2 and combined AluY, AluSc, AluSg in the genome (see Supplementary Fig. 2 online for separate plots for each Alu family and the Supplementary Note for a discussion of other Alu families). For each plot, the three lines relate to repeats carrying the identified repeat-specific hot-spot motif (red; the motif indicated by the top sequence), repeats with other sequences that match the degenerate hot-spot motif (blue; the consensus is shown in the middle line) and repeats carrying the consensus for that family (black line; sequence shown at the bottom; numbers indicate sample size). Deviations from the repeat-family consensus in sequence are indicated by colored letters. Note that we allow degeneracy in the repeat-family consensus at positions 8 and 9 of the hot-spot motif consensus. (b) Known hot spots carrying newly identified instances of the hot-spot motif. The center of the DNA3 hot spot7 occurs in an AluSg sequence containing the Alu-specific hot-spot motif, which differs from the Alu consensus at position 7 (repeat-family consensus shown below). The O1 SNP in the MS32 hot spot8, which influences hot-spot activity, occurs in an LTR10A repeat element one base pair 3′ from a match to the hot-spot consensus motif. This motif differs from the LTR10A repeat-family consensus at two positions (indicated in red). (c) Estimated recombination rates around LTR10A elements carrying the same 13-base sequence as at MS32 (red; excluding MS32), other matches to the hot-spot consensus motif (blue) and other sequences (black).

In nonrepeat DNA, using the previously identified1 motif CCTCCCT as a foundation, we identified individual flanking bases within 50 bp that are influential in determining hot-spot occurrence (see Methods). In agreement with our findings for repetitive DNA, this analysis revealed the presence of the 4-mer CCAC two bases downstream from the 7-mer to be the strongest additional determinant of hot-spot occurrence (Fig. 2a; P < 10^−30 for all four bases together, by Fisher's exact test). This implicates the core motif CCTCCCTNNCCAC for both repeat (in the case of THE1A/B and L2 elements) and nonrepeat DNA. We observed an almost identical pattern of enrichment of the 13-mer motif in hot spots on the X chromosome (Table 1), indicating that the 13-bp motif operates on multiple backgrounds in both males and females.

Figure 2: The role of flanking sequence and motif degeneracy in determining hot-spot activity.

(a) The evidence (−log10 P value from a χ² test) for a difference in base composition at nucleotides surrounding the motif CCTCCCT between narrowly defined hot spots and matched cold spots. The sequence shows the most over-represented base at each position; positions showing significant differences (P < 0.01) are blue, the fixed CCTCCCT motif is shown in red and the other specified bases within the repeat-based 13-bp motif are shown in orange. Vertical dotted lines spaced at 3-bp intervals highlight periodic occurrence of strongly signaled bases. (b) Degeneracy within the core 13-base motif estimated by comparing counts of each motif mismatching exactly 1 bp of the 13-bp core CCTCCCTNNCCAC in hot spots and matched cold spots. The combined height of the stacked letters at each position is proportional to the −log10 P value and the relative height of each letter is proportional to its over-enrichment in hot spot–associated motifs (Supplementary Methods).

Further testing of motifs occurring outside repeats and mismatching a single base of the 13-bp 'core' revealed additional degeneracy within the motif at positions 3, 6 and 12, with mismatches at these three bases still consistent with some hot-spot activity (Fig. 2b and Supplementary Methods online) and a consensus of CCNCCNTNNCCNC. These degenerate positions correspond exactly to the mismatching sites within repeat motifs (Table 1) and motifs mismatching at these locations still showed hot-spot activity, albeit reduced, across the other repeats as well (Fig. 1a). Notably, the polymorphism at the DNA2 hot spot corresponds to the first apparently degenerate position within the motif2, suggesting that site-wise degeneracy may fail to represent more complex dependencies between nucleotides.
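To make the degenerate consensus concrete, the following minimal Python sketch scans a sequence for CCNCCNTNNCCNC on both strands. It is an illustration only, not the pipeline used in this study; the function names are hypothetical, and N is taken to match any of A, C, G or T.

```python
import re

# IUPAC subset needed for the consensus: N matches any base.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "N": "[ACGT]"}
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Reverse complement of a DNA string (N stays N)."""
    return seq.translate(COMPLEMENT)[::-1]


def motif_regex(motif):
    """Compile a degenerate motif into a regex.

    The pattern is wrapped in a lookahead so that overlapping
    occurrences are all reported by finditer().
    """
    body = "".join(IUPAC[base] for base in motif)
    return re.compile("(?=(" + body + "))")


def find_motif(seq, motif="CCNCCNTNNCCNC"):
    """Return sorted (start, strand) pairs; starts are 0-based plus-strand coordinates."""
    seq = seq.upper()
    hits = [(m.start(), "+") for m in motif_regex(motif).finditer(seq)]
    # A minus-strand occurrence is a plus-strand match to the reverse complement.
    minus = motif_regex(reverse_complement(motif))
    hits += [(m.start(), "-") for m in minus.finditer(seq)]
    return sorted(hits)
```

For example, `find_motif("AAACCTCCCTAGCCACTTT")` reports a single plus-strand match starting at position 3, where the core CCTCCCTNNCCAC occurs.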

In order to estimate what proportion of the 22,699 narrowly defined autosomal hot spots require the presence of the 13-base motif, we applied a maximum likelihood approach (Supplementary Methods). Our approach attempted to account for two facts: first, that motif occurrences only stimulate a hot spot with some probability; and second, that each hot spot (defined to 3–5 kb) can contain several motifs, on different backgrounds, each of which may contribute (we assume independently) to hot-spot activity. We estimate that the location of 41.1% ± 1.4% of all human hot spots is determined by the presence of the motif (95% confidence interval estimated by bootstrapping, see Supplementary Methods). Mechanistically, this means that recruitment of crossover events to a particular hot spot requires the presence of one or more copies of the motif in 41% of hot spots.
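The independence assumption can be illustrated with a short Python sketch. If each motif copy in a hot-spot interval stimulates recombination with its own probability, the chance that at least one copy is active is one minus the product of the per-copy failure probabilities. This is not the maximum likelihood estimator itself (see Supplementary Methods); the function name and the example probabilities are hypothetical.

```python
from math import prod


def prob_any_motif_active(activation_probs):
    """Probability that at least one motif copy is active.

    activation_probs: one probability per motif copy in the interval,
    each assumed to act independently of the others.
    """
    return 1.0 - prod(1.0 - p for p in activation_probs)


# Two copies that are each active half of the time (illustrative values only).
example = prob_any_motif_active([0.5, 0.5])
```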

It is also important to know how well the presence of the motif predicts hot spots (that is, the extent to which the presence of the motif is sufficient). The penetrance of the motif varied between genetic backgrounds (Supplementary Table 4 online). For example, the presence of CCTCCCTNNCCAC in a THE1A background resulted in a detected hot spot 73% of the time, whereas in unique DNA it led to a detected hot spot 10% of the time. The high predictive power on the THE1A background is, in part, because the context of the repeat element leads to the presence of the most recombinogenic nucleotides outside the motif, as defined in Figure 2a. Overall, we found that only a fraction of hot spots are driven by highly penetrant motif–background combinations (for example, the location of 3.5% of hot spots is determined by motif–background combinations with a relative risk of ten or more; Supplementary Table 4). For the majority of hot spots, other factors (including motif density and the additional context features shown in Figure 2a) must interact with motif presence. Reports of distant or even trans effects on crossover activity5,6 also indicate that the presence of short sequence motifs is not fully sufficient to determine hot-spot activity.

To what extent is the expanded, degenerate motif responsible for determining the location of any of the 17 hot spots identified and studied using direct analysis of human sperm? In addition to the DNA2 hot spot (where the presence of the 7-mer motif was described previously1), both the DNA3 hot spot in the HLA class II region7 and the MS32 hot spot8 contained the degenerate 13-mer motif within a few base pairs of the estimated center (16 bp and 1 bp, respectively). In addition, an exact match to the 13-bp core was found within 300 bp of the estimated center of the HLA hot spot DMB2 (ref. 7). The probability of such proximity between hot-spot center and the degenerate motif occurring by chance alone is 0.0046 for DNA3, 0.0016 for MS32 and 0.059 for DMB2, although note that DMB2 contains the core motif, which occurs only once every 450 kb on average (these P values are not Bonferroni corrected, see the Supplementary Note for additional discussion and supporting evidence). In the MS32 hot spot, sperm typing has identified a single-base mutation within an LTR10A repeat element that associates with hot-spot activity8. This C/G polymorphism is located 1 bp downstream of the hot-spot motif (Fig. 1b). On the basis of our examination of sequence flanking nonrepeat motifs (Fig. 2a), we would predict that mutation to a G allele at this site would reduce crossover activity, as observed. Comparison of other LTR10A elements confirmed that the presence of the motif is specifically associated with a local increase in recombination rate (Fig. 1c).

Our results implicate the 13-mer motif in allelic crossover activity during meiosis. A natural question is whether the motif might also have a role in other forms of recombination or recombination-associated genome rearrangement, including nonallelic homologous recombination (NAHR), minisatellite mutation and repeat-associated deletion and rearrangement. To date, breakpoints of NAHR-generated rearrangements have been mapped at the sequence level for only a few diseases. In several cases, this has revealed strong breakpoint clustering into hot spots within particular genomic repeats9,10,11,12, and in three diseases (NF1 microdeletion13, Charcot-Marie-Tooth disease type 1A (CMT1A) and hereditary neuropathy with pressure palsies14), these coincide with hot spots for allelic crossover. To assess whether the identified motif could be responsible for causing NAHR events, we examined the sequence surrounding NAHR hot spots for the six diseases that satisfy the following conditions: (i) rearrangements occur within an autosome or on the X chromosome; (ii) independent, de novo events with breakpoints mapping inside homologous genomic repeats occur in different individuals with the disease; and (iii) fine mapping of breakpoints has been performed in multiple cases and reveals clustering within the hot-spot region (see Methods). In all cases, the degenerate motif CCNCCNTNNCCNC occurred within the hot-spot region of the appropriate low-copy repeat (Supplementary Fig. 1 online, P = 0.00055; Supplementary Note). Examination of secondary, weaker hot spots for Smith-Magenis syndrome and NF1 did not reveal motif presence. The case of X-linked ichthyosis is particularly notable. This recessive skin disorder (incidence of 1 in 5,000 live births) is caused by deletions of the STS gene resulting from NAHR between two of four genomic repeats on chromosome X, each of which carries a copy of the VCX gene and a tandem repeat of the core 13-mer motif (making this the most dense concentration of the 13-mer in the genome at the megabase scale, Fig. 3). Fine mapping of four breakpoints12 has revealed that all occurred precisely within the motif-rich tandem repeats.

Figure 3: Hot-spot sequence motifs at STS deletion hot spot and common mitochondrial deletion endpoints.

(a) The density of exact matches to the motif CCTCCCTNNCCAC in 5-kb windows along the first 20 Mb of chromosome X shows four grouped clusters of motif occurrences (motifs on the plus (red) and minus (blue) strands are shown separately). (b) Each motif cluster corresponds to a tandem repeat region downstream of a VCX gene family member; alternating repeats are colored in blue and green. Shown is the genome reference sequence for the first seven repeats downstream of the VCX3A gene. Matches to the hot-spot motif occur once per repeat and are shown red, underlined (mismatching bases, lower case). (c) Nonallelic homologous recombination (NAHR) between two of four homologous repeats, each containing VCX genes, removes the STS gene (red) and causes X-linked ichthyosis. The four repeats are marked as VCXxx and arrows show their orientation. NAHR between directly oriented VCX3A and VCX can delete STS. The lower plot shows the architecture of the VCX3A-containing homologous repeat (thick gray line), including the gene itself and the downstream motif-containing tandem repeat within which deletion breakpoints have been shown to cluster12. (d) The mtDNA 'common deletion'. The top line shows normal 5′ mtDNA surrounding base 8,471, the bottom line 3′ mtDNA surrounding base 13,447 and the middle sequence deletion carrying mtDNA (base matches shown by lines). The deletion occurs within a 13-bp direct repeat (underlined) of which 12 bp overlap almost exact matches to the hot-spot motif (red).

Mutational processes at hypervariable human minisatellites have been examined in depth for eight minisatellites15,16,17,18,19,20,21,22. These loci broadly fall into two classes. In one class, mutation patterns suggest initiation outside the repeat array because there is no correlation between array length and mutation rate and there is (except for the insulin minisatellite) a strong 'polarity' whereby mutation events cluster at one end of the minisatellite19,20,22. In contrast, in the other class most mutations seem to be initiated within the array itself, with a strong correlation between array length and mutation rate and no apparent polarization (Table 2). Notably, we found that for every locus in the second class the minisatellite repeat unit contained a region almost perfectly matching the core hot-spot motif (Table 2). The presence of part of the motif, CCTCCCT, within CEB1 was previously noted23. To ask whether this was likely to have occurred by chance, we calculated the motif occurrence in other human tandem repeat sequences of the same repeat length and GC content. We found that a motif match at all four loci is extremely unlikely to occur by chance (P = 10^−7 via permutation test). In contrast, none of the minisatellites in the first class showed a match to the motif, consistent with event initiation at flanking hot spots outside the array. One such locus is MS32, where the motif explains the flanking hot spot (Fig. 1b). For both minisatellite classes we found a strong local association with elevated recombination rate (Fig. 4).

Table 2 Hypervariable human minisatellites and hot-spot motifs
Figure 4: Recombination and hypermutable minisatellites.

The plot shows median recombination rate (calculated in non-overlapping windows of 250 bp) around 9 of 10 previously identified hypermutable human minisatellites28, excluding minisatellite CEB1 owing to a lack of typed flanking SNPs in HapMap. The highest median rate corresponds precisely to the minisatellite location. Only minisatellite CEB36 has a recombination rate estimate below the genome-wide average; all others have a marked local increase in recombination rate focused at or near the minisatellite.

An intriguing association between the hot-spot motif and a recurrent genome rearrangement is the 'common deletion' in mitochondria. This deletion is the most common mitochondrial rearrangement, can result in Pearson's syndrome, chronic progressive external ophthalmoplegia (CPEO) and Kearns-Sayre syndrome, and also accumulates within cells during normal human aging24. The deletion event is mediated by two 13-bp direct repeats separated by 4,977 bp, which have only one mismatch to the canonical 13-mer motif (Fig. 3d). Although mitochondria do not undergo meiosis, the motif might cause deletions by stimulating the formation of double-strand breaks during mitochondrial replication, as has been shown experimentally in mice25.

Our results provide the first evidence that a substantial fraction of human recombination hot spots share a common mechanism. Does the nature of the motif offer any clues as to the molecular basis for recombination hot spots? A notable feature of the degeneracy both within and beyond the 13-mer core is a threefold periodicity (Fig. 2a). This pattern is unlikely to reflect coding sequence because hot spots actively avoid coding regions1,26. Rather, the pattern might reflect direct interaction with flanking DNA of a motif-binding protein. The periodicity might reflect cooperative binding of a protein interacting with 3-bp DNA units, as occurs for RAD51, which promotes DNA strand exchange in humans27. However, the spacing is also reminiscent of the 3-bp binding unit of individual fingers within zinc-finger binding proteins, which can possess long consensus sequences. The identification of factors that interact with the hot-spot motif should provide further insight into the process of human recombination and its evolution.

Methods

Generating hot spots and matched cold spots.

We estimated hot spots from the HapMap Phase II data as previously described1,4. For each of the 34,142 resultant hot spots, we identified an identically sized cold-spot region on the same chromosome where there was no evidence (P = 1.0) of excess recombination activity relative to the surrounding DNA. Each cold spot was required to match the paired hot spot in terms of local GC content (to within 10%) and SNP density (to within 10%), and we chose the closest possible cold spot conditional on these constraints. We matched 99.8% of the hot spots in this manner (the remaining 0.2% were excluded from comparisons).
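The matching step can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the code used here: the `Region` type and function names are hypothetical, candidate cold regions are assumed to have already passed the recombination screen (P = 1.0), and "to within 10%" is interpreted as a relative difference.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Region:
    chrom: str
    start: int
    end: int
    gc: float           # GC fraction, 0..1
    snp_density: float  # for example, SNPs per kb


def within(value, target, tol):
    """True if value is within a relative tolerance of target (an assumed reading of 'within 10%')."""
    return abs(value - target) <= tol * abs(target)


def pick_cold_spot(hot: Region, candidates: Sequence[Region], tol: float = 0.10) -> Optional[Region]:
    """Closest same-chromosome, same-size candidate matching GC content and SNP density."""
    size = hot.end - hot.start
    eligible = [
        c for c in candidates
        if c.chrom == hot.chrom
        and c.end - c.start == size
        and within(c.gc, hot.gc, tol)
        and within(c.snp_density, hot.snp_density, tol)
    ]
    if not eligible:
        return None  # unmatched hot spots are excluded from comparisons
    hot_mid = (hot.start + hot.end) / 2
    return min(eligible, key=lambda c: abs((c.start + c.end) / 2 - hot_mid))
```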

Testing for over- and under-representation of repeat elements in hot spots.

Using the locations of all identified repeats (from the RepeatMasker track of the UCSC browser, May 2004 assembly, hg17), we tested each repeat class and family and individual repeats for differences in the extent of overlap with narrowly localized (5 kb or less in size) hot spots and cold spots. For each repeat type, we recorded the number of hot spots (and cold spots) overlapping at least one repeat of the specified type, and compared the two totals via a binomial test. P values obtained were Bonferroni corrected for multiple testing using the number of repeats represented at least 25 times in either hot spots or cold spots (Supplementary Tables 1–3).
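A minimal Python sketch of this comparison is given below. It assumes one plausible reading of the binomial test: under the null hypothesis, each overlapping region is equally likely to be a hot spot or a cold spot, so the hot-spot count follows Binomial(hot + cold, 0.5). The function names are hypothetical and only the standard library is used.

```python
from math import comb


def binomial_two_sided(hot, cold):
    """Exact two-sided P value for hot vs cold counts under Binomial(hot + cold, 0.5)."""
    n = hot + cold
    if n == 0:
        return 1.0
    k = max(hot, cold)
    # The null distribution is symmetric, so double the upper tail.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


def bonferroni(p, n_tests):
    """Bonferroni-corrected P value, capped at 1."""
    return min(1.0, p * n_tests)
```

For a repeat type seen in 10 hot spots and no cold spots, `binomial_two_sided(10, 0)` gives 2/1024; correcting for 100 tested repeat types multiplies this by 100.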

Motif testing in repeat backgrounds.

We tested for hot spot–associated motifs separately within the following repeat backgrounds with at least 25 copies in hot spots or cold spots and showing P < 0.01: THE1B, THE1A, GA-rich and CT-rich (combined), L2, AluY, THE1A-int, MIRb, C-rich and G-rich combined, LTR49, MIR, AluSg, Tigger2a, MER61A, LTR5B, MLT1D, polypurine, LTR1, AluSg1 and (CCA)n and (TGG)n combined. We also tested several backgrounds similar to over-represented backgrounds: THE1C, THE1D, Alu, AluJ/FLAM, AluJb, AluJo, AluS, AluSc, AluSg/x, AluSp, AluSp/q, AluSq, AluSq/x, AluSx, AluYa8, AluYb9 and MLT1C, yielding a total of 36 backgrounds to test. For each background, we created a 'hot' set of all occurrences of the repeat overlapping a narrowly defined hot spot. We also created a comparable 'cold' set containing occurrences of the repeat not overlapping any hot spot, with the maximal number of 'cold' repeats possible added from each chromosome in turn so as to match the size distribution of the 'hot' repeats (fraction in successive 10% size bins). On each background, we tested for differences between hot spots and cold spots for every possible nondegenerate DNA motif of length 5–9 bp via Fisher's exact test. Within each motif size, we then applied Bonferroni correction for the number of motifs, and recorded motifs showing P < 0.05. Over-represented motifs of length 7 or more were mapped against the consensus sequence for each element. This typically identified a series of overlapping segments within each element, the union of which is shown in Table 1. We tested the X chromosome separately from the autosomes.
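The exhaustive motif test can be sketched in Python as below. This is an illustration under assumptions, not the original implementation: function names are hypothetical, each repeat copy is scored as carrying or lacking the motif, sequences are assumed to be in the orientation of the repeat consensus, and the two-sided test reports depleted as well as enriched motifs.

```python
from itertools import product
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def pmf(x):
        # Hypergeometric probability of x motif carriers in the first row.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = pmf(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum all tables that are no more probable than the observed one.
    total = sum(p for p in map(pmf, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))
    return min(1.0, total)


def motif_scan(hot_seqs, cold_seqs, k, alpha=0.05):
    """Test every nondegenerate k-mer; Bonferroni-correct within this motif size."""
    n_motifs = 4 ** k
    results = []
    for kmer in map("".join, product("ACGT", repeat=k)):
        a = sum(kmer in s for s in hot_seqs)
        c = sum(kmer in s for s in cold_seqs)
        p = fisher_exact_two_sided(a, len(hot_seqs) - a, c, len(cold_seqs) - c)
        p_adj = min(1.0, p * n_motifs)
        if p_adj < alpha:
            results.append((kmer, a, c, p_adj))
    return sorted(results, key=lambda r: r[3])
```

Calling `motif_scan` once for each length from 5 to 9 reproduces the per-size correction described above; the number of candidate motifs grows as 4^k, so the longest lengths dominate the running time.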

Identifying hot-spot motifs in nonrepeat DNA.

We identified all autosomal occurrences of CCTCCCT in nonrepeat DNA surrounded by at least 50 bp of nonrepeat DNA on each side and thinned occurrences to give a minimum separation of 100 bp. We compared base composition at given positions relative to the motif with a χ² test (3 degrees of freedom), to produce the P values shown in Figure 2a. This approach showed an enrichment of CCAC 2 bp downstream from the CCTCCCT motif, as observed in repeat elements. Testing this 4-mer motif at the same location revealed an even stronger signal for enrichment of CCAC in hot-spot cases via a χ² test (OR = 2.2, P < 10^−30). This enrichment was the strongest for any 4-bp motif at any site within the 50 bp surrounding the CCTCCCT motif (results not shown).
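The thinning step and the per-position χ² test can be sketched in Python as follows. This is a minimal illustration with hypothetical function names. Greedy left-to-right thinning is an assumed way of enforcing the 100-bp minimum separation, and the P value uses the closed-form survival function of the χ² distribution with 3 degrees of freedom, so no external statistics library is needed.

```python
from math import erfc, exp, pi, sqrt


def thin_positions(positions, min_sep=100):
    """Keep positions left to right so that kept neighbours are at least min_sep apart."""
    kept = []
    for pos in sorted(positions):
        if not kept or pos - kept[-1] >= min_sep:
            kept.append(pos)
    return kept


def chi2_sf_3df(x):
    """Survival function of the chi-square distribution with 3 degrees of freedom."""
    if x <= 0:
        return 1.0
    return erfc(sqrt(x / 2)) + sqrt(2 * x / pi) * exp(-x / 2)


def base_composition_test(hot_counts, cold_counts):
    """Chi-square test on a 2x4 table of base counts at one position.

    hot_counts, cold_counts: dicts mapping A/C/G/T to counts.
    Assumes every base is observed at least once overall; otherwise
    the true degrees of freedom would be fewer than 3.
    """
    bases = "ACGT"
    hot_total = sum(hot_counts.get(b, 0) for b in bases)
    cold_total = sum(cold_counts.get(b, 0) for b in bases)
    total = hot_total + cold_total
    stat = 0.0
    for b in bases:
        h, c = hot_counts.get(b, 0), cold_counts.get(b, 0)
        for observed, row_total in ((h, hot_total), (c, cold_total)):
            expected = row_total * (h + c) / total
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
    return stat, chi2_sf_3df(stat)
```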

Motif degeneracy and estimating the proportion of hot spots explained by the motif.

Details of these analyses are available in the Supplementary Methods.

Analysis of hypermutable minisatellites.

We considered recombination rate estimates surrounding ten of the most mutable human minisatellites known (identified in ref. 28). One of these, CEB1, was excluded because of low SNP density in HapMap. Eight human minisatellites have previously been examined using minisatellite variant repeat mapping by PCR (MVR-PCR)15,16,17,18,19,20,21,22: B6.7, CEB1, MS1, MS32, MS205 and the insulin minisatellite, where male germ-line mutations have been studied in sperm, and MS31 and D7S22, where pedigree mutants have been studied. We observed an occurrence of the hot-spot motif (mismatching one base in each case) within each of the four minisatellites where previous minisatellite variant repeat mapping suggested initiation of events within the repeat array. To test whether this was evidence of enrichment of the hot-spot motif, we resampled 10^7 sets of four repeat sequences from the collection of autosomal tandem repeats (with at least 3 repeats and at least 88% homology to the repeat consensus, as seen in the set studied here; all eight MVR-PCR–studied minisatellites were included in this collection). The tandem repeats were downloaded from the Simple Repeats track of the UCSC genome browser. Each resampled repeat was chosen to match the GC content and repeat unit size (both within 5%) of the corresponding MVR-PCR–studied minisatellite. We counted how frequently all four resampled repeats matched the motif so closely, yielding P = 1 × 10^−7 for the observed data. Testing via permutation of the consensus repeat unit within each minisatellite gave a similar result (P = 2 × 10^−7).
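The resampling test can be sketched in Python as below. This is an illustration under assumptions rather than the original code: function names are hypothetical, each pool is assumed to already contain only tandem repeats matched to one studied minisatellite for GC content and unit size, only the plus strand is checked, and the repeat unit is tiled so that a motif spanning the boundary between adjacent units is found.

```python
import random

CORE = "CCTCCCTNNCCAC"


def min_mismatches(unit, motif=CORE):
    """Fewest mismatches between the motif and any window of the tandemly repeated unit."""
    unit = unit.upper()
    # Tile the unit so that windows can wrap across unit boundaries.
    tiled = unit * (len(motif) // len(unit) + 2)
    best = len(motif)
    for start in range(len(unit)):
        window = tiled[start:start + len(motif)]
        mismatches = sum(1 for m, b in zip(motif, window) if m != "N" and m != b)
        best = min(best, mismatches)
    return best


def resampling_p(pools, max_mismatch=1, n_resamples=10_000, seed=0):
    """Fraction of resampled sets in which every drawn repeat matches the motif closely.

    pools: one list of candidate repeat units per studied minisatellite.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_resamples):
        if all(min_mismatches(rng.choice(pool)) <= max_mismatch for pool in pools):
            hits += 1
    return hits / n_resamples
```

With 10^7 resamples the smallest nonzero estimate is 10^−7, so the number of resamples sets the resolution of the reported P value.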

Examination of sequence surrounds at six NAHR hot spots.

We used previous literature9,10,11,12,13,14,29,30 to identify six major disease-related nonallelic homologous recombination hot spots (CMT1A, NF1, Sotos syndrome, Smith-Magenis syndrome (SMS), Williams-Beuren syndrome and X-linked ichthyosis). Each hot spot occurs within some homologous pair of low-copy repeats (LCRs). The sequences were obtained for each respective low-copy repeat, as defined using the Segmental Duplications track of the UCSC genome browser. For each region we identified all copies of the degenerate hot-spot motif CCNCCNTNNCCNC within the first low-copy repeat (an arbitrary choice) of the pair involved in NAHR events. Finally, we curated motif occurrences, defining occurrences within Alu elements but outside of AluY, AluSc or AluSg elements as likely to be inactive in promoting recombination. This annotation was used to produce Supplementary Figure 1. To test whether our observation of degenerate motifs within all six hot spots was expected by chance, separately within each LCR we resampled 10,000 hot-spot positions (chosen so that all of the hot spot lay within the LCR), and calculated the proportion of cases in which the hot spot contained a putatively active copy of the motif. Multiplying these P values together yielded P = 0.00055. For further discussion of the motifs found within each hot spot, see the Supplementary Note.
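The placement test within a single LCR, and the combination of P values across LCRs, can be sketched in Python as follows. This is a minimal illustration with hypothetical function names and coordinates. It assumes that a hot spot "contains" a motif only when the whole 13-bp motif lies inside it, and that positions are 0-based offsets within the LCR.

```python
import random
from math import prod


def placement_p(lcr_length, hotspot_length, motif_starts,
                motif_length=13, n_resamples=10_000, seed=0):
    """Fraction of random hot-spot placements inside an LCR that contain an active motif.

    motif_starts: start offsets of putatively active motif copies within the LCR.
    """
    rng = random.Random(seed)
    max_start = lcr_length - hotspot_length  # keep the whole hot spot inside the LCR
    hits = 0
    for _ in range(n_resamples):
        start = rng.randint(0, max_start)
        end = start + hotspot_length
        if any(start <= m and m + motif_length <= end for m in motif_starts):
            hits += 1
    return hits / n_resamples


def combined_p(p_values):
    """Product of the per-LCR P values, as described in the text."""
    return prod(p_values)
```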

Note: Supplementary information is available on the Nature Genetics website.