Introduction

Each copy of the human genome is unique and differs in sequence from any other copy in the population by roughly 1 in 1,250 nucleotides1,2. This variation in DNA sequence influences individual characteristics such as physical appearance, susceptibility to disease and response to medical treatments. Sequence polymorphism also represents a fossil record of the history and structure of ancestral populations. Thus, a central goal of medical and population genetics is to understand the patterns and determinants of sequence variation in the human population.

When averaged across windows of 200 kb, rates of heterozygosity show up to a tenfold variation1. But the variation of polymorphism rates at the finer scale (<100 kb) typical of individual genes has not been described. To understand how the human genome varies at this finer scale, we have developed a framework for interpreting population genetic data and have applied it to two genome-wide data sets. This analysis provides three empirical results. First, we show that gene history is quantitatively the main force responsible for local patterns of human genome sequence variation. Second, we make the first genome-wide measurement of the correlations of nearby alleles (linkage disequilibrium). Last, we show that extreme variability in the recombination rate at a fine scale (<100 kb) is a general and major determinant of local patterns of human genome sequence variation.

Results

Regions of high and low polymorphism

We analyzed data from two genome-wide polymorphism discovery projects1 that together yielded a collection of 1.42 million single-nucleotide polymorphisms (SNPs). The SNP Consortium (TSC) selected clones of different sizes from libraries of genomic DNA (constructed from a multiethnic group of 24 individuals) and sequenced the cloned fragments from either one end ('single reads') or both ends ('paired-end reads'). SNPs were identified1,3,4 by comparing the resulting sequence with that of the draft genome produced by the Human Genome Project5. The 'BAC overlap' projects1 identified SNPs by comparing the genomic sequences5 of different individuals that were cloned into bacterial artificial chromosomes (BACs). The data sets included in this analysis were each very large and spanned the human genome, with the TSC data comprising 729 Mb (571,500 heterozygous positions) and the BAC overlap data comprising 46 Mb (37,300 heterozygous positions). When analyzed by a single suite of validated computational methods1,3,4, the average polymorphism rate across the genome was similar between the two projects and in agreement with previous estimates2,3,6,7,8,9,10: 8.01 × 10⁻⁴ for the TSC data and 7.78 × 10⁻⁴ for the BAC overlaps (Table 1).

Table 1 Summary of statistics for the data sets

A previous analysis of the TSC data examined polymorphism rate averaged over windows of 200 kb. The polymorphism rate was found to vary up to tenfold across regions1. This observed variation could not be explained simply by random sampling; that is, the magnitude of fluctuation in the polymorphism rate across loci was much greater than would be expected if the underlying rate of sequence diversity were constant across all sites1.

To better understand this variation, we examined the fluctuation in polymorphism rates at a finer scale (Methods). Despite the significant variation in heterozygosity observed at a coarse scale, we found that nearby regions show very similar rates of polymorphism (Fig. 1a). For example, segments separated by a physical distance (d) of only a few hundred base pairs show substantial correlation in the observed polymorphism rate, ρ(π^_x, π^_{x+d}) = 0.28. (The quantity ρ(π_x, π_{x+d}) is formally the 'autocorrelation' in polymorphism rate over distance d; we refer to it here as the 'correlation' for simplicity.) When corrected for the imprecision in our estimate of π^ at each location (resulting from stochastic variation in the mutational process and the short length of sequence examined in a single sequencing read), the correlation in heterozygosity is nearly complete: ρ(π_x, π_{x+d}) = 0.97 for d = 100 bp. That is, the underlying rate of sequence variation is very similar for closely linked sites. The correlation declines with distance, however, falling to half its maximum over a distance of about 8 kb (Fig. 1b). The correlation remains significant even at distances of 100 kb (ρ(π_x, π_{x+d}) = 0.20 ± 0.03), although no correlation is observed for unlinked sites (ρ(π_x, π_{x+d}) < 0.001). The size and length of the correlation are probably general, given the large amount of data examined and the agreement between the results from the two polymorphism discovery projects (Fig. 1b).

Figure 1: Correlation in heterozygosity.

a, Heterozygosity (π^) of individual sequence reads for TSC data compared with heterozygosity of the flanking sequence, which is defined as a region of 2.5 kb on each side of a read. b, Correlation in heterozygosity (corrected for stochastic variance; Methods) as a function of distance for the TSC and BAC overlap data.

Contribution of gene history and mutation rate

To explain why the tendency to be rich or poor in sequence variation persists for distances of 10–100 kb across the human genome, we considered the mechanisms that are responsible for sequence variation. At any locus in the genome, the rate of polymorphism is shaped by only two forces: the local gene history and the local mutation rate. (Natural selection, for example, acts by altering the history of a locus.) 'Gene history' refers to the genealogical relationships among copies of a locus in the current population, which can vary markedly across the genome. For any two copies of a given locus, the number of generations since their shared ancestor, and thus the historical opportunity for mutation, is τ. The expected neutral rate of sequence differences between two copies of a locus11 is simply 2τ (the number of generations in which mutations could have occurred) multiplied by μ, the mutation rate per generation at that locus. Thus, the correlation in polymorphism rates across neighboring sites must be explained by persistence over each region of a similar gene history (τ), a similar mutation rate (μ), or both.
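The relationship between gene history, mutation rate and expected diversity can be illustrated numerically. The following is a minimal sketch rather than the authors' code; the values chosen for τ and μ are hypothetical and were picked only so that their product falls near the genome-wide average polymorphism rate reported above.

```python
def expected_pairwise_diversity(tau_generations, mu_per_site_per_generation):
    """Expected neutral sequence difference per site between two copies: 2 * tau * mu."""
    return 2 * tau_generations * mu_per_site_per_generation

# Hypothetical inputs (assumptions for illustration only, not values from the paper).
tau = 20_000   # generations back to the shared ancestor of the two copies
mu = 2e-8      # mutations per site per generation

pi = expected_pairwise_diversity(tau, mu)
print(f"expected pi = {pi:.1e} (about 1 difference per {1 / pi:,.0f} bp)")
```

Doubling either τ or μ doubles the expected diversity, which is why a region can be rich in polymorphism either because it is genealogically old or because it mutates quickly.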

The extent to which particular values of τ persist is determined by the history of recombination (in the ancestors of the current sample) across each region. In the absence of meiotic recombination along a chromosome, genomic segments are inherited en bloc from generation to generation and thus share a single genealogical history (τ) across their length. By contrast, recombination events juxtapose neighboring chromosomal segments that have different histories, which disrupts the correlation of τ with distance. That is, over short distances (such that historical recombination would be unusual), local values of τ are expected to be nearly identical. With increasing physical distance, it becomes likely that one or more recombination events will have occurred in the history of the sample, and thus the correlation in τ is expected to decline.

The properties of the correlation in τ have been well described in theoretical population genetics under the assumption of a constant-sized, freely mixing population with a uniform recombination rate12,13,14,15,16,17. But previously it has not been possible to measure the empirical distribution of τ over distance in any organism. Similarly, although it is known that mutation rates can vary for different classes of sites (such as CpG dinucleotides), the empirical distribution of mutation rate variation over large genomic regions has not been reported. Thus, to understand the observed correlation in polymorphism rates with distance, we needed to measure the underlying variation in both gene histories and mutation rate.

We first wanted to show directly that the long-range correlation in polymorphism rate is, at least in part, attributable to variation in gene history. We examined the correlation in polymorphism rates for nearby loci sampled according to two different protocols (Fig. 2). We reasoned that segments carried on the same chromosome in the current population would share more similar genealogical histories than would segments separated by the same distance but carried on different chromosomes in the population. We call these two types of comparison cis (if the two segments are carried on the same physical copy of the chromosome) and trans (if the two segments are carried on different physical copies; Fig. 2). If gene history is a principal factor in local differences in polymorphism rate, then the correlation in cis should be greater in magnitude than that in trans. By contrast, if variation in mutation rate is primarily responsible for the local persistence in heterozygosity, then cis and trans measurements should be very similar. The physical distance over which differences in cis and trans persist should indicate the span over which shared history persists in the human genome.

Figure 2: Cis versus trans comparisons.

Shown is a simple genealogical history with three sample chromosomes, A, B and C. In a cis comparison, two chromosomes are aligned (A and B) and heterozygosity is compared at two segments on each chromosome separated by a distance d. In a trans comparison, a single chromosome (in this case, B) is compared with two independently sampled chromosomes (A and C) at each of the two segments. In the absence of recombination in the history of either chromosome, the time to the common ancestor at each of the two segments is identical for cis comparisons (τ1 = τ2) but not necessarily for trans comparisons.

Empirically, we found that the correlation in cis was much stronger than that in trans for both data sets (Fig. 3a). This shows directly that gene history has a great effect on the observed correlation. In addition, the cis correlation was greater than the trans correlation for all distances examined (Fig. 3a), which indicates that the same genealogical history is often preserved across substantial distances in the human genome.

Figure 3: Impact of gene history on the correlation in heterozygosity.

a, Cis correlations are much greater in magnitude than are trans, as is expected if gene history has a principal role in heterozygosity. b, Local polymorphism rate versus extent of correlation. The rate of polymorphisms is measured using the TSC data, considering a 14-kb window of sequence around each read. The extent of correlation is defined as the distance over which the trans correlation (shown in a) drops to a quarter of its maximal value. c, Local recombination rate versus extent of correlation. The recombination rate in each region is estimated by comparing the physical5 with the genetic45 map (Methods).

The role of gene history in shaping the correlation in heterozygosity was also evident in two other analyses. First, we considered the relationship between the amount of polymorphism (magnitude of π^) and the distance over which a given value of heterozygosity persists. Theory predicts that 'old' loci—those with a large average value of τ—should show both a high rate of polymorphism (because there have been more generations in which mutation could occur) and a shorter correlation over distance (because there has been more opportunity for recombination to scramble gene histories), whereas 'young' loci should show the opposite pattern. This prediction was confirmed by the data (Fig. 3b). Similarly, theory predicts that the persistence of polymorphism rate should be shorter in genomic regions where, on average, meiotic recombination (as measured at a scale of megabases on the genome-wide linkage map)18 is more active. Comparison of the persistence of polymorphism rate for regions with high and low average rates of recombination bore out this prediction (Fig. 3c).

We characterized the distribution of variation in mutation rate (μ) across genomic segments and assessed its contribution to variation in heterozygosity (π). The classical method of estimating the human mutation rate is to compare human DNA sequence with that of a great ape, assuming that in any genomic region the amount of sequence divergence that has occurred is proportional only to the local mutation rate11. (This should be true if most mutations in the genome are not subject to natural selection and the time since the common ancestor is about the same across all loci.) Using data from GenBank and our own laboratory, we obtained 1.3 Mb of human and great ape (chimpanzee and orangutan) sequence alignments drawn from a total of 28 loci (Methods). We then measured the correlation over physical distance in interspecies sequence divergence (formally, the autocorrelation ρ(μ_x, μ_{x+d}) with distance).

We found that there is substantial variation in the amount of sequence divergence across different loci (Table 1). Notably, the amount of sequence divergence at a given locus typically also persists over significant distances (tens of thousands of base pairs) and drops to zero only over distances of 70–100 kb (Fig. 4a,b). Known determinants of local mutation rate (such as GC content) explain less than 10% of the measured variation in μ (data not shown), which shows that there must be major, as yet uncharacterized, determinants of mutation rate in the human genome. We note that there may be some variation in the coalescent age of different loci, even in a comparison across species19,20. Thus, the estimate from interspecies sequence divergence represents an upper limit rather than an exact measure of the variation in mutation rate (Methods). Below, we present estimates based on this upper limit and a lower limit of no variation in mutation rate across loci.

Figure 4: Correlation in mutation rate (inferred from sequence divergence).

Great ape sequences are compared with the corresponding sequence from the human genome for 9 large-insert chimpanzee clones from GenBank (a) and 19 previously studied regions28 for both chimpanzee and orangutan sequence (b).

Measuring the correlation of gene history over distance

Using these measurements we assessed the quantitative contributions of variation in gene history and mutation rate to the local patterns of sequence variation in the human genome. For this calculation, we assumed that τ and μ are independent, which is equivalent to the common assumption of 'neutrality'21—that is, that the vast majority of mutations are not subject to natural selection. Examined using this framework, the data showed that variation in local gene history is the main determinant of local rates of human polymorphism and accounts for at least 57% of the variability in the amount of sequence diversity (π) across loci (Methods and Table 1). Variation in mutation rate (μ) across loci has a lesser role and explains at most 25% of the variation in sequence divergence at different loci (Table 1). These data represent the first empirical and quantitative evaluation of how human genome sequence variation is shaped by variation in the genealogy of genes versus the rate of mutation across the genome.

The data also show that each given genealogical history typically persists across a region of considerable size in the human genome. We compared our measurement of the correlation in gene histories with the theoretical prediction under the standard population genetic assumptions (Wright–Fisher) of a constant-sized, freely mixing population and a uniform rate of recombination (Fig. 5a)12,13,14,15. Theory predicts a much shorter persistence of genealogical age12,13,14,15, with the correlation decreasing to less than 0.5 over only 3 kb, and to less than 0.2 by 11 kb. By contrast, the empirically measured correlation in genealogy extends much further: similarity in gene history remains greater than 0.5 until 8–19 kb, and greater than 0.2 at 100 kb. These results translate into a high probability that any two modern copies of a locus have been inherited without recombination since their most recent common ancestor (Fig. 5b; see Methods for equation and assumptions). Specifically, we have estimated that there is a 38–50% chance that any two copies of a segment 10 kb in length have been inherited without historical recombination since their shared ancestor, as compared with an expectation of only 17% under the standard population genetic assumptions above. These results have significant implications for patterns of linkage disequilibrium in the human genome.
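The theoretical thresholds quoted above can be reproduced approximately with a short calculation. The sketch below assumes the standard two-locus result for a sample of two chromosomes in a constant-sized population, ρ = (R + 18)/(R² + 13R + 18) with R = 4Nrd; that the cited theory uses exactly this expression is an assumption on our part. The values N = 10,000 and 1.3 cM/Mb are those given in the legend of Fig. 5, and the function name is illustrative.

```python
def rho_tau_uniform(distance_bp, n_effective=10_000, rate_cm_per_mb=1.3):
    """Correlation in coalescence time between two sites (sample of two chromosomes,
    constant population size, uniform recombination rate)."""
    r_per_bp = rate_cm_per_mb * 1e-8  # 1 cM/Mb = 1e-8 crossovers per bp per generation
    big_r = 4 * n_effective * r_per_bp * distance_bp
    return (big_r + 18) / (big_r ** 2 + 13 * big_r + 18)

for kb in (1, 3, 8, 11, 100):
    print(f"{kb:>4} kb  rho = {rho_tau_uniform(kb * 1000):.3f}")
```

Under these assumptions the curve crosses 0.5 between 2 and 3 kb and crosses 0.2 between 10 and 11 kb, in line with the thresholds stated in the text.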

Figure 5: Correlation in gene history.

a, The simple correlation in gene history ρcis(τ_x, τ_{x+d}) (Methods). Open squares represent an upper limit and filled squares a lower limit. The upper limit is obtained by assuming that all variation in interspecies sequence divergence is attributable to variation in mutation rate; the lower is based on the assumption of a constant mutation rate across the genome. For comparison, the expectation15 for the cis correlation for a constant-sized population of N = 10,000 and a uniform recombination rate5 of 1.3 cM/Mb is shown. b, Probability of no historical recombination (since the most recent common ancestral chromosome) between two sites separated by a distance d (Methods). c, Comparison of the value of linkage disequilibrium extrapolated from the correlation in gene history to a direct measurement in Americans of European ancestry based on genotyping of 2,745 polymorphic SNPs discovered by TSC and distributed over 51 genomic regions32. The two assessments are in qualitative agreement, with both exceeding the predictions of population genetic theory.

A genome-wide measurement of linkage disequilibrium

Linkage disequilibrium refers to nonrandom statistical associations between alleles at nearby sites and is a crucial tool for mapping genes that contribute to disease22,23,24. To assess linkage disequilibrium in this genome-wide data set, we combined our empirical measurements with a recently described statistical framework (G.M., unpublished data; see below for URL). This approach relates the correlation in genealogy (measured above) to the average extent of linkage disequilibrium in the genome, without making any assumptions about population history or selection.

A common measure of linkage disequilibrium is the r² statistic, which is particularly relevant for gene mapping because its magnitude can be translated directly to the sample size that is required for an association study25. Our statistical framework provides an estimator of r², which is accurate at predicting r² for common (>10% frequency) alleles (G.M., unpublished data; see also refs 26, 27). Applying this approach to the empirical data, we obtain the first genome-wide estimate of the average extent of linkage disequilibrium in the human genome (for the samples used in polymorphism discovery). Linkage disequilibrium extends for significant distances (Fig. 5c), which are much longer than predicted under the standard assumptions described above, as has been shown in empirical studies based on direct genotyping across a much smaller fraction of the human genome sequence28,29,30,31,32. To our knowledge, these results represent the first truly genome-wide estimate of this crucial quantity for disease gene mapping.

This approach to estimating linkage disequilibrium has a significant advantage: a genome-wide estimate of linkage disequilibrium can be derived in any organism from only a measurement of heterozygosity. By sequencing DNA from several chimpanzees, for example, it should be possible to obtain simultaneously a genome sequence, a polymorphism map and a genome-wide assessment of linkage disequilibrium.

Inhomogeneous recombination in the human genome

We wanted to understand the mechanisms responsible for the unexpectedly long persistence of gene histories in the human genome. Previous work has suggested the involvement of population bottlenecks28,33,34,35,36, mixing of populations37 and hot spots of recombination38,39,40,41. We used coalescent computer simulations to understand which of these might be responsible for the observed long-range correlation in τ.

First, we compared our empirical results with those obtained by exploring a wide range of population demographic models (Methods). These allow for expansions and contractions in population size of various magnitudes at different times in the past (Fig. 6a). We found that changes in population size have only a modest effect on the correlation in gene history. In fact, we could not identify a model of population expansion or contraction that was consistent with the measured mean and variance in the age of alleles (τ), and that could generate the observed persistence in τ (Fig. 6a). This does not mean that expansions and contractions have not occurred in human history (they surely have28,33,34,35), but rather that population bottlenecks cannot be the sole explanation for the long regions of shared gene history that we observe.

Figure 6: Comparison of the observed and simulated correlation in gene history under a range of models of human demographic history and recombination.

Open and filled squares represent the upper and lower limits, respectively, on the empirical estimate of ρcis(τ_x, τ_{x+d}). a, Models of population bottlenecks and expansions do not account for the long-range correlation in genealogy. Smooth lines represent the predicted ρcis(τ_x, τ_{x+d}) curve under a range of demographic schemes that are consistent with the observed mean and variance of τ. b,c, Models of population structure do not explain the long-range correlation in genealogy. Smooth lines represent a set of substructure models that produce mean values of τ and Var(τ) that are consistent with the observed data. Only where levels of differentiation between the mixing populations37 are more extreme than any observed between human populations (gray dashed lines) is long-range correlation produced (we measure differentiation by the classic42 statistic FST). The graph in c shows that structuring would have had to occur very recently, in the past 2,000 generations, to generate the observed patterns. d, Models of recombination rate inhomogeneity at scales less than a few hundred kilobases can explain the long-range correlation in gene history. The known recombination rate inhomogeneity at coarser scales cannot explain the pattern (dotted line). In the models shown, recombination hot spots are of equal intensity and spaced on average every 10, 40 and 160 kb (Methods). A curve whose shape closely resembles the observed data can be obtained from an arbitrary mixed model (shown in red) in which 45% of recombination events occur at hot spots with average spacing of 160 kb and 55% occur uniformly across the genome.

Second, we investigated the effects of population substructure (deviations from a freely mixing population) and found that this is also unlikely to explain the long stretches of shared gene history (Fig. 6b,c). Extreme models of structuring in the current population could produce long-range correlation according to our simulations (Fig. 6b and Methods), but they required a degree of substructure much greater than any observed in modern-day human populations42. It is possible that severe structuring might have occurred in the past; however, our simulations (Fig. 6c) indicated that it would have had to persist until relatively recently (within the past 2,000 generations or 50,000 years) to produce the observed long-range correlation. As there is no genetic or archaeological evidence for such extreme structure having existed so recently (and then disappeared through mixing) in the period since the appearance of modern humans outside Africa42, we consider this possibility unlikely although we cannot rule it out.

Last, we explored whether non-uniform recombination could explain the long-range similarity in polymorphism rates. The rate of recombination in the human genome varies at large scales (>1 Mb)18, and in one part of the genome it has been observed to vary markedly at a fine scale41. (Specifically, across a 200-kb region of the major histocompatibility complex locus, 95% of all recombination events are restricted to six hot spots of <2 kb that together cover <5% of the whole region41.) The general pattern of human recombination, however, has not been characterized at a fine scale. To understand the impact of recombination patterns on the correlation in gene history, we first examined the well-known pattern of variation at multi-megabase scales18. We found that variation at such a coarse scale has only a slight effect on the correlation curve (Fig. 6d) and cannot explain the long persistence that we observe. When we modeled fine-scale variation in recombination rate, however, we found a marked effect on the correlation (Fig. 6d). Fine-scale variation in recombination rate essentially corresponds to the presence of hot spots of recombination, which can greatly increase the average extent of correlation in gene history because 'colder' regions between hot spots are subject to relatively little historical recombination. Thus, of the schemes examined in this analysis, only an extremely inhomogeneous recombination rate at short distance scales is compatible with the long persistence of shared gene history in the human genome.

Discussion

In this genome-wide, empirical exploration of the fine-scale pattern and underlying causes of sequence variation in the human genome, we have shown that the human genome is composed of large regions, as long as 100 kb, that have intrinsically different rates of sequence polymorphism. The characteristics of these regions are determined largely by differences in gene history and less by differences in local mutation rate.

Our analysis also shows that shared gene history and linkage disequilibrium typically extend over much longer distances than would be expected under standard population genetic models and assumptions. Although this result has been suggested in previous studies of a few regions28,29,30,31,32, to our knowledge this study is the first truly genome-wide assessment. Of a range of schemes examined, inhomogeneous recombination offers the best explanation for the long correlation in gene history. Variation in recombination rate at multi-megabase scales has been described18,43 and hot spots of recombination have been reported anecdotally38,40,41, but we have now shown that recombination rate inhomogeneity at a fine scale is a general feature of the genome and has a major impact on human variation. Although our results cannot specify the exact architecture of this fine-scale variation, we propose that previously described hot spots41 offer the most likely explanation.

Our results have implications for genetic association studies of human disease. Chromosomal regions inherited without recombination from shared ancestors (Fig. 5b), which are also known as haplotypes, freeze particular combinations of alleles in the population. Haplotypes are valuable for medical genetic studies because they allow mapping of disease-susceptibility alleles without the need to discover and test every SNP across each chromosomal region22,23,24. If recombination rate inhomogeneity is indeed a defining feature of linkage disequilibrium in the human genome, then SNPs interspersed in any region of low recombination will track together in the population32,39,41. Our study provides fundamental evidence that gene mapping techniques that take advantage of this phenomenon will be broadly applicable across the human genome.

Methods

TSC and BAC overlap data sets.

The data sets were subsets of those used to obtain a map of 1.42 million SNPs1. The TSC data were obtained by comparing random sequencing reads, averaging 514 bp (range 400–700 bp) of high-quality sequence, with single BAC sequences from version OO18 of the public human genome assembly5. For the subset of TSC data that included 'paired-end' reads with sequence available from two ends of cloned human inserts, it was possible to make cis comparisons; the remainder of comparisons were trans. The BAC overlap data were obtained by comparing 500-bp segments of finished sequence from BAC clones in the RPCI-11 library (which was used for most of the public sequence5) with a finished BAC from another library. Most of the BAC overlap results were therefore cis comparisons, but trans comparisons were obtained by examining triplets of overlapping BACs: one from RPCI-11, one from another RPCI or Caltech clone library, and one from any third library.

To minimize contamination by low-copy paralogous repeats (which can generate spurious stretches of high measured heterozygosity), we eliminated reads in which five or more SNPs were observed3,4. To eliminate regions near identifiable repeats, we removed from our genome-wide analysis any 200-kb window in which more than 2% of reads aligned to different genomic locations (this criterion was based on a detailed comparison of heterozygosity and repeat content). The more stringent filtering explains why the TSC heterozygosity that we report differs by 4% from a previous estimate1. (The two analyses also differ in that reads aligning to more than one BAC were omitted from the current analysis and that the contribution of each read to heterozygosity was weighted by the number of times that it was used in the calculation of the correlation.)

Heterozygosity and correlation statistics.

We calculated heterozygosity by dividing the number of SNPs observed by the number of bases for which high-quality sequence was available (defined by the neighborhood quality standard3). Correlation in polymorphism rate ρ(π_x, π_{x+d}) was calculated as described below. Cis correlation at very short distances (for the BAC overlap data) was calculated by splitting 500-bp reads into two perfectly overlapping 250-bp reads of alternating base pairs. The TSC data did not provide a pure trans comparison: 1 of 48 comparisons in the TSC data set involved (by chance) a cis comparison of two reads from the same chromosome in the pool. The trans curves shown (Fig. 3a) are therefore corrected for roughly a 2% admixture of cis data.
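The heterozygosity calculation and the alternating-base split can be sketched as follows. This is an illustration under simplifying assumptions, not the pipeline used in the study: a read is represented as a list of booleans (True where the two compared chromosomes differ), every position is treated as high-quality sequence, and the function names and SNP positions are hypothetical.

```python
def heterozygosity(snp_flags):
    """Number of SNPs observed divided by the number of bases compared."""
    return sum(snp_flags) / len(snp_flags)

def split_alternating(snp_flags):
    """Split one read into two interleaved half-reads (even and odd positions)."""
    return snp_flags[0::2], snp_flags[1::2]

# Hypothetical 500-bp read with three differences between the two chromosomes.
read = [False] * 500
for position in (37, 212, 401):
    read[position] = True

even_half, odd_half = split_alternating(read)
print(heterozygosity(read), heterozygosity(even_half), heterozygosity(odd_half))
```

Because the two half-reads interleave base by base, they cover the same stretch of the genome and provide a cis comparison at an effective separation close to zero.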

Error bars in all figures correspond to 1 s.d. and were calculated by bootstrapping: that is, new correlation curves were generated by re-sampling a random subset of the original data set44. The TSC data were re-sampled 50 times, the BAC overlap data 100 times, and the human–chimpanzee data 400 times. Because the heterozygosity of nearby reads was correlated, we carried out re-sampling by partitioning the genome into contiguous sectors. Each sector spanned 5% of the total data set for BAC overlap and GenBank human–chimpanzee comparisons, and 1% of the total data set for the TSC comparisons.
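One way to carry out such a sector-based re-sampling is sketched below. The sketch assumes that whole sectors are drawn with replacement and that the Pearson correlation is recomputed for each draw; the exact re-sampling scheme of the study may differ. The demonstration data are synthetic and all function names are illustrative.

```python
import random
import statistics

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def sector_bootstrap_sd(pairs, n_sectors, n_resamples, seed=0):
    """Standard deviation of the correlation over sector-level resamples.

    pairs: (value at x, value at x + d) tuples ordered by genomic position.
    """
    rng = random.Random(seed)
    size = -(-len(pairs) // n_sectors)  # ceiling division
    sectors = [pairs[i:i + size] for i in range(0, len(pairs), size)]
    estimates = []
    for _ in range(n_resamples):
        # Draw whole sectors with replacement so that nearby reads stay together.
        resample = [pair for _ in sectors for pair in rng.choice(sectors)]
        xs, ys = zip(*resample)
        estimates.append(pearson(xs, ys))
    return statistics.stdev(estimates)

# Synthetic demonstration data (not from the paper).
demo_rng = random.Random(1)
demo_pairs = []
for _ in range(2000):
    shared = demo_rng.gauss(0, 1)
    demo_pairs.append((shared + demo_rng.gauss(0, 1), shared + demo_rng.gauss(0, 1)))

sd = sector_bootstrap_sd(demo_pairs, n_sectors=20, n_resamples=100)
print(f"bootstrap s.d. of the correlation: {sd:.3f}")
```

Here `n_sectors=20` corresponds to sectors that each span 5% of the data; `n_sectors=100` would correspond to the 1% sectors used for the TSC comparisons.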

Mathematical formulae.

The expected polymorphism rate between two samples, with an observed rate π^ and true underlying polymorphism rate π, is:

$$E[\hat{\pi}] = E[\pi] = E[2\tau\mu] = 2\,E[\tau]\,E[\mu]$$

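The variation in observed polymorphism rate, Var(π^), can be parsed into its determinants τ, μ and L (the length of a sequencing read), according to the following equation derived in Web Note A online:

$$\mathrm{Var}(\hat{\pi}) = 4\left(\mathrm{Var}(\tau)\,\mathrm{Var}(\mu) + \mathrm{Var}(\tau)\,E[\mu]^2 + \mathrm{Var}(\mu)\,E[\tau]^2\right) + E[\pi/L]$$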

The 'stochastic variance' term E[π/L] arises because the observed value, π^, is not a perfect estimate of the underlying value π because of the limited length of sequence examined. We therefore estimated Var(π) as Var(π^) − E[π/L] (see Web Note A for details).

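The similarity (covariance) in observed polymorphism rate π^ between two sites separated by a distance d is defined by an equation of a form similar to the variance equation (see Web Note A for derivation):

$$\mathrm{Cov}(\hat{\pi}_x, \hat{\pi}_{x+d}) = 4\left(\mathrm{Cov}(\tau_x, \tau_{x+d})\,\mathrm{Cov}(\mu_x, \mu_{x+d}) + \mathrm{Cov}(\tau_x, \tau_{x+d})\,E[\mu]^2 + \mathrm{Cov}(\mu_x, \mu_{x+d})\,E[\tau]^2\right)$$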

To obtain Cov(μ_x, μ_{x+d}), we used the great ape–human comparisons, setting Cov(τ_x, τ_{x+d}) = 0.

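Correlations are equal to covariances divided by variances:

$$\rho(\alpha_x, \alpha_{x+d}) = \frac{\mathrm{Cov}(\alpha_x, \alpha_{x+d})}{\mathrm{Var}(\alpha)}$$

where α = π^, π, μ or τ.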
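The effect of removing the stochastic variance term before forming the correlation can be demonstrated with a small simulation. The sketch below rests on several assumptions that are ours, not the paper's: the underlying π at each locus follows an exponential distribution with a mean of 8 × 10⁻⁴, two 500-bp reads at a locus share exactly the same π, SNP counts per read are Poisson distributed, and E[π/L] is estimated as the mean of π^ divided by L. All names are illustrative.

```python
import math
import random
import statistics

def poisson(rng, mean):
    """Poisson draw by Knuth's multiplication method (adequate for small means)."""
    threshold, count, product = math.exp(-mean), 0, 1.0
    while True:
        product *= rng.random()
        if product <= threshold:
            return count
        count += 1

def raw_and_corrected_correlation(first, second, read_length):
    """Return (Cov / Var(pi^), Cov / (Var(pi^) - E[pi/L])) for paired reads."""
    pooled = list(first) + list(second)
    mean_pi = statistics.fmean(pooled)
    var_observed = statistics.pvariance(pooled, mean_pi)
    covariance = statistics.fmean(
        (a - mean_pi) * (b - mean_pi) for a, b in zip(first, second)
    )
    var_underlying = var_observed - mean_pi / read_length  # Var(pi^) - E[pi/L]
    return covariance / var_observed, covariance / var_underlying

rng = random.Random(0)
READ_LENGTH = 500   # bp per read (assumed)
MEAN_PI = 8e-4      # mean underlying polymorphism rate (assumed)
first, second = [], []
for _ in range(50_000):
    pi = rng.expovariate(1 / MEAN_PI)  # underlying rate shared by both reads
    first.append(poisson(rng, pi * READ_LENGTH) / READ_LENGTH)
    second.append(poisson(rng, pi * READ_LENGTH) / READ_LENGTH)

raw, corrected = raw_and_corrected_correlation(first, second, READ_LENGTH)
print(f"raw correlation = {raw:.2f}, corrected correlation = {corrected:.2f}")
```

In this setup the raw correlation is expected to fall well below 1 even though the two reads share an identical underlying rate, whereas the corrected value should be close to 1, which mirrors the contrast between the observed and corrected correlations reported in the Results.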

One of us has derived a relationship between correlation in genealogical history (τ) and linkage disequilibrium (as measured by an estimator of r² called σ_d²; G.M., unpublished data):

.

The term 'dis', or disjoint correlation, refers to cases where all four segments that are used in an assessment of similarity in gene history are from different individuals (two at position x in the genome and two others at position x + d). The disjoint curve we used in our analysis was obtained by comparing two different TSC reads that mapped to two different BAC clones.

For a constant-sized population, we derive a relationship (see Web Note A) between ρcis(τx, τx+d), ρtrans(τx, τx+d) and the probability that the current haplotype between two sites a distance d apart has been inherited without recombination from the most recent common ancestor (1 − Θ(d)).

.

Stochastic variance correction.

In our analysis, we eliminated reads with five or more SNPs (see above), which affects the stochastic variance estimate, E[π/L]. Therefore, we slightly modified the variance estimate on the basis of computer simulations of the effect of truncating reads. We made an additional small adjustment to address the 5% false-positive rate in SNP identification1 (details are available on request).

Comparison of genetic to physical map.

To assess the effect of recombination rate in reducing the extent of correlation of heterozygosity (Fig. 3c), we estimated recombination rates by comparing the genetic distances between markers obtained from the Marshfield map45 with their physical separations. (We compared markers approximately 5 Mb apart.) The first and last markers on each chromosome, and regions for which there was disagreement between physical and genetic map marker order, were discarded.
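
The comparison of genetic and physical distance amounts to a ratio in cM per Mb. The sketch below assumes markers are supplied as (physical position in Mb, genetic position in cM) pairs sorted along one chromosome, and that "approximately 5 Mb" means within a 1-Mb tolerance. The function name, the tolerance and the toy markers are hypothetical, and the filtering of end markers and discordant regions is omitted.

```python
def recombination_rates(markers, target_mb=5.0, tolerance_mb=1.0):
    """Return cM/Mb estimates for marker pairs roughly `target_mb` apart.

    `markers` is a list of (physical_position_mb, genetic_position_cm)
    tuples for one chromosome, sorted by physical position.
    """
    rates = []
    for i, (phys_i, gen_i) in enumerate(markers):
        for phys_j, gen_j in markers[i + 1:]:
            span_mb = phys_j - phys_i
            if abs(span_mb - target_mb) <= tolerance_mb:
                rates.append((gen_j - gen_i) / span_mb)
                break  # use the first partner in range for this marker
    return rates

# Toy markers for illustration only (not Marshfield data).
toy_markers = [(0.0, 0.0), (2.5, 2.0), (5.0, 6.0), (10.0, 9.0)]
toy_rates = recombination_rates(toy_markers)
```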

To determine how the known inhomogeneity in recombination rate at scales of more than 1 Mb would affect the theoretically predicted ρcis(τx, τx+d) curve (Fig. 6d), we compared the physical map of the human genome5 with the Marshfield genetic map45 and built up a histogram of recombination rates by examining all marker spacings greater than 1 Mb. We then selected values randomly from this distribution and combined them with the theoretically predicted ρcis(τx, τx+d) curve to obtain the dashed line in Fig. 6d.

Great ape–human comparisons and autocorrelation in μ.

From GenBank we obtained nine sequences from chimpanzees (accession numbers below). We aligned these to the human sequence and identified divergences through the same computational and filtering steps as above (allowing up to 9% sequence divergence across any 200-bp stretch). To test the generality of the observed patterns, we examined a second data set consisting of 180 kb of chimpanzee sequence and 110 kb of orangutan sequence that we compared with human sequence at 19 independent loci28. Altogether, our assessment was based on 18,350 sequence divergences between great apes and humans.

These three estimates (chimpanzee–human comparisons from both GenBank and the 19 loci, and orangutan–human comparisons from only the 19 loci) agreed qualitatively, with autocorrelation persisting over tens of kilobases and dropping to zero only after 70–100 kb (Fig. 4a). We used the largest data set (from GenBank) to obtain ρcisxx+d) for use in subsequent calculations, fitting a straight line to the log plot in Fig. 4a (omitting the data point at the shortest distance to minimize possible correlations owing to shared history at neighboring loci).
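
Fitting a straight line to a log plot while omitting the shortest-distance point can be sketched as follows. The sketch assumes a semi-log plot (logarithm of the correlation against distance) and strictly positive correlations; the function name and the synthetic curve are hypothetical.

```python
import numpy as np

def fit_log_linear(distances_kb, correlations):
    """Fit log(correlation) = intercept + slope * distance, skipping the first point."""
    d = np.asarray(distances_kb, dtype=float)[1:]
    rho = np.asarray(correlations, dtype=float)[1:]
    slope, intercept = np.polyfit(d, np.log(rho), 1)
    return slope, intercept

# Synthetic curve for illustration: rho = 0.5 * exp(-d / 30), with an
# inflated first point that the fit ignores.
d = np.array([1.0, 10.0, 20.0, 40.0, 80.0])
rho = 0.5 * np.exp(-d / 30.0)
rho[0] = 0.9
slope, intercept = fit_log_linear(d, rho)
```

Under this assumed form, the fitted slope corresponds to the inverse of the decay length of the autocorrelation, and the exponentiated intercept to its extrapolated value at zero distance.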

To translate our results on primate divergence into μ and τ, we required an estimate of the mean mutation rate per nucleotide per generation (we used a value of 2.5 × 10⁻⁸, obtained by calibration from the fossil record46). Errors in this estimate would not affect the conclusions of this paper and would change only four lines of Table 1 (estimates of the mean and standard deviation of μ, and of the mean and standard deviation of τ).
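
As a rough illustration of the scale involved, assume the standard relation π = 2μτ between pairwise polymorphism rate, per-generation mutation rate and time to the most recent common ancestor, with μ and τ independent. This relation is an assumption of the worked example and is not stated in this section. Using the genome-wide TSC polymorphism rate of 8.01 × 10⁻⁴ and the mutation rate given above:

$$E[\tau] \approx \frac{E[\pi]}{2\,E[\mu]} = \frac{8.01 \times 10^{-4}}{2 \times 2.5 \times 10^{-8}} \approx 1.6 \times 10^{4}\ \text{generations}$$

This back-of-the-envelope value is not taken from Table 1. It also shows why an error in the assumed mutation rate only rescales the estimates: doubling E[μ] would halve the inferred E[τ] without changing the observed polymorphism rates or their correlations.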

Computer simulations.

We first considered a freely mixing population with a constant effective size of N individuals until G generations ago, when there was a bottleneck that produced an inbreeding coefficient of F followed by a rapid expansion to very large size (Fig. 6a). We used coalescent computer simulations16 to generate ρcis(τx, τx+d) curves for a full range of models (N, G and F) that were consistent with the observed mean τ and range of the coefficient of variation (0.76–1.11; Table 1). Specifically, to identify values of N, G and F that were consistent with our data, we wrote equations for the mean τ and the coefficient of variation of τ in terms of the parameters N, G and F in our model (available on request). Solving these equations left only one free parameter, which we varied over its full range for coefficients of variation of τ of 0.76, 1 and 1.11, to explore the full range of demographic schemes that were consistent with our model and data.
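
For orientation, the simplest baseline case (constant size, no bottleneck) can be sketched in a few lines. This is not the simulation used in the analysis: it relies on the standard-coalescent approximation that the pairwise coalescence time in a constant-sized diploid population of size N is exponentially distributed with mean 2N generations, and the value N = 8,000 is hypothetical.

```python
import numpy as np

def simulate_pairwise_tau(n_diploid, n_loci, seed=0):
    """Draw pairwise coalescence times (in generations) at independent loci.

    Uses the standard-coalescent approximation for a constant-sized,
    freely mixing diploid population: exponential with mean 2N.
    """
    rng = np.random.default_rng(seed)
    return rng.exponential(scale=2 * n_diploid, size=n_loci)

def coefficient_of_variation(values):
    """Return the standard deviation divided by the mean."""
    values = np.asarray(values, dtype=float)
    return float(values.std() / values.mean())

tau = simulate_pairwise_tau(n_diploid=8_000, n_loci=100_000)
cv = coefficient_of_variation(tau)
```

Because an exponential distribution has a coefficient of variation of 1, this baseline sits inside the observed range of 0.76–1.11; departures from 1 in either direction require demographic events such as those modeled above.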

We next considered a population that was freely mixing and of constant size N until G generations ago, when it split into two populations of equal size N that remained separate until a recent mixing event in which a proportion p of samples came from the first population (Fig. 6b). We generated curves for a full range of models (N, G and p) that were consistent with the observed mean τ and range of the coefficient of variation of τ and constrained the parameters as for the simulations of bottlenecks and expansions (details of model available on request). We also modified the simulations to explore schemes in which the population substructure ended in the more distant past, that is, the populations mixed together again H = 2,000 generations ago (Fig. 6c).

Last, we considered models in which recombination occurs only at hot spots, which were assumed to be of equal intensity and distributed randomly (according to a Poisson process) with mean spacing of k kilobases, for k = 10, 40 and 160 kb (Fig. 6d). Computer simulations16 were used to calculate the expected curve for different hot-spot spacings and a constant-sized, freely mixing population.
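
One property of the Poisson hot-spot placement is easy to check numerically: the chance that an interval of length d contains no hot spot is exp(−d/k). The sketch below covers only this placement step, not the coalescent simulation itself; the function name, the 20-kb interval and the number of trials are hypothetical.

```python
import numpy as np

def hotspot_free_fraction(d_kb, mean_spacing_kb, n_trials=50_000, seed=0):
    """Fraction of random intervals of length d_kb that contain no hot spot.

    Hot spots follow a Poisson process, so the number falling in an
    interval of length d is Poisson distributed with mean d / k.
    """
    rng = np.random.default_rng(seed)
    counts = rng.poisson(d_kb / mean_spacing_kb, size=n_trials)
    return float(np.mean(counts == 0))

for k in (10, 40, 160):
    simulated = hotspot_free_fraction(d_kb=20, mean_spacing_kb=k)
    expected = np.exp(-20 / k)
    print(f"k={k} kb: simulated={simulated:.3f}, expected={expected:.3f}")
```

In a model where recombination occurs only at hot spots, two sites with no intervening hot spot cannot be separated by recombination, so wider spacing k leaves a larger share of site pairs in complete linkage at a given distance.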

URL:

Further information on the statistical method for obtaining linkage disequilibrium from correlation data can be obtained from http://www.stats.ox.ac.uk/~mcvean/ldgene.pdf.

GenBank accession numbers.

Chimpanzee sequences, AC087834.1, AC087835.1, AC087778.1, AC087736.1, AC087777.1, AC087568.1, AC087513.1, AC087264.2, AC087602.2.

Note: Supplementary information is available on the Nature Genetics website.