Main

Genome-wide association studies (GWAS) have led to the discovery of hundreds of marker loci that are associated with complex traits, including disease and quantitative phenotypes1, yet for most traits, the associated variants cumulatively explain only a small fraction of total heritability2. GWAS have provided insight into biology through the discovery of pathways that were previously not known to be involved in the trait and the discovery of genes and pathways that are common to two or more complex traits3. As an experimental design, GWAS are hypothesis generating, and typically very stringent statistical thresholds are set to control false positive rates. This approach is at the expense of the false negative rate, that is, failure to detect loci that are associated with the trait but whose effect sizes are too small to reach genome-wide statistical significance. In addition, GWAS typically use common SNP markers. If ungenotyped causal variants have a lower allele frequency than the SNPs in the GWAS, they will be in low linkage disequilibrium (LD) with common SNPs, and the effect estimated at the SNPs will be proportionally attenuated. That is, the proportion of heritability that can be captured with common SNPs depends on how well causal variants are tagged by these SNPs. For these reasons, the cumulative genetic variation accounted for by SNPs that reach genome-wide statistical significance is certain to be smaller than the total genetic variance.

An alternative to hypothesis testing is to focus on the estimation of the variance explained by all SNPs together. Recently, we showed how this may be done and estimated that 45% of phenotypic variation for human height is accounted for by common SNPs from a sample of 4,000 Australians with ancestry in the British Isles4. In a separate study, we partitioned additive variance for height onto chromosomes using within-family segregation, which captures the effects of all causal variants, and concluded that the variance was explained in proportion to chromosome length5. Here we take these studies further, using a much larger sample of 11,586 unrelated European Americans and considering a range of traits. We partitioned additive genetic variation for height, BMI, von Willebrand factor (vWF) and QTi onto the autosomes, the X chromosome and genomic segments. vWF is a large adhesive glycoprotein that circulates in plasma and is essential in hemostasis, whereas QTi is an important electrocardiographic measure related to ventricular arrhythmias and sudden death. We find that the genetic variation explained by a genomic segment is proportional to the length of DNA contained within genes in that segment. We estimate the proportion of variation due to population structure and report empirical results for the X chromosome that are consistent with full dosage compensation (X inactivation) in females in genes that affect these traits.

Results

Variance explained by all autosomal SNPs

We selected 14,347 individuals from three population-based GWAS (the Health Professionals Follow-up Study (HPFS), the Nurses' Health Study (NHS) and the Atherosclerosis Risk in Communities (ARIC) study6,7,8) and estimated the genetic relationship matrix (GRM) of all the individuals using 565,040 autosomal SNPs that passed quality control (Online Methods). We excluded one of each pair of individuals with an estimated genetic relationship >0.025 (that is, more related than third or fourth cousins) and retained a subset of 11,586 unrelated individuals. The reason for excluding related pairs is to avoid the possibility that the phenotypic resemblance between close relatives could be because of non-genetic effects (for example, shared environment) and causal variants not tagged by SNPs but captured by pedigree9,10. We then fitted the GRM in a mixed linear model (MLM) to estimate the proportion of variance explained by all the autosomal SNPs for height, BMI, vWF and QTi in each cohort and the combined data where applicable (Online Methods, Table 1 and Supplementary Table 1). Data on vWF and QTi were available from the ARIC sample only. We show that 44.8% (s.e. 2.9%) of the phenotypic variance for height can be explained by all the autosomal SNPs, which is in line with an estimate of 44.5% (s.e. 8.3%) from a similar analysis of an Australian cohort (3,925 unrelated individuals genotyped by 294,831 SNPs on Illumina arrays, in contrast to the Affymetrix arrays used in the present study)4. We show for the first time that 16.5% (s.e. 2.9%), 25.2% (s.e. 5.1%) and 20.9% (s.e. 5.0%) of variances for BMI, vWF and QTi, respectively, can be explained by all the autosomal SNPs, which is approximately tenfold, twofold and threefold larger than the variance explained by all known validated loci found by GWAS for BMI11,12,13,14, vWF15 and QTi16, respectively. 
We note that the ABO blood group locus on chromosome 9 is known to explain approximately 10% of phenotypic variation for vWF15 through modification of the amount of H antigen expression on the circulating vWF glycoprotein17,18. The estimate of $h_G^2$ (the proportion of phenotypic variance explained by all autosomal SNPs) for weight is 18.6% (s.e. 2.8%). Because of the high phenotypic correlation between BMI and weight (r = 0.92), results for these two traits are very similar. We therefore report results for BMI in the following sections and, for completeness, give all results for weight in the supplementary online material (Supplementary Figs. 1–7 and Supplementary Tables 1–13).

Table 1 Estimates of the variance explained by all autosomal SNPs for height, BMI, vWF and QTi

Genome partitioning of genetic variation

Next, we estimated the GRM from the SNPs on each autosome and partitioned the total genetic variance onto individual chromosomes by fitting the GRMs of all the chromosomes simultaneously in a joint analysis (Online Methods). We observed a strong linear relationship between the estimate of variance explained by each chromosome ($h_C^2$) and chromosome length ($L_C$, in Mb units) for height ($P = 1.4 \times 10^{-6}$ and $R^2 = 0.695$) and QTi ($P = 1.1 \times 10^{-3}$ and $R^2 = 0.422$) (Fig. 1 and Supplementary Tables 2 and 3). We mapped SNPs to 17,787 genes according to positions on the UCSC Genome Browser hg18 assembly19, 17,652 of which had at least one SNP within ±50 kb of the 5′ and 3′ untranslated regions (UTRs). There was also a significant correlation between the estimate of $h_C^2$ and the number of genes on each chromosome ($N_{g(C)}$) for height ($P = 7.9 \times 10^{-3}$) and QTi ($P = 8.1 \times 10^{-4}$) (Supplementary Table 3). Because $L_C$ and $N_{g(C)}$ are correlated (r = 0.628), we performed a multiple regression analysis of the estimate of $h_C^2$ on $L_C$ and $N_{g(C)}$, fitting chromosome length after the number of genes and vice versa. When both $L_C$ and $N_{g(C)}$ were included in the regression model, $N_{g(C)}$ was not significant, whereas $L_C$ remained significant for height ($P = 8.8 \times 10^{-5}$) and QTi ($P = 2.8 \times 10^{-4}$) (Supplementary Table 3). The regression of the estimate of $h_C^2$ on either $L_C$ or $N_{g(C)}$ was not significant for BMI and vWF. These results are consistent with the variance explained by each chromosome for height and QTi (but less so for BMI and vWF) being proportional to the fraction of the genome that the chromosome represents. Although longer chromosomes harbor more genes that are implicated in abnormal growth or skeletal development, the relationship between variance explained for height and chromosome length remained significant (P = 0.016) after fitting the number of such genes (Supplementary Fig. 1).
We provide evidence that the linear relationship between the estimate of $h_C^2$ and $L_C$ cannot be attributed to the fact that longer chromosomes have more SNPs and thereby smaller sampling errors when estimating genetic relationships between individuals (Supplementary Note and Supplementary Figs. 2 and 3).
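The regression logic above (a simple regression of per-chromosome variance on chromosome length, then a test of whether gene count adds anything once length has been fitted) can be sketched as follows. This is a minimal illustration only: the five data points are hypothetical toy values rather than estimates from this study, and the function names are invented for the example.

```python
from statistics import mean

def ols_simple(x, y):
    """Least-squares fit of y = b0 + b1 * x; returns (b0, b1, r2)."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    syy = sum((yi - my) ** 2 for yi in y)
    b1 = sxy / sxx
    b0 = my - b1 * mx
    r2 = sxy ** 2 / (sxx * syy)
    return b0, b1, r2

def residuals(x, y):
    """Residuals of y after removing its linear dependence on x."""
    b0, b1, _ = ols_simple(x, y)
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

# Hypothetical toy values (NOT from the study): one entry per chromosome.
length_mb = [250.0, 200.0, 150.0, 100.0, 50.0]
n_genes = [2000.0, 1200.0, 1100.0, 900.0, 300.0]
h2_chrom = [0.040, 0.033, 0.024, 0.017, 0.008]

# Regression of variance explained on chromosome length.
b0, b1, r2 = ols_simple(length_mb, h2_chrom)

# "Gene count fitted after length": regress the length-adjusted variance
# on the length-adjusted gene count (partial regression).
_, partial_slope, partial_r2 = ols_simple(
    residuals(length_mb, n_genes), residuals(length_mb, h2_chrom)
)
print(f"slope={b1:.2e}, R2={r2:.3f}, partial R2 for gene count={partial_r2:.3f}")
```

A small partial R2 for gene count after length has been fitted corresponds to the situation described for height and QTi, where the number of genes was no longer significant once chromosome length was in the model.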

Figure 1: Variance explained by chromosomes.

Shown are the estimates of the variance explained by each chromosome for (a) height (combined), (b) BMI (combined), (c) vWF (ARIC) and (d) QTi (ARIC) by joint analysis using 11,586 unrelated individuals against chromosome length. The numbers in the circles and squares are the chromosome numbers. The regression slopes and $R^2$ were $1.6 \times 10^{-4}$ ($P = 1.4 \times 10^{-6}$) and 0.695 for height, $2.3 \times 10^{-5}$ (P = 0.214) and 0.076 for BMI, $6.9 \times 10^{-5}$ (P = 0.524) and 0.021 for vWF, and $1.2 \times 10^{-4}$ ($P = 1.1 \times 10^{-3}$) and 0.422 for QTi, respectively.

However, genes vary greatly in size, and when we considered the length of the genes, we observed that the estimate of $h_C^2$ for height and QTi was also proportional to the total length of genes on each chromosome ($L_{g(C)}$), where gene length is defined as the physical distance between the beginning and end of the UTRs (Supplementary Fig. 4). Because the correlation between $L_C$ and $L_{g(C)}$ is extremely high (r = 0.97), we were unable to discriminate by multiple regression whether $L_C$ or $L_{g(C)}$ is causative: the regression of $h_C^2$ on $L_C$ was not significant after fitting $L_{g(C)}$ and vice versa (Supplementary Table 3). Therefore, a different analysis was required. We asked whether we could still observe a significant regression of variance explained on total gene length when the length of the genomic region was held constant. We investigated this by dividing the genome into segments of equal length (either 50 or 30 Mb) and then estimating the variance explained by each segment ($h_S^2$) in a joint analysis (Online Methods). We found that the regression of $h_S^2$ on the total gene length per segment ($L_{g(S)}$) remained significant for height, with $P = 1.7 \times 10^{-3}$ for 50-Mb segments and $P = 1.2 \times 10^{-4}$ for 30-Mb segments (Supplementary Fig. 5). The regressions of $h_S^2$ on the number of genes, the total length of exons and the number of exons on each segment were also significant in some cases, but none of the regressions were significant when fitted after $L_{g(S)}$, whereas $L_{g(S)}$ was always significant when fitted after any of them (Supplementary Table 4). These results suggest that, at least for height, genomic regions explain variation in proportion to their genic content.

To quantify these effects genome wide, we partitioned the variance explained by all the SNPs onto genic and intergenic regions of the whole genome (Online Methods). We defined the gene boundaries as ±0 kb, ±20 kb and ±50 kb of the 3′ and 5′ UTRs. A total of 213,509, 282,058 and 336,127 SNPs were located within the boundaries of 13,406, 17,277 and 17,652 protein-coding genes for the three definitions (±0 kb, ±20 kb and ±50 kb), respectively, which covered 35.8%, 49.4% and 58.7% of the genome. Some genes did not have any SNPs within them, especially if we used the most stringent definition of gene boundary (±0 kb). We tested the estimates of the variance explained by genic SNPs ($h_{Gg}^2$) and intergenic SNPs ($h_{Gi}^2$) against the values expected from the genic and intergenic coverages by a goodness of fit test. We found strong evidence for height and vWF, and less so for BMI and QTi, that genic regions proportionally explain more variation than intergenic regions (see legends of Fig. 2 and Supplementary Fig. 6). As an example, we considered the case of genes ±20 kb of the 3′ and 5′ UTRs, where genic and intergenic coverages are roughly equal (49.4% compared to 50.6%). The estimates of $h_{Gg}^2$ compared to $h_{Gi}^2$ were 32.8% versus 12.6% ($P_{OE} = 2.1 \times 10^{-10}$) for height, 22.7% versus 4.0% ($P_{OE} = 5.1 \times 10^{-4}$) for vWF, 11.7% versus 4.7% ($P_{OE}$ = 0.022) for BMI and 13.5% versus 7.5% ($P_{OE}$ = 0.251) for QTi, where $P_{OE}$ is the goodness of fit test P value of the estimated $h_{Gg}^2$/$h_{Gi}^2$ against that expected from the coverage of genic regions. We further partitioned the genetic variance onto the genic and intergenic regions of each chromosome (Online Methods). In general, the results agree with those of the whole-genome partitioning analysis in that the genic regions proportionally explained more variation (Fig. 2 and Supplementary Fig. 6). The variance attributable to chromosome 9 for vWF is dominated by the genic regions, which is expected because ABO on this chromosome explains 10% of its variance15.
However, there appear to be exceptions: for example, the intergenic regions of chromosome 2 and chromosome 5 seemed to be more important for BMI and QTi, respectively. These apparent exceptions are not conclusive because the standard errors of the estimates are large. Despite these special cases, overall, the results are consistent with causal variants being more likely to be located in the vicinity of functional genes.
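The text does not spell out the statistic behind the goodness of fit test. One plausible way to compare the genic and intergenic estimates with what coverage alone would predict is a Wald test of whether the genic share of the explained variance equals the genic share of the genome. The sketch below makes two assumptions that are not taken from the source: the two estimates are treated as independent (estimates from a joint model are usually negatively correlated), and the normal approximation holds. The function name is hypothetical, and the result is not claimed to reproduce the published P values.

```python
import math

def genic_enrichment_test(h2_genic, se_genic, h2_inter, se_inter, genic_coverage):
    """Wald test of H0: h2_genic / (h2_genic + h2_inter) == genic_coverage.

    Under H0 the contrast d = h2_genic * (1 - c) - h2_inter * c has mean zero.
    ASSUMPTION: the two estimates are independent, so variances simply add.
    """
    c = genic_coverage
    d = h2_genic * (1.0 - c) - h2_inter * c
    var_d = ((1.0 - c) * se_genic) ** 2 + (c * se_inter) ** 2
    z = d / math.sqrt(var_d)
    # Two-sided P value from the standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p

# Height, gene boundaries of +/-20 kb: estimates and coverage quoted in the text.
z, p = genic_enrichment_test(0.328, 0.024, 0.126, 0.022, 0.494)
print(f"z = {z:.2f}, P = {p:.1e}")
```

A large positive z indicates that genic regions explain more variance than their coverage of the genome would predict.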

Figure 2: Estimates of the variance explained by genic and intergenic regions on each chromosome for height by the joint analysis using 11,586 unrelated individuals in the combined dataset.

The genic region is defined as (a) ±0 kb, (b) ±20 kb and (c) ±50 kb of the 3′ and 5′ UTRs. A total of 13,406, 17,277 and 17,652 protein-coding genes had at least one SNP located within their boundaries for the three definitions (±0 kb, ±20 kb and ±50 kb), which covered 35.8%, 49.4% and 58.7% of the genome, respectively. Error bars represent the standard errors of the estimates. The estimates of variance explained by all the genic and intergenic SNPs across the whole genome, $h_{Gg}^2$ and $h_{Gi}^2$, are (a) 0.256 (s.e. 0.023) and 0.196 (s.e. 0.025) with $P_{OE} = 1.9 \times 10^{-8}$; (b) 0.328 (s.e. 0.024) and 0.126 (s.e. 0.022) with $P_{OE} = 2.1 \times 10^{-10}$; and (c) 0.379 (s.e. 0.025) and 0.08 (s.e. 0.019) with $P_{OE} = 6.0 \times 10^{-13}$, where $P_{OE}$ is the goodness of fit test P value of the estimated $h_{Gg}^2$/$h_{Gi}^2$ against that expected from the coverage of genic regions.

Quantifying the effect of population structure

To quantify the effect of population structure, we estimated the variance for each chromosome when analyzed individually and when analyzed jointly in the entire sample of 14,347 individuals (without removing cryptic relatives) and regressed the difference between these estimates on chromosome length (Online Methods). The intercept of this regression ($b_0$) appears to be due to cryptic relatedness because when we eliminated relatives with a relationship >0.025, $b_0$ declined to zero (Fig. 3). We therefore predicted that cryptic relatedness accounted for 1.5%, 0.084%, 0.22% and 0.065% (not significant) of the phenotypic variance for height, BMI, vWF and QTi, respectively, in the entire sample. The variance attributed to cryptic relatedness is independent of chromosome length because it does not require very many SNPs per chromosome to detect close relatives. Conversely, the regression slope $b_1$ appears to be due to population stratification because longer chromosomes are likely to have more ancestry informative markers (AIMs), assuming that the AIMs are randomly distributed across the genome. We then predicted that population stratification accounted for $6.9 \times 10^{-5} L_C$, $7.2 \times 10^{-6} L_C$, $-1.92 \times 10^{-6} L_C$ (not significantly different from zero) and $2.3 \times 10^{-5} L_C$ of variance for height, BMI, vWF and QTi, respectively, in the entire sample and a similar amount in the dataset of unrelated individuals (Fig. 3). The difference between $h_C^2$(sep) and $h_C^2$ represents the overall effect of all the other 21 chromosomes on one chromosome. Therefore, the proportion of variance attributed to population structure (cryptic relatedness and population stratification) across the whole genome is approximately equal to $\sum_{C=1}^{22} (b_0 + b_1 L_C)/21$, which is (1.6% + 0.91%), (0.088% + 0.095%), (0.23% + 0.0%) and (0.068% + 0.30%) for height, BMI, vWF and QTi, respectively, in the entire sample. Hence, we provide a simple approach to estimate and partition the variance attributed to population structure for complex traits.
The variances due to cryptic relatedness and population stratification depend on the data structure in the sample. Therefore, the estimates we present above are specific for the data in this study.
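The genome-wide figures quoted above can be approximated from the regression coefficients. The sketch below assumes that the genome-wide contribution is the per-chromosome prediction (intercept plus slope times chromosome length) summed over the 22 autosomes and divided by 21; this reading is consistent with the quoted intercept of 0.015 giving about 1.6% for height. The chromosome lengths used here are hypothetical equal-length placeholders, not the lengths used in the study, and the function name is invented for the example.

```python
def population_structure_share(b0, b1, chrom_lengths_mb):
    """Split genome-wide variance due to population structure into two parts.

    ASSUMPTION: total = sum over chromosomes of (b0 + b1 * L_C), divided by
    (number of chromosomes - 1), because each separate-analysis estimate picks
    up the effect of all the other chromosomes.
    """
    n = len(chrom_lengths_mb)
    cryptic_relatedness = n * b0 / (n - 1)
    stratification = b1 * sum(chrom_lengths_mb) / (n - 1)
    return cryptic_relatedness, stratification

# Intercept and slope for height in the entire sample, as quoted in the text.
# Placeholder lengths: 22 autosomes of 125 Mb each (hypothetical).
lengths = [125.0] * 22
cryptic, strat = population_structure_share(0.015, 6.9e-5, lengths)
print(f"cryptic relatedness: {cryptic:.2%}, stratification: {strat:.2%}")
```

With real chromosome lengths in place of the placeholders, the stratification term scales with the total autosomal length, while the cryptic-relatedness term depends only on the intercept.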

Figure 3: Variance due to cryptic relatedness and population stratification.

Shown is the difference between the estimates of variance explained by each chromosome by the separate and joint analyses for (a) height (combined), (b) BMI (combined), (c) vWF (ARIC) and (d) QTi (ARIC) against chromosome length. All, using all the individuals in the entire sample. Unrelated, using unrelated individuals after excluding one of each pair of individuals with an estimate of genetic relationship >0.025. The intercept and slope are 0.015 ($P = 5.5 \times 10^{-10}$) and $6.9 \times 10^{-5}$ ($P = 3.4 \times 10^{-7}$) for height; $8.4 \times 10^{-4}$ (P = 0.046) and $7.2 \times 10^{-6}$ (P = 0.020) for BMI; $2.2 \times 10^{-3}$ (P = 0.025) and $-1.9 \times 10^{-6}$ (P = 0.779) for vWF; and $6.5 \times 10^{-4}$ (P = 0.401) and $2.3 \times 10^{-5}$ ($P = 4.1 \times 10^{-4}$) for QTi in the entire sample and are 0.002 (P = 0.070) and $5.6 \times 10^{-5}$ ($P = 5.5 \times 10^{-7}$) for height; $2.9 \times 10^{-4}$ (P = 0.556) and $7.1 \times 10^{-6}$ (P = 0.054) for BMI; $1.7 \times 10^{-3}$ (P = 0.179) and $1.1 \times 10^{-6}$ (P = 0.901) for vWF; and $5.9 \times 10^{-4}$ (P = 0.523) and $2.4 \times 10^{-5}$ (P = 0.001) for QTi in unrelated individuals.

It is common to fit eigenvectors (principal components) from principal component analysis in single SNP association studies to correct for possible population structure20,21. We show that fitting the first ten principal components and one chromosome at a time or fitting all chromosomes simultaneously without fitting principal components led to similar estimates of the variance explained by each chromosome (Supplementary Fig. 7), which suggests that the majority of variance attributed to population structure is well captured by the first ten principal components in these data.

Estimation of variance explained by the X chromosome

We estimated the GRM for the X chromosome and parameterized it under three assumptions of dosage compensation9: (i) equal X-linked variance for males and females; (ii) no dosage compensation (both X chromosomes are active for females); and (iii) full dosage compensation (one of the X chromosomes is completely inactive for females). We fitted the parameterized GRMs for the X chromosome in an MLM while simultaneously estimating $h_G^2$ in the model to capture the genetic variation on the autosomes and variation due to possible population structure. For all the traits, the full dosage compensation model fits the data best and the no dosage compensation model fits worst, with the equal variance model in between (Supplementary Table 5). However, the differences in estimates were relatively small and none of them were statistically significant. Larger datasets will be required to distinguish such small differences. Under the assumption of full dosage compensation, the variance attributable to the X chromosome for females was 0.61% (s.e. 0.32%), 0.82% (s.e. 0.35%), 0.57% (s.e. 0.52%) and 0.0% (s.e. 0.48%) for height, BMI, vWF and QTi, respectively. To verify that these results reflect heterogeneous variances on the X chromosome rather than autosomal variance differences between males and females, we fitted the same dosage compensation models for the autosomes. The equal variance model fitted the data best and the full dosage compensation model was the worst fit for all the traits (Supplementary Table 6). Therefore, the data are consistent with twice as much additive genetic variation for height, BMI and vWF on the X chromosome in males as in females, which is predicted from theory under the assumption of random X inactivation22.
Although there are syndromic examples illustrating the phenotypic effect of the Lyon hypothesis (for example, Turner syndrome and Klinefelter syndrome), to our knowledge, this is the first empirical evidence from genotype-phenotype associations on complex traits that the amount of genetic variation on the X chromosome appears consistent with X-chromosome inactivation. However, the evidence is indirect and not overwhelming. Larger sample sizes and the detection of multiple associated loci on the X chromosome will be necessary to investigate the expression of genes on the X chromosome that affect the traits studied.

Comparison with known associated variants

To quantify the effect of known associated variants on the results, we included the FTO SNP rs9939609 on chromosome 16 for BMI and the ABO SNP rs612169 on chromosome 9 for vWF as a covariate when estimating $h_C^2$ by the joint analysis of all autosomes. FTO was the first locus to be detected through GWAS that is associated with BMI13, and ABO is a major determinant of vWF18. When compared to the result without adjustment, the estimate of variance due to chromosome 16 for BMI decreased from 1.19% to 0.61%, which is in line with an estimate of 0.34% to 1% of variance explained by the FTO locus for BMI in previous GWAS11,13,14 and an estimate of 0.46% from the association analysis in the present study. The estimate of variance due to chromosome 9 for vWF decreased by 11.8%, which is consistent with an estimate of 10% of variance for vWF explained by the ABO locus in GWAS15. The estimates for the other chromosomes remained the same (Supplementary Fig. 8).

The meta-analysis of 133,000 individuals by the GIANT consortium has identified 180 independent loci associated with genetic variation of height23. The estimate of $h_C^2$ by a joint analysis in our study shows a high correlation (r = 0.715 and $P = 1.8 \times 10^{-4}$) with the sum of the variance explained at the associated loci on each chromosome from the GIANT meta-analysis (Fig. 4).

Figure 4: The sum of variance explained by the GWAS associated SNPs on each chromosome in the GIANT meta-analysis of height23 against the estimate of variance explained by each chromosome for height by the joint analysis using the combined data of 11,586 unrelated individuals in the present study.

We calculated the variance explained by GWAS loci in the GIANT meta-analysis based on the result of its replication study. The regression R2 is 0.

Additional models

We fitted a number of other models to quantify the effect of having multiple phenotypic observations per individual and to test for genotype-sex interaction effects and for the effect of sample ascertainment. We also estimated the genetic correlation between height and weight. These additional models exemplify the versatility of the linear mixed model methodology used in this study. Results are shown in the Supplementary Note.

Discussion

In this study, we estimate that 45%, 17%, 25% and 21% of phenotypic variation for height, BMI, vWF and QTi, respectively, is tagged by common SNPs, and we partition this variation onto autosomes, chromosome segments and the X chromosome. We find that chromosome segments explain variation in approximate proportion to the total length of genes contained therein. Although this suggests that there are very many polymorphisms affecting these traits, the linear relationship between the estimate of variance explained and genomic length is not perfect, especially for BMI and vWF. Chromosomes with similar (genic) lengths can explain different amounts of variation (Fig. 1 and Supplementary Fig. 4), and the estimates of variance explained by genomic segments with equal length also show large variability (Supplementary Fig. 5), suggesting some granularity in the distribution of causal variants. The genetic architecture of vWF is distinct from the other traits we analyzed, as a large proportion of variance is explained by a common SNP in a single gene (ABO). We show that the variance attributed to a single major gene can be captured by all the SNPs on that chromosome or the whole genome, showing that our whole-genome and chromosome estimation approach is independent of the distribution of effect sizes. Our results provide further evidence for the highly polygenic nature of complex trait variation and that a substantial proportion of genetic variation is tagged by common SNPs4,24. These results have implications for the experimental design to detect additional variation and are informative with respect to the nature of complex trait variation.

Of the four traits studied, the largest proportion of phenotypic variance explained by the SNPs was for height and the smallest was for BMI. Why are the results for height and BMI so different? Heritability of height is approximately 80%, and we estimate that more than half of this variation (45/80 = 0.56) is tagged by common SNPs. Estimates of the narrow sense heritability of BMI appear to be more variable, ranging from 42–58% when estimated from the correlation of full brothers and fathers and sons25 to 60–80% from twin studies26. Nevertheless, even if we assume that the narrow sense heritability for BMI is 50%, then only 17/50 = 0.34 of additive genetic variation is explained by common SNPs. Given these assumptions and the standard errors listed in Table 1, the standard error of the difference in the proportion of genetic variance explained for height and BMI is approximately 0.07, so the observed difference of 0.22 appears statistically significant. These results are consistent with the proportion of phenotypic variation for height and BMI explained by genome-wide significant SNPs in that for height, about 10% of the phenotypic variance is explained, yet for BMI the phenotypic variance explained is less than 2%14,23, despite similar and large experimental sample sizes. These results imply that causal variants for BMI are in less LD with common SNPs than causal variants for height, possibly because, on average, causal variants for BMI have a lower minor allele frequency than causal variants for height. Both observations from GWAS and our analyses are consistent with the allelic architecture for BMI being different from that for height. Different evolutionary pressures on obesity (or leanness) and height could account for such differences because natural selection will result in low frequencies of alleles that are correlated with fitness27. However, we do not provide direct evidence to support this hypothesis.
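The standard error of about 0.07 quoted above can be checked with a short calculation. The sketch assumes, as a simplification not stated in the text, that the heritabilities of 80% and 50% are treated as known constants and that the two SNP-based estimates are independent, so the standard error of each ratio is the standard error of the SNP estimate divided by the heritability. The standard error of 2.9% for both traits is the one reported for the combined data.

```python
import math

def proportion_tagged(h2_snp, se_snp, h2_total):
    """Proportion of additive variance tagged by SNPs and its standard error.

    ASSUMPTION: h2_total is treated as a fixed constant with no sampling error.
    """
    return h2_snp / h2_total, se_snp / h2_total

# Rounded point estimates used in the text; s.e. of 2.9% for both traits.
p_height, se_height = proportion_tagged(0.45, 0.029, 0.80)
p_bmi, se_bmi = proportion_tagged(0.17, 0.029, 0.50)

diff = p_height - p_bmi
# ASSUMPTION: independent estimates, so the variances add.
se_diff = math.sqrt(se_height ** 2 + se_bmi ** 2)
print(f"difference = {diff:.2f}, s.e. = {se_diff:.2f}, ratio = {diff / se_diff:.1f}")
```

Under these assumptions the difference is a little over three standard errors from zero, which is the basis for describing it as statistically significant.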

If genetic variation is a function of the length of a chromosome segment occupied by genes, then this implies that causal variants are more likely to occur in the vicinity of the genes than in intergenic regions (Fig. 2 and Supplementary Fig. 6). These causal variants could either change the protein structure or regulate the expression of the gene in cis. However, regulatory elements sometimes occur a long distance away from the gene they regulate, and our results show that SNPs situated >50 kb from any gene still explain some of the variance, although they explain less than SNPs nearer to a gene. These results are consistent with analyses of published genome-wide significant SNPs for complex traits in that a substantial proportion is found in intergenic regions1.

GWAS for height, BMI, vWF and QTi to date have identified individual genetic variants that cumulatively explain about 10%, 1.5%, 13% and 7% of phenotypic variation, respectively14,15,16,23. In contrast, we show that 45%, 17%, 25% and 21%, respectively, of the variance is explained by common SNPs (Table 1). The difference between these two sets of figures is caused by SNPs that are associated with the traits but do not reach genome-wide significance. The proportion of variance explained by all the SNPs is less than the heritability because of incomplete LD between the causal polymorphisms and the SNPs. Therefore, experiments to find SNPs that pass the genome-wide significance threshold can focus on the proportion of variation that is tagged by common SNPs by increasing sample size or focus on the proportion of variation that is not tagged, for example, by considering less common variants. The former approach has been successfully done by the GIANT consortium, which reported that 10% and 1.5% of variation for height and BMI, respectively, can be accounted for by common SNPs using sample sizes of more than 100,000 (refs. 14,23). The latter will be facilitated by the 1000 Genomes Project28 and independently by efforts to sequence exomes and whole genomes. Experimental designs to discover causal variants that are in LD with common SNPs and those that interrogate less common or rare variants are complementary, and recent publications that suggest that all or most variation for disease is to be found in less common or rare (coding) variants29,30 are not consistent with empirical data, at least for a range of complex traits, including height, BMI, lipids and schizophrenia14,23,24,31. For those causal variants that are rare in the population (for example, with a frequency of less than 1%), an important but unanswered question is whether their effect sizes are large enough to be detected through conventional association analysis. 
The power of detection for a rare variant is proportional to the product of its frequency (which is small) and the square of its effect size. Hence, rare variants will be detected only if their effect sizes are large enough given their low frequency. Our results imply that there are many chromosomal regions that contain causal variants and so most must explain a small proportion of total variance. Such small contributions can be due to loci with very low minor allele frequency and large effect sizes, but our ability to detect them by association is limited by the amount of variance explained.
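The dependence on frequency and effect size can be made concrete with the standard single-locus expression for the variance explained by an additive biallelic variant, 2p(1 − p)a², where p is the allele frequency and a is the per-allele effect in phenotypic standard deviation units. This expression is textbook quantitative genetics rather than something written out in the text, and the frequencies and effect sizes below are hypothetical.

```python
def variance_explained(freq, effect_sd):
    """Variance explained by an additive biallelic variant.

    freq: allele frequency p; effect_sd: per-allele effect a, in phenotypic
    standard deviation units. For rare variants 2p(1-p) is close to 2p, so the
    variance explained is roughly proportional to frequency times effect squared.
    """
    return 2.0 * freq * (1.0 - freq) * effect_sd ** 2

# Hypothetical examples: a common variant of small effect and a rare variant
# whose effect is six times larger.
common = variance_explained(0.30, 0.05)
rare = variance_explained(0.005, 0.30)
print(f"common: {common:.5f}, rare: {rare:.5f}")
```

In this toy comparison the rare variant explains slightly less variance than the common one despite its much larger effect, so it would be no easier to detect by association at a given sample size.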

Genome partitioning methods such as those applied here help us further understand the genetic architecture of complex traits. All the methods and analyses presented in this paper have been implemented in the GCTA software9. With ever larger sample sizes, the methods we have used and those that are based upon traditional GWAS analyses will converge in inference in that we will be able to partition variation to individual loci.

URLs.

UCSC Genome Browser, http://genome.ucsc.edu/; GCTA, http://gump.qimr.edu.au/gcta/.

Methods

GWAS samples and quality control.

Details of the HPFS, NHS and ARIC cohorts have been described previously6,7,8. The GWAS data in terms of study design, sample selection and genotyping have been detailed for the HPFS and NHS37 cohorts and for the ARIC cohort8. All three cohorts have been studied as part of the GENEVA (the Gene, Environment Association Studies) project38, and this study has benefitted from using data from the consortium that have been generated and cleaned using a common protocol. We selected 6,293 individuals (2,745 cases with type 2 diabetes and 3,148 controls) from the NHS and HPFS cohorts and 15,792 individuals from the ARIC cohort. All of these selected individuals were genotyped using the Affymetrix Genome-Wide Human 6.0 array.

Of the 909,622 SNP probes, 874,517 (HPFS), 879,071 (NHS) and 841,820 (ARIC) passed quality control analysis performed by the Broad Institute and the GENEVA Coordinating Center (excluding SNPs with missing call rate ≥5% or plate association $P < 1 \times 10^{-10}$)39. We further excluded SNPs with missing rate ≥2%, >1 discordance in the duplicated samples, Hardy-Weinberg equilibrium $P < 1 \times 10^{-3}$ or minor allele frequency <0.01. A total of 687,398 (27,578), 665,163 (24,108) and 593,521 (23,664) autosomal (X chromosome) SNPs were retained for the HPFS, NHS and ARIC cohorts, respectively, 565,040 (21,858) of which were in common across the three cohorts.
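The second-stage SNP filters listed above can be expressed as a simple predicate. This is a minimal sketch: it assumes that the per-SNP summary statistics have already been computed by other software, and the record type, field names and SNP identifiers are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class SnpStats:
    """Per-SNP summary statistics (hypothetical record layout)."""
    snp_id: str
    missing_rate: float          # fraction of samples with no genotype call
    duplicate_discordances: int  # discordant calls among duplicated samples
    hwe_p: float                 # Hardy-Weinberg equilibrium test P value
    maf: float                   # minor allele frequency

def passes_qc(s: SnpStats) -> bool:
    """Apply the exclusion thresholds quoted in the text."""
    return (
        s.missing_rate < 0.02
        and s.duplicate_discordances <= 1
        and s.hwe_p >= 1e-3
        and s.maf >= 0.01
    )

snps = [
    SnpStats("snp_a", 0.001, 0, 0.50, 0.25),   # passes all filters
    SnpStats("snp_b", 0.030, 0, 0.50, 0.25),   # missing rate too high
    SnpStats("snp_c", 0.001, 0, 1e-5, 0.25),   # fails Hardy-Weinberg test
    SnpStats("snp_d", 0.001, 0, 0.50, 0.005),  # too rare
]
retained = [s.snp_id for s in snps if passes_qc(s)]
print(retained)
```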

We included only one of each set of duplicated samples and one of each pair of samples that were identified as full siblings by an initial scan of relatedness in PLINK40. We investigated population structure by PCA of all the autosomal SNPs that passed quality control and included only samples of European ancestry (Supplementary Fig. 9). We excluded samples with gender misidentification by examining the mean of the intensities of SNP probes on the X and Y chromosomes. We also excluded samples with missing call rate ≥2% and samples on two plates that showed an extremely high level of mean inbreeding coefficients. A total of 2,400 (HPFS), 3,265 (NHS) and 8,682 (ARIC) samples were retained for analysis with a combined set of 14,347 samples.

Phenotypes.

Summary statistics of the phenotypes of height, weight, BMI, vWF and QTi are shown in Supplementary Table 7. There are three measures of weight and a single measure of height in both the HPFS and NHS cohorts, four measures of weight and three measures of height in the ARIC cohort, and single measures of vWF and QTi in the ARIC cohort. For height, weight and BMI, we used the mean of repeated measures in all the analyses except for the analysis of the repeatability model. We adjusted the phenotypes (or the mean phenotype) for age and standardized it to a z score in each gender group in each of the three cohorts separately.
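The adjustment described above can be sketched as a linear regression on age followed by standardization within each cohort-by-sex group. Whether the age adjustment was carried out within each group or across groups is not stated, so fitting it within each group is an assumption here, as are the record layout and the toy values.

```python
from collections import defaultdict
from statistics import mean, pstdev

def adjust_and_standardize(records):
    """Age-adjust and z-score a phenotype within each (cohort, sex) group.

    records: list of dicts with keys 'cohort', 'sex', 'age' and 'value'.
    Returns z scores in the same order as the input.
    ASSUMPTION: a separate linear age effect is fitted in each group.
    """
    groups = defaultdict(list)
    for idx, rec in enumerate(records):
        groups[(rec["cohort"], rec["sex"])].append(idx)

    z = [0.0] * len(records)
    for idxs in groups.values():
        ages = [records[i]["age"] for i in idxs]
        vals = [records[i]["value"] for i in idxs]
        m_age, m_val = mean(ages), mean(vals)
        sxx = sum((a - m_age) ** 2 for a in ages)
        sxy = sum((a - m_age) * (v - m_val) for a, v in zip(ages, vals))
        slope = sxy / sxx if sxx > 0 else 0.0
        resid = [(v - m_val) - slope * (a - m_age) for a, v in zip(ages, vals)]
        sd = pstdev(resid)
        for i, e in zip(idxs, resid):
            z[i] = e / sd if sd > 0 else 0.0
    return z

# Hypothetical toy records (heights in cm).
toy = [
    {"cohort": "ARIC", "sex": "F", "age": 50, "value": 160.0},
    {"cohort": "ARIC", "sex": "F", "age": 55, "value": 162.0},
    {"cohort": "ARIC", "sex": "F", "age": 60, "value": 159.0},
    {"cohort": "ARIC", "sex": "F", "age": 65, "value": 161.0},
    {"cohort": "ARIC", "sex": "M", "age": 50, "value": 175.0},
    {"cohort": "ARIC", "sex": "M", "age": 55, "value": 173.0},
    {"cohort": "ARIC", "sex": "M", "age": 60, "value": 176.0},
    {"cohort": "ARIC", "sex": "M", "age": 65, "value": 172.0},
]
z_scores = adjust_and_standardize(toy)
```

Standardizing within groups removes mean and variance differences between sexes and cohorts, so those differences cannot be absorbed into the genetic variance estimate.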

Statistical analysis.

We estimated the GRM of all individuals in the combined data from all the autosomal SNPs using the method we recently developed4,9 and excluded one of each pair of individuals with an estimated genetic relationship >0.025. We then estimated the variance explained by all autosomal SNPs by restricted maximum likelihood analysis of an MLM $y = Xb + g_G + \varepsilon$, where $y$ is a vector of phenotypes, $b$ is a vector of fixed effects (for example, the first ten principal components) with its incidence matrix $X$, $g_G$ is a vector of aggregate effects of all autosomal SNPs with $\mathrm{var}(g_G) = A_G\sigma_G^2$, and $A_G$ is the GRM estimated from all autosomal SNPs. The proportion of variance explained by all autosomal SNPs is defined as $h_G^2 = \sigma_G^2/\sigma_P^2$, with $\sigma_P^2$ being the phenotypic variance.
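To make the first step concrete, the sketch below computes a SNP-based GRM and removes one member of each pair above the 0.025 cutoff. It rests on several assumptions: the standardized cross-product form $(x_{ij} - 2p_i)(x_{ik} - 2p_i)/[2p_i(1-p_i)]$ averaged over SNPs is used for every entry (the cited method treats diagonal elements with a separate formula), allele frequencies are taken from the sample, and the greedy pruning rule is only one possible choice, since the text does not say which member of a pair was dropped. The REML step itself is not shown.

```python
def grm(genotypes):
    """genotypes: list of individuals, each a list of allele counts (0/1/2) per SNP.

    Assumes every SNP is polymorphic (0 < p < 1), as after an MAF filter.
    """
    n, m = len(genotypes), len(genotypes[0])
    p = [sum(g[i] for g in genotypes) / (2 * n) for i in range(m)]
    A = [[0.0] * n for _ in range(n)]
    for j in range(n):
        for k in range(j, n):
            total = sum(
                (genotypes[j][i] - 2 * p[i]) * (genotypes[k][i] - 2 * p[i])
                / (2 * p[i] * (1 - p[i]))
                for i in range(m)
            )
            A[j][k] = A[k][j] = total / m
    return A


def prune_related(A, cutoff=0.025):
    """Greedy filter: keep an individual only if unrelated to all kept so far."""
    kept = []
    for j in range(len(A)):
        if all(A[j][k] <= cutoff for k in kept):
            kept.append(j)
    return kept


# Toy data: individuals 0 and 1 have identical genotypes.
toy = [[0, 1, 2, 1], [0, 1, 2, 1], [2, 1, 0, 1]]
A_toy = grm(toy)
unrelated = prune_related(A_toy)
```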

Furthermore, we estimated the GRM from the SNPs on each chromosome ($A_C$) and estimated the variance attributable to each chromosome by fitting the GRMs of all the chromosomes simultaneously in the model $y = Xb + \sum_{C=1}^{22} g_C + \varepsilon$, where $g_C$ is a vector of genetic effects attributable to each chromosome and $\mathrm{var}(g_C) = A_C\sigma_C^2$ (joint analysis). The proportion of variance explained by each chromosome is defined as $h_C^2 = \sigma_C^2/\sigma_P^2$. We also fitted one chromosome at a time in the model $y = Xb + g_C + \varepsilon$ (separate analysis). If there is an effect of population structure, SNPs on one chromosome will be correlated with the SNPs on the other chromosomes such that $h_C^2$ will be overestimated in the separate analysis.

We extended the joint analysis of chromosomes to that of genomic segments. We divided the genome evenly into $N_S$ segments, each $d_S$ Mb in length, and then estimated the GRM using the SNPs on each segment. We estimated the variance explained by each segment by fitting the GRMs of all the segments in an MLM $y = Xb + \sum_{S=1}^{N_S} g_S + \varepsilon$, where $g_S$ is a vector of genetic effects attributable to each segment.
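The segment assignment can be illustrated with a short sketch. It assumes that segments are laid out within each chromosome starting at position 0 and that SNP positions are integer base-pair coordinates; the text does not state how chromosome ends were handled, and the function and identifier names are hypothetical.

```python
from collections import defaultdict


def assign_segments(snps, seg_mb):
    """Group SNPs into fixed-length segments.

    snps: iterable of (snp_id, chrom, pos_bp) with integer pos_bp.
    Returns {(chrom, segment_index): [snp_id, ...]}.
    """
    seg_bp = int(seg_mb * 1_000_000)
    segments = defaultdict(list)
    for snp_id, chrom, pos_bp in snps:
        segments[(chrom, pos_bp // seg_bp)].append(snp_id)
    return dict(segments)


# Hypothetical SNPs on two chromosomes, 30-Mb segments.
example = assign_segments(
    [("a", "1", 500_000), ("b", "1", 29_999_999), ("c", "1", 30_000_000), ("d", "2", 10)],
    seg_mb=30,
)
```

One GRM would then be built from the SNPs in each group and all of them fitted jointly.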

We further partitioned the variance explained by all the SNPs onto genic ($h_{Gg}^2$) and intergenic ($h_{Gi}^2$) regions of the whole genome as well as that of each chromosome ($h_{Cg}^2$ and $h_{Ci}^2$). The gene boundaries were defined as ±$d_g$ kb away from the 3′ and 5′ UTRs. We estimated $h_{Gg}^2$ and $h_{Gi}^2$ by fitting all the genic and intergenic SNPs in an MLM $y = Xb + g_{Gg} + g_{Gi} + \varepsilon$, and estimated $h_{Cg}^2$ and $h_{Ci}^2$ by fitting the genic and nongenic SNPs on individual chromosomes in the model $y = Xb + \sum_{C=1}^{22} g_{Cg} + \sum_{C=1}^{22} g_{Ci} + \varepsilon$.
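The genic versus intergenic split can be sketched as an interval test. The sketch assumes each gene is given as a (start, end) pair in base pairs that already spans the 5′ and 3′ UTRs, and that a SNP lying exactly on the extended boundary counts as genic; the data layout and names are hypothetical.

```python
def is_genic(chrom, pos_bp, genes, flank_kb):
    """True if pos_bp lies within flank_kb kb of any gene on chrom.

    genes: {chrom: [(start_bp, end_bp), ...]} with UTR-inclusive boundaries.
    """
    flank = int(flank_kb * 1000)
    return any(
        start - flank <= pos_bp <= end + flank
        for start, end in genes.get(chrom, [])
    )


def split_genic(snps, genes, flank_kb):
    """Partition (snp_id, chrom, pos_bp) records into genic and intergenic ID lists."""
    genic, intergenic = [], []
    for snp_id, chrom, pos_bp in snps:
        target = genic if is_genic(chrom, pos_bp, genes, flank_kb) else intergenic
        target.append(snp_id)
    return genic, intergenic
```

A separate GRM would then be computed from each of the two SNP lists.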

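We estimated the variance attributable to the X chromosome using the method we recently developed9. In brief, we estimated the GRM for the X chromosome ($A_X$) using the following equations

$A_{jk}^{M} = \frac{1}{N}\sum_{i}\frac{(x_{ij}^{M} - p_i)(x_{ik}^{M} - p_i)}{p_i(1 - p_i)}$ for a male-male pair,

$A_{jk}^{F} = \frac{1}{N}\sum_{i}\frac{(x_{ij}^{F} - 2p_i)(x_{ik}^{F} - 2p_i)}{2p_i(1 - p_i)}$ for a female-female pair and

$A_{jk}^{MF} = \frac{1}{N}\sum_{i}\frac{(x_{ij}^{M} - p_i)(x_{ik}^{F} - 2p_i)}{\sqrt{2}\,p_i(1 - p_i)}$ for a male-female pair, where $x_{ij}^{M}$ and $x_{ij}^{F}$ are the number of copies of the reference allele for an X chromosome SNP for a male and a female, respectively, $p_i$ is the frequency of the reference allele and $N$ is the number of SNPs. Assuming the male-female genetic correlation to be 1, the X-linked phenotypic covariance is $\mathrm{cov}_X(y_j^{M}, y_k^{M}) = A_{jk}^{M}\sigma_{X(M)}^2$ for a male-male pair, $\mathrm{cov}_X(y_j^{F}, y_k^{F}) = A_{jk}^{F}\sigma_{X(F)}^2$ for a female-female pair or $\mathrm{cov}_X(y_j^{M}, y_k^{F}) = A_{jk}^{MF}\sigma_{X(M)}\sigma_{X(F)}$ for a male-female pair22,41, where $\sigma_{X(M)}^2$ and $\sigma_{X(F)}^2$ are X-linked genetic variances for males and females, respectively. Assumptions about inactivity of the X chromosome (dosage compensation) impose a relationship between $\sigma_{X(M)}^2$ and $\sigma_{X(F)}^2$, which allows a single variance component to account for the X-linked genetic variance for both sexes. Therefore, we can express the X-linked phenotypic covariances as $\mathrm{cov}_X(y_j^{M}, y_k^{M}) = A_{jk}^{M}\sigma_{X(M)}^2$, $\mathrm{cov}_X(y_j^{F}, y_k^{F}) = dA_{jk}^{F}\sigma_{X(M)}^2$ and $\mathrm{cov}_X(y_j^{M}, y_k^{F}) = \sqrt{d}A_{jk}^{MF}\sigma_{X(M)}^2$, where $d$ is the lyonization coefficient, $d = \sigma_{X(F)}^2/\sigma_{X(M)}^2$, which takes 1 under the hypothesis of equal X-linked genetic variance for both sexes, takes 2 under the hypothesis of no dosage compensation (both X chromosomes are active for females) and takes 1/2 under the hypothesis of full dosage compensation (complete inactivity of one X chromosome for females) (Supplementary Note). In the analysis of MLM, we took the lyonization coefficient into account by parameterizing the raw $A_X$ matrix, meaning $A_{jk}^{M}$ for male pairs, $dA_{jk}^{F}$ for female pairs and $\sqrt{d}A_{jk}^{MF}$ for male-female pairs. We estimated $h_X^2$ under the three hypotheses by fitting the parameterized GRM for the X chromosome conditional on the GRM estimated from all autosomal SNPs in an MLM $y = Xb + g_G + g_X + \varepsilon$, where $g_X$ is a vector of X-linked genetic effects with $\mathrm{var}(g_X) = A_X\sigma_X^2$.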
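A small sketch may help fix ideas about the X chromosome GRM and its parameterization. It makes explicit assumptions: males are coded 0/1 and females 0/1/2 for the reference allele, deviations are standardized by $p_i(1-p_i)$, $2p_i(1-p_i)$ or $\sqrt{2}\,p_i(1-p_i)$ for male-male, female-female and male-female pairs, and the lyonization coefficient $d$ rescales female-female entries by $d$ and male-female entries by $\sqrt{d}$. The function names are hypothetical and the REML fit is not shown.

```python
from math import sqrt


def x_grm_entry(xj, sex_j, xk, sex_k, p):
    """Raw X chromosome relationship between two individuals.

    xj, xk: reference-allele counts per SNP (0/1 for "M", 0/1/2 for "F").
    p: reference-allele frequency per SNP, assumed strictly between 0 and 1.
    """
    if sex_j == "M" and sex_k == "M":
        scale = 1.0
    elif sex_j == "F" and sex_k == "F":
        scale = 2.0
    else:
        scale = sqrt(2.0)
    total = 0.0
    for a, b, pi in zip(xj, xk, p):
        dev_j = a - (pi if sex_j == "M" else 2 * pi)
        dev_k = b - (pi if sex_k == "M" else 2 * pi)
        total += dev_j * dev_k / (scale * pi * (1 - pi))
    return total / len(p)


def lyonization_scale(sex_j, sex_k, d):
    """Multiplier applied to the raw entry for a given lyonization coefficient d.

    Under the stated assumptions: d = 1 (equal variance), d = 2 (no dosage
    compensation), d = 0.5 (full dosage compensation).
    """
    if sex_j == "M" and sex_k == "M":
        return 1.0
    if sex_j == "F" and sex_k == "F":
        return float(d)
    return sqrt(d)
```

Each raw entry multiplied by its scale gives the parameterized matrix that would be fitted alongside the autosomal GRM.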

Variance attributed to population structure.

Mixed linear model methods are useful to control for population structure in GWAS42,43. Population structure in the data causes correlations of SNPs on different chromosomes. Consequently, fitting only one chromosome in the model (separate analysis) also captures some of the variance caused by other chromosomes so that the estimate of variance explained by each chromosome from the separate analysis is biased upwards. The joint analysis has the advantage of protecting against such inter-chromosomal correlations because the estimate of each $h_C^2$ is conditional on the other chromosomes in the model so that the estimates of variance explained by different chromosomes are independent of each other. We therefore can calculate the variance attributable to population structure by comparing the estimates between $h_C^2(\mathrm{sep})$ and $h_C^2$. The inter-chromosomal SNP correlations occur for two reasons: (i) cryptic relatedness (for example, unexpected cousins), because closely related individuals will share SNPs identical by descent on more than one chromosome; or (ii) systematic difference in allele frequencies between subpopulations (population stratification). We modeled the variance attributed to these two forms of population structure as $h_C^2(\mathrm{sep}) - h_C^2 = b_0 + b_1 L_C + e$, where $L_C$ is the length of chromosome $C$ and the slope $b_1$ allows for the possibility that longer chromosomes track population structure better than smaller chromosomes.
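The comparison between separate and joint estimates amounts to a simple linear regression across chromosomes. The sketch below assumes the per-chromosome difference between the separate and joint estimates is regressed on chromosome length by ordinary least squares; the input numbers are synthetic placeholders chosen to lie exactly on a line, not results from the study.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = b0 + b1 * x; returns (b0, b1)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


# Synthetic example: chromosome lengths (Mb) and differences between the
# separate-analysis and joint-analysis estimates, constructed for illustration.
lengths_mb = [50.0, 100.0, 150.0, 250.0]
diffs = [0.001 + 0.00002 * length for length in lengths_mb]
b0, b1 = fit_line(lengths_mb, diffs)
```

Here `b0` is the intercept and `b1` the per-Mb slope of the fitted line.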