Introduction

Multiple inquiries into the genetic etiology of complex human traits have indicated that, for a number of phenotypes, the genetic variants affecting continuous, polygenic phenotypic variation may be concentrated in the same genes as those giving rise to monogenic (ie, Mendelian) disorders. For instance, 180 loci associated with normal variation in the classic polygenic trait of adult height were shown to be enriched in genes underlying skeletal growth disorders.1 Many rare genetic variants in three candidate genes (ABCA1, APOA1, and LCAT), which give rise to pathologically low levels of HDL cholesterol in plasma, are also found in individuals with the common, complex version of the low-HDL-cholesterol trait.2, 3, 4 Genes underlying Mendelian disorders of lipid levels and genes affecting normal lipid concentrations overlap almost entirely.5 Other examples include hemoglobin F levels,5 fat mass,6 type 2 diabetes,5, 7, 8 and Parkinson’s disease.9, 10

Genes underlying Mendelian disorders, in which protein functioning is severely altered, may therefore provide an opportunity to localize and understand the genetic variability that underlies susceptibility to a similar common polygenic phenotype.2 In the present study, we utilize this idea to examine the effects of 43 genes implicated in autosomal recessive cognitive disorders11 on intelligence in a Dutch sample from the general population (N=1316; see Materials and Methods, and Supplementary Figures S1 and S2). Although intelligence is one of the most heritable human traits (with heritability estimates ranging from 0.6 to 0.8 in adolescence and adulthood12, 13), no loci consistently associated with normal-range variation in intelligence have thus far been reported.13, 14, 15, 16 The two largest genome-wide association studies (GWAS) to date, conducted in adults and children, respectively, failed to find replicable genome-wide significant associations in SNP-based analyses.15, 17 The 43 genes considered in the present study are a subset of the genes identified in a recent study that used homozygosity mapping, exon enrichment, and next-generation sequencing in consanguineous families with autosomal-recessive intellectual disabilities to identify single, presumably disease-causing variants in 50 novel candidate genes.11 The genome-wide data set of the Netherlands Twin Register (NTR),18 used in the present study, contains SNP data on 43 of these 50 genes (Table 1), including 1227 genotyped SNPs in total (Supplementary Table S1).

Table 1 Chromosomal position (hg19), length, and number of genotyped SNPs for the 43 genes

Materials and Methods

Sample

The data were obtained from the NTR.18, 19 The NTR is a population-based register of Dutch twins recruited at birth and measured longitudinally at ages 1 through 18. The sample consisted of 1316 individuals from 662 families (978 twins, 231 siblings, and 107 of their parents). To keep the genetic within-family covariance matrix approximately compound symmetric (ie, to keep the genetic covariances between all types of relatives approximately equal), the data were selected so as to contain no complete MZ twin pairs and no more than one parent per family. Thus, each family consisted of individuals who were genetically either siblings or parent-offspring; that is, the expected genetic correlation between any given pair of family members was 0.5. The observed intraclass correlation between the family members was 0.57 (SE=0.025). In all, 45.8% of the sample were males. The mean ages of children and parents were 12.7 (SD=4.1) and 43.9 (SD=4.1), respectively. The age distribution (showing each participant’s mean age across measurement occasions) is given in Supplementary Figure S1.

Phenotype data

Intelligence was assessed longitudinally using the Revised Amsterdam Children Intelligence Test (RAKIT20), Wechsler Intelligence Scale for Children (WISC21, 22, 23), Raven’s Standard and Advanced Progressive Matrices (SPM, APM24, 25), and the Wechsler Adult Intelligence Scale (WAIS26, 27), the choice of test being largely dependent on participants’ age. A previous study employing the same data set demonstrated a high genetic stability of intelligence scores as assessed by the different tests (the autoregressive coefficients between the additive genetic factors at consecutive measurement occasions ranging from 0.8 to 1).28 Therefore, the individuals’ mean scores across the different ages were used as the measure of the phenotype. The IQ scores were derived from the age- and sex-appropriate norms for the RAKIT, WISC, or WAIS, subsequently converted to z-scores within each measurement occasion, and averaged over measurement occasions. For the 154 participants for whom only Raven’s Matrices scores were available, we used z-transformed scores on Raven’s Matrices. The distribution of intelligence scores is given in Supplementary Figure S2.

Genotype data

Blood and/or buccal samples for DNA extraction were collected as part of several projects within the NTR. Genotyping was performed using the Affymetrix Human SNP Array 6.0. Genotypes were called using the BIRDSEED V2 algorithm. SNPs in Hardy-Weinberg equilibrium (P>0.00001) with a minor allele frequency exceeding 0.01 and a missingness rate below 5% were included in the analyses. Samples were selected if their call rate exceeded 95% and were checked for Mendelian errors, excessive heterozygosity (−0.1<F<0.1), and discrepancies in relatedness.29 Genotypes displaying Mendelian inheritance errors were excluded from the analyses.

For the present study, we selected all genotyped SNPs from the 50 genes of interest, including a 5-kb border around each gene. In all, 7 out of the 50 genes contained no genotyped SNPs. The distribution of the SNPs (1227 in total) over the remaining 43 genes is shown in Table 1. The full list of SNPs is given in Supplementary Table S1. In all, 0.85% of the SNPs were in exonic regions, and of these, 61.3% were non-synonymous.

Analyses

SNP-based analyses

As a first step, we tested for an association between the phenotype and each of the 1227 SNPs. As the observations were clustered in families, the analyses were performed using a multilevel regression model with random intercepts to account for the within-family covariance structure. Specifically, the model for the phenotype of person i in family j was $ph_{ij} = b_{0j} + b_1 \cdot SNP_{ij} + res_{ij}$, where ph denotes the phenotype, $b_{0j}$ is the intercept in family j, $b_1$ is a (fixed) slope parameter, and $res_{ij}$ denotes an individual-specific residual term. The intercept term can be further decomposed as $b_{0j} = g_0 + k_{0j}$, where $g_0$ is a fixed component and $k_{0j}$ is a component that is random over families. Using random intercepts prevents the inflation of type I error associated with applying a standard (fixed-effects) regression model to family-clustered data. The within-family genetic covariance structure was approximately compound symmetric (ie, the expected genetic correlation between any given type of relatives was 0.5). The analyses were implemented using the 'nlme' package in R.30 The code used to carry out the analyses is given in Script S1.
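As a brief worked illustration (not taken from Script S1), the random-intercept model above implies a compound-symmetric covariance structure within families. The variance symbols $\sigma^2_k$ (variance of the family component k0j) and $\sigma^2_{res}$ (variance of the residual resij) are introduced here for illustration only, and the two components are assumed to be independent. Conditional on the SNP, for two different members i and i' of the same family j:

```latex
\begin{align}
\operatorname{Var}(ph_{ij}) &= \sigma^2_k + \sigma^2_{res}, \\
\operatorname{Cov}(ph_{ij},\, ph_{i'j}) &= \sigma^2_k \qquad (i \neq i'), \\
\operatorname{Corr}(ph_{ij},\, ph_{i'j}) &= \frac{\sigma^2_k}{\sigma^2_k + \sigma^2_{res}} .
\end{align}
```

Every pair of relatives within a family thus receives the same model-implied correlation, which is why the sample was restricted to relatives with an equal expected genetic correlation.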

Additionally, we performed association testing using the Plink software package.31 Here the association between the phenotype and each of the 1227 SNPs was examined using the Huber-White sandwich variance estimator to account for the family structure in the data. The results were compared with those obtained using the multilevel regression model in R. A high degree of correspondence between the results obtained using the multilevel regression model (which effectively assumes an AE background covariance structure among first-degree relatives) and those obtained using the Huber-White sandwich estimator (which corrects for relatedness without assuming a background model) would imply that any background misspecification in the random effects model has not affected the conclusions. A high degree of correspondence is expected, because the test of a fixed effect in the multilevel regression model is fairly robust to possible background misspecification.32

To empirically evaluate the results obtained for the 1227 SNPs, we drew a number of random samples of: (a) 1227 SNPs from the entire genome, (b) 1227 SNPs from intragenic regions of the genome, and (c) 43 genes (including all SNPs on those genes) from the autosomal genome. All samples excluded the 1227 SNPs of interest. Each of the random samples was subjected to the analyses described above. The resulting QQ plots and genomic inflation factors (λ) were compared to those obtained for the 1227 SNPs of interest.

As additional verification of the results, permutation was employed to generate an empirical distribution of λ values under the null hypothesis of no association. The genotypes (ie, the 1227 SNPs) were randomly reallocated over the phenotypes 1000 times, and each of the 1000 permuted data sets was analyzed using the random intercept multilevel regression model described above. To account for the background covariance structure arising from the clustering of data in families, family data were reallocated jointly: the genotypes of any two-member family were reassigned to the phenotypes of another randomly selected two-member family, and the same was done for three- and four-member families. Thus, the family structure in the permuted data sets remained intact. As in the original analyses, the family structure was subsequently corrected for using a multilevel model. The null distribution of λ values generated using the permuted data sets was compared to the λ obtained for the 1227 SNPs of interest.
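The family-wise reallocation described above can be sketched as follows. This is a minimal Python illustration, not the authors' script: the function name, the toy family identifiers, and the choice to shuffle families within each size class are assumptions, and the matching of individual members within a family is left out because the text does not specify it.

```python
import random
from collections import defaultdict


def permute_genotypes_by_family(family_sizes, rng):
    """Map each family (phenotypes) to a donor family (genotypes) of equal size.

    family_sizes: dict of family id -> number of members (hypothetical input).
    rng: a random.Random instance, so the shuffle is reproducible.
    """
    # Group families by their number of members.
    by_size = defaultdict(list)
    for family_id, size in family_sizes.items():
        by_size[size].append(family_id)

    donor_of = {}
    for families in by_size.values():
        shuffled = families[:]
        rng.shuffle(shuffled)  # shuffle only among families of the same size
        for recipient, donor in zip(families, shuffled):
            donor_of[recipient] = donor
    return donor_of


# Toy example with hypothetical family ids and sizes.
family_sizes = {"f1": 2, "f2": 2, "f3": 3, "f4": 3, "f5": 4, "f6": 2}
mapping = permute_genotypes_by_family(family_sizes, random.Random(1))
```

Each permuted data set would then be analyzed with the same random-intercept model as the observed data. Because donors and recipients always have the same number of members, the family structure is preserved in every permutation.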

Finally, a genome-wide association study was performed. Here, the phenotype was regressed on each of the available genotyped SNPs (538652 SNPs) using the Plink software package.31

All analyses were performed using an additive model and included eight genomic principal components33 as covariates to account for any possible effects of population stratification. All λ values were estimated as regression coefficients of the observed on the expected -log10 of the P-values, using the GenABEL package in R.30
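To make the definition of λ concrete, the sketch below regresses the observed on the expected -log10 P-values. It is a Python illustration under stated assumptions, not the GenABEL implementation: the expected P-value of the i-th smallest of n values is taken to be i/(n+1), and the regression is forced through the origin; the function name is hypothetical.

```python
import math


def lambda_from_pvalues(pvalues):
    """Slope of observed on expected -log10(P), regression through the origin."""
    n = len(pvalues)
    # Sort both series so that the k-th observed value is paired with the
    # k-th expected value.
    observed = sorted(-math.log10(p) for p in pvalues)
    expected = sorted(-math.log10(i / (n + 1)) for i in range(1, n + 1))
    numerator = sum(o * e for o, e in zip(observed, expected))
    denominator = sum(e * e for e in expected)
    return numerator / denominator


# P-values that follow the null expectation exactly give a slope of 1.
n = 1000
null_like = [i / (n + 1) for i in range(1, n + 1)]
lambda_null = lambda_from_pvalues(null_like)

# Squaring every P-value doubles each -log10(P), so the slope doubles.
lambda_inflated = lambda_from_pvalues([p ** 2 for p in null_like])
```

Under this definition, a value of λ above 1 means that the observed P-values are, on the -log10 scale, systematically more extreme than expected under the null hypothesis.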

Gene-based analyses

In the next step, the SNP-based P-values obtained using the multilevel regression model were used as input for gene-based analysis. A gene-based association test that employs the extended Simes procedure (GATES) was used.34 GATES involves jointly analyzing all available SNPs in a gene to obtain a single P-value associated with the gene. The method assumes that an association test between the phenotype and all available SNPs on the gene has been carried out, and that the resulting P-values and pair-wise allelic correlation coefficients r for all SNPs are available. In the present case, we used the P-values obtained in the SNP-based multilevel regression analysis, and pair-wise allelic correlation coefficients obtained using the --r option in Plink. Given m SNPs on a gene, a gene-based P-value is obtained through an iterative procedure by combining the ascendingly ordered m P-values in the following way: $P_G = \min_j \left( m_e \, p_{(j)} / m_{e(j)} \right)$, where $m_e$ is the effective number of independent P-values among the m SNPs, $m_{e(j)}$ is the effective number of independent P-values among the top j SNPs (j=1, …, m), and $p_{(j)}$ is the j-th lowest P-value (ie, the P-value associated with the j-th top SNP). The null hypothesis of this gene-based test is that none of the SNPs are associated with the phenotype; the alternative is that at least one SNP is associated. The effective number of independent P-values among the m SNPs, $m_e$, is estimated as $m_e = m - \sum_{i=1}^{m} I(\lambda_i > 1)(\lambda_i - 1)$, where I(x) is an indicator function and $\lambda_i$ is the i-th eigenvalue of the m×m correlation matrix (ρ) of the P-values obtained in the SNP-based association test (these eigenvalues are distinct from the genomic inflation factor λ). The pair-wise P-value correlation coefficient, $\rho_{ij}$, can be approximated by a 6th order polynomial function of the allelic correlation coefficient $r_{ij}$: $\rho_{ij} = 0.2982 r_{ij}^6 - 0.0127 r_{ij}^5 + 0.0588 r_{ij}^4 + 0.0099 r_{ij}^3 + 0.6281 r_{ij}^2 - 0.0009 r_{ij}$, where $\rho_{ij}$ and $r_{ij}$ are the ij-th elements of the SNP P-value correlation matrix ρ and of the allelic correlation matrix r, respectively.
For a full overview of the method, we refer the reader to the original publication34 and to Script S2, which contains our implementation of GATES in R. The R script performs the test k times given k genes in the input file.
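The following Python sketch illustrates the steps of the procedure described above for a single gene. It is not Script S2. The effective number of independent P-values is computed as m minus the summed excess of eigenvalues above 1, following the GATES definition; setting the diagonal of ρ to 1 is an assumption made here, and all function names are hypothetical. To keep the example self-contained, eigenvalues are obtained with a small Jacobi rotation routine; in practice a numerical library's symmetric eigenvalue solver would normally be used.

```python
import math

# Polynomial coefficients quoted in the text, as (power, coefficient) pairs.
COEFFS = [(6, 0.2982), (5, -0.0127), (4, 0.0588),
          (3, 0.0099), (2, 0.6281), (1, -0.0009)]


def pvalue_correlation(r):
    """Approximate the P-value correlation matrix from allelic correlations."""
    m = len(r)
    rho = [[sum(c * r[i][j] ** k for k, c in COEFFS) for j in range(m)]
           for i in range(m)]
    for i in range(m):
        rho[i][i] = 1.0  # assumption: a correlation matrix has a unit diagonal
    return rho


def symmetric_eigenvalues(matrix, sweeps=100, tol=1e-20):
    """Eigenvalues of a symmetric matrix via cyclic Jacobi rotations."""
    n = len(matrix)
    a = [row[:] for row in matrix]
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(a[p][q]) < 1e-15:
                    continue
                theta = (a[q][q] - a[p][p]) / (2.0 * a[p][q])
                t = math.copysign(1.0, theta) / (abs(theta) + math.sqrt(theta * theta + 1.0))
                c = 1.0 / math.sqrt(t * t + 1.0)
                s = t * c
                for k in range(n):  # rotate columns p and q
                    akp, akq = a[k][p], a[k][q]
                    a[k][p], a[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(n):  # rotate rows p and q
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
    return [a[i][i] for i in range(n)]


def effective_number(rho):
    """m minus the summed excess of eigenvalues above 1."""
    eigenvalues = symmetric_eigenvalues(rho)
    return len(rho) - sum(lam - 1.0 for lam in eigenvalues if lam > 1.0)


def gates(pvalues, r):
    """Gene-based P-value: min over j of m_e * p_(j) / m_e(j)."""
    rho = pvalue_correlation(r)
    order = sorted(range(len(pvalues)), key=lambda i: pvalues[i])
    m_e = effective_number(rho)
    best = 1.0
    for j in range(1, len(order) + 1):
        top = order[:j]  # indices of the j SNPs with the lowest P-values
        sub = [[rho[x][y] for y in top] for x in top]
        best = min(best, m_e * pvalues[top[-1]] / effective_number(sub))
    return best


# Three uncorrelated SNPs: the test reduces to the ordinary Simes procedure.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
p_independent = gates([0.04, 0.5, 0.01], identity)

# Two SNPs in complete LD with the same P-value.
p_linked = gates([0.02, 0.02], [[1.0, 1.0], [1.0, 1.0]])
```

In the first toy case the gene-based P-value is 3 × 0.01 / 1 = 0.03, the Simes result for three independent tests. In the second, the two fully correlated SNPs count as roughly one effective test, so the gene-based P-value stays at 0.02 rather than being doubled.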

Additionally, we performed a gene-based association test using the Versatile Gene-Based Test for Genome-wide Association Studies (VEGAS),35 and compared the results to those obtained using GATES. VEGAS is a simulation-based method that uses information from the full set of SNPs within a gene and accounts for linkage disequilibrium (LD) by using simulations from the multivariate normal distribution. The analyses were performed using the VEGAS web-interface.35

Results

SNP-based analyses

Association between intelligence scores and each of the 1227 SNPs (see Materials and Methods) was examined using an additive model and eight principal components33 to account for the possible effects of population stratification (Script S1). The left panel of Figure 1 shows a quantile-quantile (QQ)-plot, including 95% confidence intervals (CIs), of the association P-values (also see Supplementary Figures S3 and S4, and Supplementary Table S2). The CI estimates were obtained while taking into account the LD structure between the SNPs: instead of N=1227, we used an estimate of the effective number of independent P-values (N=625). This approach produces relatively broader CIs; we thus adopt a more stringent approach to evaluate the significance of the difference between the expected and the observed distributions. As evident from the figure, the distribution of the observed P-values differs markedly from that expected under the null hypothesis of no effect, indicating an enrichment of the 43 candidate genes for polymorphisms associated with intelligence. Note that the significant inflation along nearly the entire length of the QQ plot (genomic inflation factor λ=1.26) is not necessarily indicative of population stratification, particularly in the context of a candidate SNP study. Here, the observed inflation is expected under the alternative hypothesis of (polygenic) effects of a relatively large number of the candidate SNPs tested.36 As the analyses were performed while adjusting for eight principal components (seven of which were correlated with geographic latitude and longitude in the present sample,33 thereby feasibly representing differences in ancestry), population stratification does not appear to be a likely cause of the inflation.

Figure 1

Left: QQ-plot based on the 1227 candidate SNPs. Right: Genome-wide QQ-plot based on 538652 SNPs. Dashed lines: 95% confidence intervals (CIs).

To empirically verify the finding and confirm the absence of population stratification, we performed SNP-based association testing on samples of SNPs drawn randomly from the genome. In particular, we drew 1000 random samples of: (a) 1227 SNPs from the entire genome, (b) 1227 SNPs from intragenic regions of the genome, and (c) 43 genes (including all SNPs on those genes) approximately matched for size with the 43 candidate genes and sampled from the entire autosomal genome. All random samples excluded the 1227 SNPs of interest. The distributions of the λ values obtained for each set of random samples, along with the λ obtained for the 1227 SNPs of interest (marked by a horizontal line), are depicted in Figure 2. As evident from the figure (panels a and b), the effect found for the SNPs of interest did not replicate in any of the 2000 random samples obtained by sampling SNPs from the entire genome or from the intragenic regions of the genome. For SNPs residing on randomly sampled sets of 43 genes (panel c, Figure 2), only 3.6% of λ values exceed the λ obtained for the candidate SNPs. Note that the higher variance in panel c of Figure 2 relative to that in panels a and b is expected, given that the degree of non-independence of SNPs (ie, LD) is considerably higher among SNPs sampled from the same gene than among those sampled from the entire genome. A reduced effective number of independent SNPs is expected to result in a less precise estimate of λ, that is, in a higher dispersion around the mean λ value.

Figure 2

Distribution of genomic inflation factors (λ) obtained for 1000 (a) random samples of 1227 SNPs from the entire genome, (b) random samples of 1227 SNPs from intragenic regions of the genome, (c) random samples of 43 genes from the entire genome, and (d) permuted data sets. Horizontal line: λ obtained in the non-permuted data set for the 1227 SNPs of interest.

As further empirical verification, we performed permutation testing to obtain an empirical distribution of λ values under the null hypothesis of no association: the genotypes (ie, the 1227 SNPs of interest) were randomly reallocated over the phenotypes 1000 times, and each of the 1000 permuted data sets was analyzed using SNP-based association testing. The resulting distribution of λ values and the λ obtained for the non-permuted data set (λ=1.26) are shown in panel d of Figure 2. Here, only 2.9% of the λ values exceed the λ value of interest, yielding an empirical P-value (0.029) consistent with that obtained from random sampling.
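The empirical P-values reported here are simply the proportion of null λ values that exceed the observed one. A minimal Python sketch follows, with a hypothetical function name and made-up null values chosen only to reproduce the reported proportion; they are not the study's λ values.

```python
def empirical_p(null_values, observed):
    """Proportion of null values strictly greater than the observed value."""
    exceeding = sum(1 for value in null_values if value > observed)
    return exceeding / len(null_values)


# Illustrative null distribution: 29 of 1000 values lie above 1.26.
toy_null = [1.30] * 29 + [1.00] * 971
p_permutation = empirical_p(toy_null, observed=1.26)
```

With 1000 permutations, 29 exceedances correspond to an empirical P-value of 0.029. A common, slightly more conservative variant adds 1 to both the numerator and the denominator; the proportions quoted in the text correspond to the uncorrected version shown here.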

Finally, a genome-wide association analysis was performed. Here, the phenotype was regressed on each of the available genotyped SNPs (538652 SNPs). The resulting QQ plot is depicted in the right panel of Figure 1. As evident from the figure, the genome-wide P-values in the right panel show no notable inflation (λ=1.03), in contrast to the left panel (λ=1.26).

The present results thus consistently indicate an enrichment of the candidate set of genes for polymorphisms associated with intelligence, while plausibly ruling out population structure as the cause of the observed effect. The former is demonstrated by the significant inflation of the association P-values for the candidate set of SNPs as compared with random subsets of SNPs (empirical P=0.036) and with a permutation-based null distribution (empirical P=0.029). The latter is established by (a) the inclusion of genetic principal components in the association analyses, (b) the near absence of comparable P-value inflation in randomly selected sets of SNPs, and (c) the absence of genome-wide P-value inflation.

Gene-based analyses

Next, gene-based testing was carried out (see Materials and Methods and Script S2). The full list of gene-based results is given in Supplementary Table S3. Genes ELP2 (P=0.007), TMEM135 (P=0.007), PRMT10 (P=0.019), and RGS7 (P=0.044) displayed the strongest associations, although no association was significant after correction for multiple testing (with α=0.05/43). Notably, 2 out of the 50 genes from the Najmabadi et al11 study harbor more than one mutation associated with cognitive disabilities; one of those genes is ELP2, which, in the present study, shows the strongest evidence of association.

Focusing on the four nominally significant genes, we examined the positions of the most strongly associated SNPs in these genes relative to the mutations in Najmabadi et al11 (Supplementary Figure S5). As evident from the figure, both mutations in ELP2, as well as the mutations in TMEM135 and PRMT10, are relatively close to our top SNP for their respective genes; the distances range from 4.8 kb to 31.4 kb. On RGS7, the distance between the mutation and the top SNP is relatively large (535.7 kb). Note, however, that any distance between the disease-causing mutation and our top SNP is consistent with the logic of the present study, as the gene is viewed as a functional unit with regard to its etiological relevance to intelligence, regardless of the distribution of the functionally relevant polymorphisms along the gene.

For validation, both the gene-based analyses and the SNP-based analyses were performed using several different methods (see Materials and Methods). The results obtained using the different methods converged highly: the log10 of the P-values obtained using two methods of SNP-based testing correlated 0.88, and the P-values obtained using two different gene-based tests correlated 0.89 (Supplementary Table S3).

Discussion

The present study focused on 43 genes implicated in autosomal recessive cognitive disorders in consanguineous Iranian families,11 and found these to be enriched for polymorphisms associated with normal-range intelligence in a Dutch population-based sample. This is a demonstration of the relevance of genes implicated in monogenic disorders of cognitive ability to continuous variability in intelligence. Despite the high heritability of intelligence,12, 28, 37, 38 the progress in the identification of loci consistently associated with variation in its normal range has thus far been limited.15, 17, 38, 39, 40, 41, 42 Exceptions are the apolipoprotein E (APOE) gene at older ages43 and formin binding protein 1-like (FNBP1L), the latter having recently been shown to be associated with both childhood and adulthood intelligence.15, 17 The present approach utilizes the idea that the differentially sized effects of individual mutations located within a gene functionally relevant to the phenotype may range from severe disruptions of protein functioning (resulting in a Mendelian disorder) to smaller effects underlying polygenic variation. Utilizing prior knowledge on genetics of Mendelian disorders may therefore prove a valuable approach to the identification of genetic variability underlying polygenic traits, with the advantage of requiring considerably smaller sample sizes than GWAS to achieve adequate power. This may prove especially useful in the study of phenotypes for which large samples are difficult to obtain, for instance because the phenotype is difficult or costly to measure (eg, neuropsychological or fMRI measures), and/or in detection of genetic variants characterized by small effect sizes. 
For instance, in the present study we clearly demonstrate enrichment, although none of the P-values for individual SNPs fall below the Bonferroni-corrected significance threshold (α=0.05/1227=0.00004, or α=0.05/625=0.00008 if one corrects by the number of independent SNPs34), indicating that the magnitudes of individual SNP effects are too small to be detected in regular GWAS.

Although larger sample sizes are needed to identify the exact genes and genetic variants driving the association in the present study, we focus on the top four genes that reach nominal significance. The most strongly associated gene, ELP2 (elongator complex protein 2), encodes a subunit of the RNA polymerase II elongator complex,44 involved in acetylation of histones H3 and probably H4 and possibly in chromatin remodeling. TMEM135 (transmembrane protein 135) is involved in fat metabolism and energy expenditure.45 PRMT10 (protein arginine methyltransferase 10) affects chromatin remodeling leading to transcriptional regulation, RNA processing, DNA repair, and cell signaling.46 RGS7 (regulator of G-protein signaling 7) interacts with 14-3-3 protein, tau, and snapin (a component of the SNARE complex required for synaptic vesicle docking and fusion).47

The utilization of knowledge on monogenic disorders to identify polymorphisms that affect the variability of continuous phenotypes is a cost-efficient approach to understanding the genetic variability underlying polygenic traits. At present, the causal variants for a large number of monogenic disorders have been identified (over 3000 disorders; Online Mendelian Inheritance in Man (OMIM): http://www.ncbi.nlm.nih.gov/omim), and recent developments in sequencing technologies have made it possible to employ exome sequencing or whole-genome sequencing, possibly in combination with homozygosity mapping, as an efficient approach to identifying novel causal variants underlying Mendelian disorders.48, 49, 50 The National Human Genome Research Institute has opened Centers for Mendelian Genomics (NHGRI Genome Sequencing Program, http://www.genome.gov/), whose primary goal is the discovery of as yet unknown variation underlying Mendelian disorders. Thus, at present, the utilization of existing and impending knowledge on variants underlying Mendelian disorders to identify the variation underlying polygenic traits may prove a viable, efficient, and cost-effective complement to standard approaches such as GWAS. The present finding highlights the importance of continuing the efforts directed at studying monogenic diseases50, 51 at a time when focus has shifted away from them, as they can advance our understanding of multifactorial traits.