Main

Ankylosing spondylitis is a common cause of inflammatory arthritis, with a prevalence of 5 per 1,000 in European populations1. It is characterized by inflammation of the spine and sacroiliac joints causing pain and stiffness and ultimately new bone formation and progressive joint ankylosis. Hip and peripheral joint arthritis is common, and inflammation may also involve extra-articular sites such as the uveal tract, tendon insertions, proximal aorta and, rarely, the lungs and kidneys. The disease is strongly associated with the gene HLA-B27; however, only 1%–5% of HLA-B27-positive individuals develop ankylosing spondylitis, and there is increasing evidence to suggest that other genes must also be involved2,3,4,5. Association has previously been confirmed between ankylosing spondylitis and SNPs in IL23R at chromosome 1p31 and ERAP1 (previously known as ARTS-1) at chromosome 5q15 (ref. 6), and linkage has been demonstrated at genome-wide significance to chromosome 6p (where HLA-B is encoded) and chromosome 16q (lod score 4.7)7. We report here the first genome-wide association study (GWAS) for ankylosing spondylitis.

To identify ankylosing spondylitis susceptibility genes, we performed a GWAS in a sample of ankylosing spondylitis cases among Australian, British and North American individuals of European descent (n = 2,053 in the final data set), using data from previously genotyped, ethnically matched British and North American individuals as controls (n = 5,140). Cases were genotyped with Illumina HumHap370 genotyping chips; 288,662 SNPs were available for study that were common to case and all control data sets after quality-control filtering (see Online Methods). After data cleaning, a modest overall inflation of test statistics remained, with a genomic inflation factor (λ) of 1.06 (ref. 8), excluding SNPs in the MHC (Supplementary Fig. 1). We then genotyped a total of 163 SNPs in a replication cohort of 898 British ankylosing spondylitis cases and 1,518 unselected British controls. The SNPs genotyped included 49 ancestry-informative SNPs and 114 SNPs in 105 chromosomal regions selected from the discovery sample on the basis of their strength of association in that sample and because of close proximity to genes of biologically plausible involvement in ankylosing spondylitis (Supplementary Table 1). Of the confirmation SNPs, 102 markers from 95 regions passed quality control filters and are reported here.

As expected, SNPs in the MHC on chromosome 6p were strongly associated with ankylosing spondylitis (rs7743761 P = 5.0 × 10−304). Association was evident across a very broad region surrounding the MHC, including five SNPs lying in a 153-kb region at 26.0–26.1 Mb from the p-telomere (5.4 Mb from HLA-B), which achieved P < 10−5. The most associated SNP in this region was rs3734523 (P = 1.6 × 10−6). However, conditional logistic regression analysis suggested that this was unlikely to represent a separate independent association because conditioning on five of the most significant SNPs from the MHC (rs7743761, rs2596501, rs3915971, rs2516509, rs1265112) caused the association to disappear (P = 0.27).

Excluding the MHC and surrounding regions, 25 SNPs from four independent loci were significantly associated with ankylosing spondylitis, including the known ankylosing spondylitis–associated genes ERAP1 and IL23R, and two new loci, chromosomes 2p15 and 21q22 (Table 1 and Supplementary Fig. 2). We also observed strong association within two more genes, ANTXR2 and IL1R2, with support in both the discovery and confirmation data sets.

Table 1 Genome-wide significant loci typed in both discovery cohort and replication study

Both non-MHC genes previously associated with ankylosing spondylitis, ERAP1 and IL23R, were significantly associated in this data set. The most strongly associated SNPs were rs30187 (P = 2.6 × 10−11) and rs11209026 (P = 9.1 × 10−14), confirming the strong association observed for these SNPs in the initial discovery set6.

We used SNP imputation to investigate association strength at untyped markers of the six non-MHC loci associated with ankylosing spondylitis. Considering IL23R, only marginally stronger association was observed with one imputed SNP (rs11465817, P = 1.2 × 10−10) than with the strongest associated genotyped SNP, rs11209026 (P = 2.3 × 10−9) (Fig. 1a). IL23R has 11 exons, with marker rs11209026 encoding an R381Q substitution in exon 9, and rs11465817 falling in intron 9, suggesting that this is the critical region involved in the association of IL23R with ankylosing spondylitis.

Figure 1: SNP association plots for ankylosing spondylitis–associated regions.

Discovery cohort association significance is plotted against the left-hand y axis as −log10 (P-value). Genetic coordinates are as per NCBI dbSNP genome build 128 (October 2007). Filled circles, genotyped SNPs; open diamonds, imputed SNPs; color scale, LD; purple dotted line and right y axis, recombination rate (cM/Mb as per HapMap data). Positions of gene exons and ESTs are indicated below the x axis, with their direction of transcription (gray arrows). (a) Chromosome 1p31 region. SNP association plot for a 295-kb region (67,325 kb to 67,620 kb) of chromosome 1. LD is in relation to marker rs11209026. (b) Chromosome 2p15 region. SNP association plot for a 295-kb region (62,300 kb to 62,595 kb) of chromosome 2. LD is in relation to marker rs10865331. (c) Chromosome 5q15 region. SNP association plot for a 258-kb region (96,000 kb to 96,258 kb) of chromosome 5. LD is in relation to marker rs30187. (d) Chromosome 21q22 region. SNP association plot for a 245-kb region (39,350 kb to 39,595 kb) of chromosome 21. LD is in relation to marker rs2242944.

In ERAP1, the imputed data revealed a block of SNPs lying in a 4.6-kb region between rs27529 (in exon 9) and rs469758 (in intron 12) achieving P < 10−11, more than 50 times more significant than any other imputed SNP (Fig. 1c). In this region, only marker rs30187 is coding (K528R). It has previously been demonstrated that rs30187 causes a significant reduction in aminopeptidase activity toward a synthetic peptide substrate as well as alterations in substrate affinity9. Molecular modeling of the ERAP1 protein suggests that Arg528 lies at the mouth of the putative enzyme substrate pocket, perhaps explaining the lower aminopeptidase activity of this genetic variant. ERAP1 variants also correlate significantly with expression. Strong cis-regulation of ERAP1 expression in lymphoblastoid cell lines was seen from SNPs close to and within ERAP1, including the marker rs30187 (C allele reduced expression, P = 0.00015)10. In our study, we saw no difference in ERAP1 expression in peripheral blood mononuclear cells (PBMCs) from ankylosing spondylitis cases compared with controls (Supplementary Table 2), suggesting that this is a less likely explanation of the mechanism of association of ERAP1 with ankylosing spondylitis.

Three SNPs at the 2p15 locus achieved genome-wide significance in the discovery set: rs10865331 (P = 6.1 × 10−15), rs10865332 (P = 3.5 × 10−10) and rs4672503 (P = 9.3 × 10−10). No imputed SNP was more significantly associated than rs10865331. In the replication study we genotyped two SNPs in this locus, both of them confirming the discovery set findings: rs4672495 (P = 8.4 × 10−4) and rs10865331 (P = 5.5 × 10−6). The combined level of association of these SNPs was highly significant: rs4672495 (P = 3.2 × 10−9) and rs10865331 (P = 1.9 × 10−19). Combining the imputed and genotyped data, there is a block of SNPs lying between marker rs10865331 and rs4672507 in tight linkage disequilibrium (LD) (r2 > 0.8) with >1,000 times stronger significance than any other SNP at this locus, encompassing a 23-kb region likely to contain the causative variant(s) responsible for the association observed (Fig. 1b). No genes are encoded within this region, the nearest gene to the most strongly associated marker rs10865331 being 100 kb distant (B3GNT2). We are not aware of this region being associated previously with any known disease. B3GNT2 encodes UDP-GlcNAc:betaGal beta-1,3-N-acetylglucosaminyltransferase 2, a protein not as yet known to have any immunological function.

At chromosome 21q22, three SNPs across an 11-kb region achieved genome-wide significance in the discovery cohort: rs2242944 (P = 2.7 × 10−14), rs2836878 (P = 4.9 × 10−12) and rs378108 (P = 6.1 × 10−11) (Fig. 1d). SNP rs2242944 also showed strong association in the confirmation cohort (P = 5.6 × 10−7) and in the combined analysis (P = 8.3 × 10−20). The nearest gene to the most strongly associated SNP, rs2242944, is 82 kb distant (PSMG1, proteasome assembly chaperone 1). This region has recently been associated with pediatric-onset inflammatory bowel disease (IBD), in which the most strongly associated SNP was rs2836878; positive association was seen with over-representation of the minor allele, as was the case in our ankylosing spondylitis data set (P = 4.1 × 10−10). This SNP is in strong LD with the strongest ankylosing spondylitis–associated marker, rs2242944 (r2 = 0.6, D′ = 1)11. Increased expression of PSMG1 was observed in colonic biopsies from IBD cases, and it was suggested that this may be the gene involved at this locus. Ankylosing spondylitis and IBD are closely related conditions, with 70% of those with ankylosing spondylitis having microscopic terminal ileitis resembling Crohn's disease12 and 10% of those with IBD having ankylosing spondylitis. Crohn's disease and ankylosing spondylitis are each associated with IL23R SNPs, and it is likely that further shared genetic susceptibility factors exist. We saw strong association even among those cases with no clinical IBD (n = 1,159 cases, rs2242944, P = 1.3 × 10−9), indicating that the association was present even in cases of primary ankylosing spondylitis in the absence of clinically manifest IBD.

PSMG1 was not differentially expressed in PBMCs from cases with active ankylosing spondylitis compared with healthy controls (Supplementary Fig. 3), nor in relationship to ankylosing spondylitis–associated chromosome 21q22 SNPs. A large recombination hotspot lying between PSMG1 and the ankylosing spondylitis–associated SNPs makes it unlikely that the association signal observed is due to effects from SNPs located in or close to PSMG1. We feel that its remoteness to the associated locus, absence of differential expression with disease, and lack of evidence of a relevant biological function make it an unlikely candidate to be directly involved in ankylosing spondylitis susceptibility. Rather, we hypothesize that the chromosome 2p15 and 21q22 regions harbor either noncoding RNA species or hitherto unreported protein-coding genes that are likely to be involved in susceptibility to ankylosing spondylitis. To investigate this further, we performed a transcriptome-wide profiling study of expressed sequence tags and small RNAs derived from PBMCs from four active ankylosing spondylitis cases and three healthy controls using Illumina's deep sequencing approach. No small regulatory RNAs such as microRNAs were seen within the regions of highly associated SNPs at either locus, although, consistent with recent findings13, these were identified in association with transcription start sites of flanking genes outside the disease-associated region (Supplementary Fig. 4). At both loci, we identified sequence tags derived from long RNAs. These either represent long mRNA-like noncoding RNA species or, alternatively, previously undescribed mRNA isoforms originating from distal promoters of adjacent protein-coding genes.

Fourteen SNPs in a 61-kb region encompassing IL1R2 achieved nominal significance, with the strongest association observed with genotyped markers at rs2310173 (P = 8.3 × 10−6) and with imputed markers at rs10185424 (P = 5.4 × 10−6) (Supplementary Fig. 3a). Marker rs2310173 was also associated with ankylosing spondylitis in the replication study (P = 0.018) and showed a high level of significance in the combined analysis (P = 4.8 × 10−7). IL-1R2 is cleaved from cell membranes, possibly by ERAP1 (ref. 14) and acts as a decoy receptor, interfering with the binding of IL-1 to IL-1RI. One possible explanation for the associations of ERAP1 and IL1R2 with ankylosing spondylitis is that the disease-associated genetic variants affect cleavage of IL-1R2 from the cell surface. In this respect, we note that several SNPs in TNFRSF1A achieved moderate levels of association in the discovery set (strongest associated SNP, rs1800693, P = 6.9 × 10−5). TNFRSF1A encodes tumor necrosis factor receptor 1, which may also be cleaved from the cell surface by ERAP1 (ref. 15). No support for this association was seen in our replication study, but SNPs in TNFRSF1A have been associated with both ulcerative colitis and Crohn's disease previously16,17, providing some support for this association with ankylosing spondylitis. Tumor necrosis factor overexpression in mice leads to inflammatory bowel disease and to sacroiliitis resembling ankylosing spondylitis, and is dependent on expression of TNFRSF1A (ref. 18).

ANTXR2, recessive mutations of which cause juvenile hyaline fibromatosis (MIM228600) and infantile systemic hyalinosis (MIM236490), encodes capillary morphogenesis protein-2 (CMP2). The SNP rs4333130 was associated with ankylosing spondylitis in both the discovery cohort (P = 7.5 × 10−7) and replication cohort (P = 0.029) as well as overall (P = 9.3 × 10−8). In the imputed data set, no markers were more strongly associated (Supplementary Fig. 3b). A functional explanation for this association with ankylosing spondylitis is not clear.

The power of this study to detect small to moderate genetic effects was modest. We calculate that the discovery phase of the study has 2%–21% power to identify SNPs conferring an additive allelic odds ratio of 1.2 with minor allele frequencies of 0.1–0.5 at α = 5 × 10−7, assuming D′ = 0.9 and the marker and disease-associated allele frequencies are equal. Further GWAS with larger sample sizes will therefore be useful and likely to identify more genes associated with ankylosing spondylitis. The identification of four genetic loci newly associated with ankylosing spondylitis extends our understanding of the genetic etiology of this disorder and provides an important foundation for future hypothesis-driven research into the pathogenesis of this common and debilitating condition.

Methods

Study participants.

The discovery sample population included 69 Australian, 1,129 British and 983 North American individuals as ankylosing spondylitis cases, and 5,847 controls derived from the 1958 British Birth Cohort genotyped by the Wellcome Trust Case-Control Consortium (n = 1,436), and the Illumina iControlDB database (n = 4,149). HapMap CEU (Utah residents with Northern and Western European ancestry), YRI (Yorubans from Ibadan, Nigeria), JPT (Japanese from Tokyo) and CHB (Chinese from Beijing) samples (n = 262) were added to the controls to use for quality control. Ankylosing spondylitis was defined according to the modified New York criteria19. After quality-control checks including assessment of cryptic relatedness, ethnicity and genotyping quality, 2,053 ankylosing spondylitis cases and 5,140 controls were available for analysis. The confirmation cohort included 898 British ankylosing spondylitis cases and 1,518 unselected controls. The confirmation healthy controls were provided by the Avon Longitudinal Study of Parents and Children, the makeup and genotyping of which are reported elsewhere20. All cases were of self-reported white European ancestry. All case and control participants gave written, informed consent, the study was approved by the relevant research ethics authorities at each participating center and overall approval was given by the University of Queensland Research Ethics Committee.

Genotyping.

Genotyping of the discovery cohort was performed using Illumina Infinium II HumHap370CNV chips at the Diamantina Institute. Genotype clustering was performed using Illumina's BeadStudio software; all SNPs with quality scores < 0.15, and all individuals with <98% genotyping success, were excluded. We manually inspected cluster plots from the 500 most strongly associated non-MHC loci and the 100 most strongly associated MHC loci, and excluded poorly clustering SNPs from analysis. For the replication study, genotyping was performed either by iPLEX technology (MassArray, Sequenom) or by KASPar technology (Kbiosciences).

Illumina iControlDB genotypes included data from eight different Illumina chip types (HapMap370 DUO, Human1Mv1, Human610-Quadv1, HumanHap300, HumanHap550-2v3, HumanHap550v1, HumanHap550v3, HapMap370 Single). All SNPs were then converted to TOP strand using information in the Illumina manifest files and further corrected for any updates caused by changes between the NCBI35 and NCBI36 genome builds. (iControlDB SNPs are on the forward strand; Wellcome Trust Case-Control Consortium controls and genotyped cases are on the TOP strands.) The case and control genotypes were then merged using PLINK (http://pngu.mgh.harvard.edu/purcell/plink). A subset of SNPs common to all chip types was then extracted, leaving 301,866 SNPs (from a maximum of 317,502 available). Related individuals were then excluded (n = 410 samples); these included 62 cases and 135 HapMap quality-control samples that were related members of family trios. Outliers in a plot of heterozygosity versus missingness were then removed (n = 122, including 29 cases). SNPs with minor allele frequency < 1% (n = 135) or missingness > 5% (n = 2,226) and those not in Hardy-Weinberg equilibrium (P < 10−7, n = 11,337) were then removed, leaving 288,662 SNPs in total.
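The three SNP-level filters described above (minor allele frequency, missingness and Hardy-Weinberg equilibrium) can be illustrated with a minimal Python sketch. This is not the PLINK implementation used in the study: the function names are hypothetical, genotype counts are assumed to be available per SNP, and a simple 1-df chi-square test stands in for whatever Hardy-Weinberg test the software applies. Only the thresholds (1%, 5%, 10−7) come from the text.

```python
import math


def minor_allele_frequency(n_aa, n_ab, n_bb):
    """Minor allele frequency from genotype counts (AA, AB, BB)."""
    n = n_aa + n_ab + n_bb
    freq_a = (2 * n_aa + n_ab) / (2 * n)
    return min(freq_a, 1 - freq_a)


def hwe_chi2_p(n_aa, n_ab, n_bb):
    """P-value of a 1-df chi-square test of Hardy-Weinberg proportions.

    Assumption: a chi-square approximation, used here for illustration only.
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2))


def passes_qc(n_aa, n_ab, n_bb, n_missing,
              min_maf=0.01, max_missing=0.05, min_hwe_p=1e-7):
    """Apply the thresholds quoted in the text to one SNP."""
    called = n_aa + n_ab + n_bb
    if called == 0:
        return False
    missing_rate = n_missing / (called + n_missing)
    return (minor_allele_frequency(n_aa, n_ab, n_bb) >= min_maf
            and missing_rate <= max_missing
            and hwe_chi2_p(n_aa, n_ab, n_bb) >= min_hwe_p)
```

A SNP can fail more than one filter at once, so the per-filter counts need not sum to the total number of SNPs removed.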

To detect and correct for population stratification we used the EIGENSTRAT software8. We first excluded the 24 regions of long-range LD including the MHC identified in ref. 21 before running the principal components analysis, as suggested by the authors. We observed no spurious associations or clustering within the first ten eigenvectors associated with case/control, chip type, or origin of the cases and controls. Controlling for more than four principal components did not alter the genomic inflation factor, so genome-wide analysis was performed controlling for these components only. To further validate our findings, we ran PLINK and PLINK permutation on a subset of the data, containing 1,905 cases and 3,885 controls, from which ethnic outliers had been excluded. In these analyses, λ = 1.16 with the MHC excluded. We obtained similar results to those obtained using EIGENSTRAT, with P-values of similar magnitude for the regions described here; no additional informative associations were found.
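The genomic inflation factor λ referred to above is conventionally computed as the median of the observed 1-df chi-square association statistics divided by the median expected under the null (about 0.455). The sketch below assumes that definition and takes a list of two-sided P-values as input; the function name is hypothetical and the code is not taken from EIGENSTRAT or PLINK.

```python
from statistics import NormalDist, median

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def genomic_inflation(p_values):
    """Genomic inflation factor from two-sided P-values in (0, 1]."""
    nd = NormalDist()
    # p/2 is the lower-tail probability of -|z|; squaring gives chi-square.
    chi2 = [nd.inv_cdf(p / 2) ** 2 for p in p_values]
    return median(chi2) / CHI2_1DF_MEDIAN
```

Under this definition, values close to 1 indicate little residual stratification; SNPs in regions of known strong association (here the MHC) would be dropped from `p_values` before the calculation.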

We used a model containing 49 SNPs to predict ethnicity in the confirmation subjects (see Supplementary Table 3). The model was constructed to remove 'obvious' ethnic outliers of Asian and African origin with few false negatives. Development of this model is described briefly here and involves variable selection, training, validation and prediction. Ethnicity-informative SNPs were selected from genotype data of 166 HapMap samples of CEU, YRI, CHB and JPT origin with 3,698 SNPs. These SNPs were obtained from the Alfred database (see URLs) and from a list of SNPs indicated by the principal components analysis of our discovery samples to be associated with population stratification. We then applied a Bayesian variable selection algorithm22 encoded with the RchipLite software developed by CSIRO Bioinformatics (CSIRO Mathematical and Information Sciences, unpublished data) to reduce the number of ethnicity-informative SNPs to 49. A predictive model was constructed from these 49 SNPs using a Random Forest algorithm trained with 250 samples selected from our discovery set, which from principal components analysis consisted of individuals with European, Asian or African ancestry. The model was tested on an independent set of 1,044 samples genotyped on HapMap 650Y chips found in the Illumina iControlDB database labeled with 52 ethnic groups (Supplementary Fig. 5). We found the model to be suitable for our objective of reliably excluding samples of predominantly Asian or African descent. Similar results can be obtained with fewer SNPs and more selective models can be constructed using this model, though these have a higher proportion of false positives. When we applied the model to our confirmation set, we found 2 samples of Asian origin and 11 of African origin, which were subsequently excluded.

Replication study markers were selected from SNPs that achieved P < 10−4. Nine SNPs were typed in the gene TNFRSF1A because it was of particular functional interest. Two other SNPs were selected on the basis of the gene function: rs2276645 (in ZAP70) and rs4677035 (in FOXP1). All other SNPs were selected on the basis of rank order of P-value. Upon replication genotyping, 6 SNPs were excluded for >10% missingness and 6 SNPs for not being in Hardy-Weinberg equilibrium (P < 0.01), leaving 102 SNPs in the final analysis, not including ancestry-informative markers. Case-control analysis was then performed using the Cochran-Armitage test of trend as implemented in PLINK.
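For readers unfamiliar with the trend test named above, the following is a minimal sketch of the standard Cochran-Armitage statistic for a 2 × 3 genotype table with additive scores (0, 1, 2). It is an illustration written from the textbook formula, not PLINK's code; the function name and argument layout are assumptions.

```python
import math


def armitage_trend(cases, controls, scores=(0, 1, 2)):
    """Cochran-Armitage trend test for genotype counts (AA, AB, BB).

    Returns (chi2, p), where chi2 has 1 degree of freedom.
    Assumes the SNP is polymorphic (otherwise the denominator is zero).
    """
    n_cases = sum(cases)
    n_controls = sum(controls)
    n = n_cases + n_controls
    columns = [r + s for r, s in zip(cases, controls)]

    sum_rt = sum(t * r for t, r in zip(scores, cases))
    sum_nt = sum(t * c for t, c in zip(scores, columns))
    sum_nt2 = sum(t * t * c for t, c in zip(scores, columns))

    numerator = n * (n * sum_rt - n_cases * sum_nt) ** 2
    denominator = n_cases * n_controls * (n * sum_nt2 - sum_nt ** 2)
    chi2 = numerator / denominator
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p
```

The additive scoring tests for a dose-response relationship between allele count and disease status without assuming Hardy-Weinberg equilibrium in the combined sample.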

Meta-analysis of the discovery and replication cohorts was performed using a weighted z-score based method as implemented in the METAL program23. Briefly, two-sided P-values from the discovery and replication sets were converted to signed z-scores that reflected the direction of association given the reference allele. Each z-score was weighted by a quantity proportional to the square root of the effective number of individuals in the sample, with weights chosen so that their squares sum to 1. Weighted z-scores were then summed across studies and the summary z-score converted to a two-sided P-value.
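The weighting scheme just described can be written out in a few lines. The sketch below is an illustration rather than METAL itself: the function names are hypothetical, and the effective sample size formula 4 / (1/N_cases + 1/N_controls) is an assumption commonly used for unbalanced case-control studies, since the text does not state how the effective number was computed.

```python
import math
from statistics import NormalDist

_ND = NormalDist()


def signed_z(p_two_sided, direction):
    """Convert a two-sided P-value to a z-score; direction is +1 or -1."""
    return direction * -_ND.inv_cdf(p_two_sided / 2)


def effective_n(n_cases, n_controls):
    """Assumed effective sample size for a case-control study."""
    return 4 / (1 / n_cases + 1 / n_controls)


def combine(studies):
    """Combine studies given as (p_two_sided, direction, n_effective).

    Weights are sqrt(n_i / sum(n)), so their squares sum to 1 and the
    combined z-score is standard normal under the null hypothesis.
    """
    total = sum(n for _, _, n in studies)
    weights = [math.sqrt(n / total) for _, _, n in studies]
    z = sum(w * signed_z(p, d) for w, (p, d, _) in zip(weights, studies))
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided P-value
    return z, p
```

As a usage example with the cohort sizes and rs10865331 P-values reported above, one could call `combine([(6.1e-15, +1, effective_n(2053, 5140)), (5.5e-6, +1, effective_n(898, 1518))])`; the result depends on the assumed effective sample sizes.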

Imputation analyses were carried out using Markov Chain Haplotyping software (MACH) using phased data from CEU individuals from release 22 of the HapMap project as the reference set of haplotypes. We analyzed only SNPs surrounding disease-associated SNPs that were either genotyped or could be imputed with relatively high confidence (R2 ≥ 0.3). Association analysis of imputed SNPs was performed assuming an underlying additive model and including the first four EIGENSTRAT eigenvectors as covariates, using the software package MACH2DAT (see URLs), which accounts for uncertainty in prediction of the imputed data by weighting genotypes by their posterior probabilities. Power calculations were performed using the Genetic Power Calculator (see URLs).
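Weighting genotypes by their posterior probabilities, as described above, amounts to replacing a hard genotype call with an expected allele count (dosage) under an additive model. The following sketch shows that calculation only; the function name is hypothetical and the code does not reproduce MACH2DAT, which fits a regression model on such quantities.

```python
def expected_dosage(p_aa, p_ab, p_bb):
    """Expected count of the B allele given posterior genotype probabilities."""
    total = p_aa + p_ab + p_bb
    if abs(total - 1.0) > 1e-6:
        raise ValueError("posterior probabilities must sum to 1")
    # 0 copies for AA, 1 copy for AB, 2 copies for BB.
    return p_ab + 2 * p_bb
```

A confidently imputed genotype gives a dosage near 0, 1 or 2, whereas an uncertain one gives an intermediate value that contributes less sharply to the association test.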

Gene expression.

For the microarray experiment, we studied 28 subjects classified as having ankylosing spondylitis according to the modified New York criteria19, and 28 age- (±5 years) and sex-matched controls. Cases had active disease, with Bath Ankylosing Spondylitis Disease Activity Index (BASDAI) scores of >4.0 and/or C-reactive protein >10 mg/l and/or erythrocyte sedimentation rates > 25 mm/h. cRNA samples were hybridized to Illumina HumanHT-12 V3 Expression BeadChips (Illumina) according to the manufacturer's protocol. Samples were processed in two batches, one of 34 samples (17 cases and 17 matched controls) and one of 22 samples (11 cases and 11 matched controls), with samples in each batch processed in parallel for hybridization using the Direct Hyb Assay and read on an Illumina BeadArray Reader. Array data were processed using the Illumina BeadStudio software, and the processed data were then analyzed using BRB-ArrayTools version 3.8 (see URLs). Data from the two batches were then combined by meta-analysis using the R package GeneMeta24.

Construction of small-RNA libraries and ultrahigh-throughput sequencing.

Total RNA was extracted from PBMCs from ankylosing spondylitis samples and healthy controls using Trizol reagent (Invitrogen) according to the manufacturer's instructions. Ten micrograms of total RNA were used for construction of small RNA sequencing libraries using Illumina's Small RNA Sample Preparation kit version 1. One microgram of total RNA was used for construction of EST libraries using Illumina's NlaIII Tag Sample preparation kit. Sequencing was performed on the Illumina Genome Analyzer II using Illumina's reagents. Library preparation and sequencing were performed according to the manufacturer's protocols.

Analysis of sequencing data.

Base calling, sequence-read quality assessment and alignment of sequence reads to the reference human genome (hg18; UCSC Genome Browser) were performed using Illumina's Data Analysis Pipeline software version 1.3. Subsequent sequence data analyses were carried out as described13.

URLs.

MACH, http://www.sph.umich.edu/csg/abecasis/MACH/; Genetic Power Calculator, http://pngu.mgh.harvard.edu/~purcell/gpc/; Alfred Allele Frequency Database, http://alfred.med.yale.edu/alfred/; BRB-ArrayTools, http://linus.nci.nih.gov/BRB-ArrayTools.html; UCSC Genome Browser, http://genome.ucsc.edu/.