Introduction

Basques have been the subject of many studies during the last decades due not only to cultural but also to biological characteristics that place them as an isolated and unique population within Europe. Their language, Basque or Euskara, is not Indo-European, and it is not related to any other extinct or extant language. They have historically settled the Western Pyrenees, in the Franco-Cantabrian region, between Spain and France [1]. The large number of archeological sites from different periods, as well as different genetic studies with present-day and ancient samples, showed that this region acted as one of the most densely populated European glacial refugia during the Last Glacial Maximum. Despite controversial studies [2], a large expansion is thought to have originated in the same area around 12,000 years ago (ya), in the Late Glacial when the climatic conditions started to improve in Europe [3, 4]. Many controversies surround the origins and the population history of Basques, which partial or limited population sampling has certainly not helped to settle. Several studies had suggested that Basques are descendants of an Upper-Paleolithic hunter-gatherer population that remained isolated in the glacial refugia and maintained very low contacts with surrounding populations, avoiding gene flow since the Neolithic [5]. However, recent studies point to gene flow between hunter-gatherers and early farmers from Chalcolithic and Bronze Age in the region [6]. In the general framework created by ancient DNA studies, Basque distinctiveness seems to arise from their reduced steppe ancestry as compared to other European populations, indicating limited contacts with Bronze Age migrants [7]. The genetic uniqueness of Basques within the European continent has been shown from classical markers [8], uniparental lineages [4] and autosomal markers [9]. However, some studies have challenged the genetic differentiation of Basques compared to other European populations [10]. 
One of the most striking genetic singularities of Basques is related to the Rhesus (Rh) system, since they show one of the highest frequencies of the RhD negative allele in human populations [1, 11, 12, 13].

The human Rh system is a set of antigens that are expressed on the membrane of red blood cells. These antigens are encoded by two homologous genes, RHD and RHCE, located at chromosomal position 1p34.1-1p36 with opposite orientations (Figure S1). Both genes show a high sequence identity (98%) since they originated from an ancestral gene duplication within primates [14, 15]. Among the current 61 Rh antigens recognized by the International Society of Blood Transfusion (ISBT) [16], only five are of extensive interest because of their importance in hemolytic reactions from transfusion incompatibilities, the hemolytic disease of the newborn (HDN) and autoimmune diseases. The RHD gene encodes the D antigen that defines the RhD positive variant (D), while the RhD negative variant (d) is caused by the direct deletion of the gene in the vast majority of the cases [14, 15, 17]. The RhD negative phenotype is recessive and, thus, only shows when the two copies of the RHD gene in the individual are deleted. In the RHCE gene, different point mutations define the C/c and E/e antigen pairs [14, 17]. The actual function of the RH genes, beyond their role as antigens, has barely been studied. The available information has been inferred only from their homology with other genes in their family, which are involved in erythrocyte membrane structure and/or transport of ammonia or carbon dioxide [18]. A well-known fact is its association with the HDN. Although other Rh alleles and blood group systems can be related to the HDN, most of the cases are associated with the RhD negative variant. HDN takes place when the immune system of a RhD negative mother reacts producing anti-D antibodies after the exposure to red blood cells from a previous RhD positive pregnancy [19]. The HDN incidence in Europeans was 1 in 20 births to RhD negative mothers with a mortality around 20–40% before 1968, when Rho(D) immune globulin started to be used as preventive treatment for HDN [20, 21]. 
Given the association of the Rh negative allele with HDN, one would expect this allele to have been selected against (at least until quite recently) and thus to have been maintained at low frequencies or even to have disappeared from populations. Note, though, as discussed below, that only heterozygotes are affected by HDN, and their death removes one copy of each allele, leading to an unstable equilibrium. Thus, the existence of the RHD deletion at polymorphic frequencies and, in particular, its high frequency in Basques need to be explained. Both demographic and adaptive hypotheses have been posited, including drift by isolation and heterozygote advantage, respectively, without clear conclusions. As for heterozygote advantage, an unknown positive effect may counteract the disadvantage of the association with HDN. However, a complete understanding of the adaptive nature of the RHD and RHCE genes is complicated by the fact that their functions are hardly known.
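The unstable equilibrium mentioned above can be illustrated with a minimal sketch. It assumes random mating, treats a fraction `s` of Dd children born to dd mothers as lost to HDN, and ignores the fact that first RhD positive pregnancies are usually unaffected. The function names and the value of `s` are hypothetical and are not taken from the study.

```python
def next_d_frequency(q, s):
    """Frequency of the d allele after one generation of HDN-related loss.

    q: current frequency of the RHD deletion (d)
    s: assumed fraction of Dd children of dd mothers that die (hypothetical)
    """
    p = 1.0 - q
    # dd mothers have frequency q*q; their child is Dd when the father
    # transmits D (probability p). A fraction s of these children is lost.
    lost = s * p * q * q
    # Each lost child carries one D and one d, so half of the lost
    # alleles are d.
    return (q - 0.5 * lost) / (1.0 - lost)


def iterate(q0, s, generations):
    """Apply the recursion repeatedly, starting from frequency q0."""
    q = q0
    for _ in range(generations):
        q = next_d_frequency(q, s)
    return q


if __name__ == "__main__":
    for q0 in (0.472, 0.5, 0.528):
        print(q0, round(iterate(q0, 0.05, 500), 4))
```

Under these assumptions the change per generation is proportional to (q - 0.5), so frequencies below 0.5 drift downwards, frequencies above 0.5 drift upwards, and q = 0.5 is an equilibrium that any perturbation breaks.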

Since Etcheverry detected for the first time a high frequency of the RhD negative variant in Basque emigrants to Argentina [22], many studies have analyzed this variant, confirming the high frequency of the RhD negative allele using different methods and datasets [1, 11,12,13]. However, most of these studies were based on classical serologic methods and focused on the description of the antigens without describing the full genetic diversity of the system. In the present study, we aim to analyze the genomic region around the RH genes in Basques and other populations in order to better understand the polymorphism of the system in the Basque population and contrast our results with previous immunological studies.

Materials and methods

Sample collection

DNA was extracted from blood samples collected from 53 Basque, 17 Moroccan, and 12 Catalan unrelated volunteers with their four grandparents born in the geographical region studied. All samples were collected with the appropriate informed consent and the project obtained the ethics approval from the local Institutional Review Board, Comitè Ètic d’Investigació Clínica–Institut Municipal d’Assistència Sanitària (CEIC-IMAS) in Spain (2013/5429/I). In order to compare the Rh data in our samples with other populations, we included previously published HapMap samples from 22 Yorubans (YRI), 21 Han Chinese (CHB) and 20 Central Europeans (CEU) analyzed for the Rh system [23]. The chimpanzee sequence mapped on the human genome GRCh37/hg19 from the UCSC database [24] was used as an outgroup for some of the analyses.

Genotyping and sequencing of the Rh region

RHCE SNPs rs676782 (C/c; polymorphism P103S) and rs609320 (E/e; polymorphism A226P) were genotyped as in Perry et al. [23], by co-amplifying the surrounding regions with FAM and HEX dye-labeled primers and then digesting the products with the MnlI restriction enzyme, whereas the RHD deletion was determined through a TaqMan quantitative PCR assay, also performed as in Perry et al. [23]. Since the RhD negative variant is due to the deletion of the RHD gene and both RHD and RHCE present a high sequence identity, we focused our analysis on the flanking regions of these genes. We sequenced two regions of 6 kilobases (kb) each, upstream and downstream of the RHD gene outside the Rhesus boxes (GRCh37/hg19 coordinates chr1:25578836-chr1:25585562 and chr1:25664732-chr1:25671293, respectively) as delimited by Perry et al. [23], and a region of 6 kb downstream of the RHCE gene (chr1:25682780-chr1:25688684). We amplified these three regions following a Long Touchdown PCR protocol using the primers by Perry et al. [23] and newly designed primers for the downstream region of the RHCE gene (Table S1 and Figure S1). After the PCR amplification, the concentration of the products was normalized in order to prepare a library using the Illumina® Nextera XT DNA Library Preparation Kit before sequencing on an Illumina MiSeq® with 2 × 250 cycles.

Mapping, variant calling, and phasing

Sequencing reads were mapped to the reference genome GRCh37/hg19 using the Burrows-Wheeler Aligner (BWA) software [25]. Mean sequencing coverage for each sample was calculated (Figure S2) in order to ascertain sequence quality. Then, SNP calling was performed with the Genome Analysis Toolkit (GATK) v3.5 software package [26]. To define the Rh phenotype in our samples, we selected a position within the RHD gene (chr1:25627957) and created a virtual SNP to represent the lack (T) or presence (C) of the gene copy, referring to the RhD negative or RhD positive phenotype, respectively. Then, we merged our data with those from Perry et al. [23] and used the SHAPEIT v2.778 software [27, 28] to infer haplotypes. Generated data files are available at: https://figshare.com/s/e0fce29f846e741c601c. Sample and SNP genotyping information is available under BioProject accession number PRJNA473473. Sequence data can be found under GenBank accession numbers MH404266-MH404429 for the RHD upstream flanking region, MH404430-MH404569 for the RHD downstream flanking region, and MH404570-MH404669 for the RHCE downstream flanking region.
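The virtual SNP encoding described above can be sketched as follows. This is an illustration only: it assumes that the TaqMan assay yields an RHD copy number of 0, 1 or 2 per individual, and the function names are hypothetical rather than part of the authors' pipeline.

```python
# Position of the virtual SNP within RHD (GRCh37/hg19), as given in the text.
VIRTUAL_SNP = ("chr1", 25627957)


def virtual_genotype(rhd_copies):
    """Encode the RHD copy number as an unphased diploid genotype.

    C stands for presence of the gene copy, T for its absence (deletion).
    """
    if rhd_copies not in (0, 1, 2):
        raise ValueError("expected 0, 1 or 2 copies of RHD")
    return "C" * rhd_copies + "T" * (2 - rhd_copies)


def expected_phenotype(genotype):
    """RhD negative is recessive: it requires two deleted copies (TT)."""
    return "RhD negative" if genotype == "TT" else "RhD positive"


if __name__ == "__main__":
    for copies in (0, 1, 2):
        g = virtual_genotype(copies)
        print(VIRTUAL_SNP, copies, g, expected_phenotype(g))
```

Encoding the deletion as an ordinary biallelic site lets it be merged and phased together with the flanking SNPs without special handling.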

Sequence analysis

Summary statistics were calculated using DnaSP v5.10.1 (ref. [29]), and genetic structure, measured by FST values, was calculated by region using Arlequin v3.5.2.2 (ref. [30]). Patterns of linkage disequilibrium (LD) within populations were calculated with Haploview v4.2 (ref. [31]). Each LD block analysis was performed by population, and all three Rh regions were analyzed and plotted together in order to detect LD between different regions. Blocks were estimated by using the confidence intervals algorithm defined by Gabriel et al. [32], and Hardy-Weinberg p-values and minimum minor allele frequency cut-offs were set at 0.0001 and 0.05, respectively. The virtual RHD SNP was excluded from the LD block analyses. In order to define the relationship of the inferred haplotypes over the regions among populations and Rh genotypes, median-joining networks were obtained for both RHDup and RHDdown regions, together and separately, by using the Fluxus Network v5 and Fluxus Network Publisher v2.1.1.2 programs [33]. FST values for the D/d, E/e, and C/c variants were calculated as recommended by Bhatia et al. [34] in order to avoid artifacts due to sample size differences, adapting the R script by Di Gaetano et al. [35]. To compare the Rh values to a genomic reference, genome-wide FST values of 74255 SNPs were estimated in the CEU, CHB and YRI populations from HapMap Phase III [36], Moroccans [37], Basques [38], and Catalans (unpublished data).
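As an illustration of a single-SNP FST that is robust to unequal sample sizes, the sketch below implements Hudson's estimator in the form given by Bhatia et al. It is an assumption that this is the estimator meant by "as recommended by Bhatia et al."; the code is not the adapted R script used in the study, and the function names are hypothetical. Sample sizes are taken here to be numbers of sampled chromosomes.

```python
def hudson_components(p1, n1, p2, n2):
    """Numerator and denominator of Hudson's FST estimator for one SNP.

    p1, p2: sample allele frequencies in populations 1 and 2
    n1, n2: numbers of sampled chromosomes (assumed convention)
    """
    num = ((p1 - p2) ** 2
           - p1 * (1.0 - p1) / (n1 - 1)
           - p2 * (1.0 - p2) / (n2 - 1))
    den = p1 * (1.0 - p2) + p2 * (1.0 - p1)
    return num, den


def hudson_fst(p1, n1, p2, n2):
    """Single-SNP FST; returns NaN when the site is fixed for the same allele."""
    num, den = hudson_components(p1, n1, p2, n2)
    return num / den if den > 0 else float("nan")


def hudson_fst_multi(sites):
    """Multi-SNP FST as a ratio of averages over (p1, n1, p2, n2) tuples."""
    parts = [hudson_components(*site) for site in sites]
    total_den = sum(d for _, d in parts)
    if total_den <= 0:
        return float("nan")
    return sum(n for n, _ in parts) / total_den


if __name__ == "__main__":
    # Toy frequencies, not values from the study.
    print(hudson_fst(0.9, 44, 0.3, 106))
```

The two subtracted terms correct for sampling variance in each population separately, which is why the estimate does not depend on the ratio of sample sizes.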

Results

RHD and RHCE phenotypes, genotypes, and haplotypes

Genotyping results and genotype frequencies of Basques and the other analyzed populations are available in Table S2. Estimated allele frequencies for the D/d, C/c, and E/e variants and expected phenotypes for the RHD and RHCE genes are shown in Fig. 1 and Table S3. All three polymorphisms were in Hardy–Weinberg equilibrium after Bonferroni correction in all samples. As previously reported, European populations present the highest estimated worldwide frequencies of the RHD deletion (d) [39]. Furthermore, our analyses show that the highest frequency of the RHD deletion is found in Basques (47.2%), in agreement with other analyses [12, 22]. However, the frequencies found in our Basque samples are not as extreme as shown in some previous studies, which were mainly based on antigen reactions. In particular, the frequency of the RHD deletion in Basques is just slightly higher than that found in Catalans (41.7%, p-value = 0.625). Regarding the functional polymorphisms of the RHCE gene, the frequency of the derived C allele is very low in sub-Saharan Africans, whereas the C/c alleles showed intermediate frequencies in the remaining populations, as first reported by Mourant in 1954 (ref [13]). Finally, the frequency of the derived E allele is low in most of the analyzed populations (Fig. 1a and Table S3). When we explored the haplotypes defined by the three RHD/RHCE polymorphisms (Fig. 1b), the most frequent RhD negative haplotype is dec [14, 39, 40], except in Moroccans where haplotype deC is more common. Among RhD positive haplotypes, the ancestral haplotype Dec [39, 40] is the most frequent in Africans, whereas in non-African populations haplotype DeC is found at higher frequencies [41]. The dEC haplotype was not found in our population dataset, and DEC was only found at low frequencies in the CHB population. 
These two haplotypes are known to be the most complex derived haplotypes since they have been suggested to be originated by recombination of the ancestral forms [14, 39, 40]. Regarding the haplotype frequencies found in Basques, these are concordant, though with small differences, with previous immunological studies. In particular, Basques present the highest frequency of the dec haplotype (41.5%) and the lowest frequencies of the deC (4.7%), dEc (0.9%), and Dec (4.7%) haplotypes among the European populations analyzed here [11, 12].
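The comparison of deletion frequencies between Basques and Catalans can be illustrated with a small sketch. The source does not state which test produced the reported p-value, so the uncorrected chi-square test on a 2 × 2 table of allele counts is an assumption; the allele counts are reconstructed from the reported percentages and sample sizes (53 Basques, 12 Catalans), and the function name is hypothetical.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (1 d.f., no continuity correction) for [[a, b], [c, d]]."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Survival function of the chi-square distribution with 1 d.f.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Allele counts reconstructed from reported frequencies (assumption).
basque_chrom, catalan_chrom = 2 * 53, 2 * 12
basque_d = round(0.472 * basque_chrom)
catalan_d = round(0.417 * catalan_chrom)

chi2, p_value = chi_square_2x2(basque_d, basque_chrom - basque_d,
                               catalan_d, catalan_chrom - catalan_d)

# Expected frequency of the recessive RhD negative phenotype in Basques
# under Hardy-Weinberg proportions.
expected_rhd_negative = 0.472 ** 2

if __name__ == "__main__":
    print(chi2, p_value, expected_rhd_negative)
```

With these reconstructed counts the test gives a p-value close to the one reported in the text, which is consistent with, but does not prove, this being the test that was used.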

Fig. 1

Frequency plots related to the three analyzed RHD/RHCE polymorphisms. a Frequencies of the variants within populations. b Frequencies of the expected phenotypes within populations. Bas Basque, Cat Catalans, CEU HapMap Europeans, CHB HapMap Chinese, Mor Moroccans, YRI HapMap Yorubans

RHD/RHCE sequence analysis

In the present study we sequenced three ~6 kb regions flanking the RHD and RHCE genes in Basques and Catalans: the RHD upstream (RHDup) region, the RHD downstream (RHDdown) region, and the RHCE downstream (RHCEdown) region. Additionally, we also sequenced the RHDup and the RHDdown regions in Moroccans and compiled available genotyping and sequencing data for the same two regions in CEU, CHB and YRI from Perry et al. [23]. A total of 47 SNPs were found in the RHDup region, 49 SNPs in the RHDdown and 13 in the RHCEdown (data available only for Basques and Catalans) (SNP information by individual is available in Table S4; inferred haplotypes of each flanking region in the studied populations are shown in Table S5). No indels were detected in any of the regions of our dataset. For the RHDup and the RHDdown regions, sub-Saharan Africans presented a higher number of SNPs as well as higher nucleotide and haplotype diversity compared to the rest of the populations (Table 1) in accordance with the out-of-Africa bottleneck. Both diversity indexes and the corresponding FST pairwise population comparisons place sub-Saharan Africans as the most differentiated population (Figure S3, Table S6).

Table 1 Summary table of the RH flanking regions by population

LD analysis sheds light on the forces leading the evolution of a genomic region as well as its geographic subdivision. In order to test if LD patterns could be related to the higher frequency of RhD negative in Basques, we estimated LD blocks by population around the genomic region analyzed. Patterns of LD in the RHD/RHCE region are similar in all three European populations analyzed, including Basques (Figure S4). Notably, an LD block in the RHDdown region is found in all populations analyzed. This block includes the SMP1/TMEM50A gene, located between both RH genes but apparently functionally unrelated to the Rh system. In Europeans, this block extends upstream and includes the RHDup region. The C/c and E/e alleles were not part of this LD block in all cases.
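For readers unfamiliar with the quantities behind LD blocks, the following sketch computes the pairwise statistics D, D' and r² from phased haplotypes. It gives point estimates only; the Gabriel et al. block definition used by Haploview relies on confidence intervals for D', which are not reproduced here. The function name and input format are hypothetical.

```python
def ld_stats(haplotypes):
    """Pairwise LD between two biallelic sites.

    haplotypes: list of (a, b) tuples, one per phased chromosome,
                with alleles coded 0 or 1 at each site.
    Returns (D, D_prime, r2), or None if either site is monomorphic.
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n
    p_b = sum(b for _, b in haplotypes) / n
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n

    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return None

    d = p_ab - p_a * p_b
    # Largest |D| attainable given the allele frequencies and the sign of D.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    return d, abs(d) / d_max, d * d / denom


if __name__ == "__main__":
    # Toy data: two sites in complete association.
    print(ld_stats([(1, 1)] * 5 + [(0, 0)] * 5))
```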

In order to determine the relationship between the haplotypes in the RHD/RHCE region, a network analysis was performed. Similar patterns were found in all obtained networks when analyzing both RHD regions together (Fig. 2) or separately (Figure S5). Mainly, they show more diversity in Africa and clear clusters of sequences can be distinguished according to their Rh phenotypes. Two main differentiated sections corresponding to RhD positive and RhD negative haplotypes are clearly observed, which suggests that the RHD deletion is the result of a major single mutational event [15]. There are other minor cases in the network that could be explained by artifacts in the haplotype reconstruction, mistyping, or recurrent events. The RhD negative section is mainly represented by Basques and is less diverse than the RhD positive section (Table 1). This lack of diversity might be expected since the RhD negative phenotypes derive from the ancestral RhD positive form [14, 15, 17]. The ancestral Dec haplotype is mainly distributed around the center of the network.

Fig. 2

Network analysis of haplotypes from both RHDup and RHDdown flanking regions. The network was colored by population (a) and by the haplotypes of the three RHD/RHCE polymorphisms (b). Networks for the separate regions are shown in the Supplementary materials (Figure S5)

Unexpectedly, haplotype analysis shows that non-sub-Saharan RhD positive haplotypes are associated with the derived C allele (with high frequencies of the DeC haplotype), while sub-Saharan RhD positive haplotypes are mostly found associated with the ancestral c allele (Fig. 1, Fig. 2, Figure S5 and Table S3). Moreover, variation at the RH flanking regions is quite conserved in the DeC haplotypes, most of them presenting similar sequences that determine four main nodes and several minor haplotypes, forming a star-like structure in the network. Next, we computed FST to test whether population-specific positive selection [42, 43] could explain this haplotype pattern. To this end, the FST values for the C/c variant were compared to the genome-wide distribution in each population taking into account all haplotypes, as well as with RhD positive and RhD negative haplotypes considered independently (Fig. 3, Table S7). Our results confirmed the extreme FST values for the C/c variant not only when comparing YRI to CEU and CHB, as reported by Perry et al. [23], but also when comparing YRI to Basques, Catalans, and Moroccans. Notably, the FST values for the C/c variant were extreme and fell above the 95th percentile of the genomic distribution of FST values in all population comparisons when considering only RhD positive haplotypes.
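The comparison of a single-variant FST against a genome-wide background can be sketched as an empirical percentile. This is a minimal illustration with a synthetic background; the function names and the cutoff argument are hypothetical, and no values from the study are used.

```python
def empirical_percentile(value, background):
    """Percentage of background FST values that are <= the observed value."""
    below = sum(1 for x in background if x <= value)
    return 100.0 * below / len(background)


def is_outlier(value, background, cutoff=95.0):
    """True if the observed value reaches the chosen percentile cutoff."""
    return empirical_percentile(value, background) >= cutoff


if __name__ == "__main__":
    # Synthetic background standing in for genome-wide FST values.
    background = [i / 1000 for i in range(1000)]
    print(empirical_percentile(0.9495, background))
    print(is_outlier(0.3, background))
```

An empirical percentile makes no assumption about the shape of the genome-wide distribution, but a high rank at one marker is descriptive and is not by itself evidence of selection.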

Fig. 3

Unusual patterns of population differentiation for the C/c variant in non-African populations. Lines are the distribution of the genome-wide FST values calculated between YRI and the remaining populations included in our analysis. Dots represent the FST values of the C/c variant. The dotted line represents the 95th percentile. The analysis was performed for all our haplotypes together (a), and for RhD negative (b) and RhD positive (c) haplotypes separately

Discussion

In our analyses of allele and haplotype frequencies for the RHD/RHCE variants in Basques, we obtained observations similar to those of previous immunological studies of the Rh system, with no striking patterns of allele, haplotype or sequence diversity particular to Basques. Haplotypic composition and allele frequencies are very close to those of other European populations, suggesting that their differentiation, at least based on Rh polymorphisms, is lower than usually expected [12, 22]. It has been shown that the lowest frequencies of RhD negative are found in Asia, in indigenous populations from America and the Pacific area, as well as in Africa. On the other hand, the reported frequencies are higher in Europe, especially in Basques [39]. The RhD negative frequency in our Basque samples (47.2%), despite being the highest among the analyzed populations, is not as extreme as usually suggested, but lies within the range reported in previous immunological studies around the Franco-Cantabrian region (45–54%) [1, 11, 12, 13, 22]. In fact, the frequency of the RhD negative allele is 41.7% in our Catalan sample, and other studies have also reported relatively similar frequencies in other surrounding North Iberian populations [44]. The small differences we find in the frequency of RhD negative in Basques, as compared to previous analyses, could be due to sampling bias or to the differing accuracy of serological methods and DNA data. Thus, rather than isolated extreme values, high RhD negative frequencies in Basques seem to be the end-point of a cline that also encompasses neighboring populations. We confirm in Basques the highest frequency of the dec haplotype and the lowest frequencies of DEC and dEc. However, we do not observe a lower frequency of DEc in Basques compared to Western Europeans, as reported by previous studies [11, 12]. Instead, this haplotype shows frequencies similar to those found in other European populations [39]. 
Even though the origins of the higher frequency of the RhD negative variant in Europe, principally in Basques, are still controversial, our results suggest that it stems from a single major deletion event. Thus, a different origin or repeated, independent mutations cannot be invoked to explain the higher frequency of the RhD negative allele in Basques.

Demographic processes have probably been the main factor in the evolution of the frequencies of the Rh polymorphisms. Bottlenecks and low effective population sizes reduce the effectiveness of potential selection, and, in social contexts where family size is tightly controlled, new pregnancies could compensate for the effect of neonatal HDN [45]. Clear evidence of selection around the system has not been shown yet. Frequencies of RhD negative higher than those of RhD positive are not observed in any population. Moreover, the genotype with the highest frequency is the heterozygote in most cases, except in the populations with a very low frequency of RhD negative (Fig. 1, Tables S2 and S3). Hence, studying selection around the Rh system is much more complex than a simplistic analysis of directional selection against or in favor of RhD negative depending on the population. A heterozygote disadvantage scenario has been posited, since the potential HDN cases are directly associated with heterozygous children (Dd) from RhD negative mothers (dd) and RhD positive fathers (DD or Dd). This would lead to an unstable equilibrium of the frequencies, with a tendency towards an increase of the major allele and its homozygote in a specific population [21]. Alternatively, a balancing selection scenario has also often been suggested to explain the polymorphism and the high frequencies of RhD negative in populations, either by frequency-dependent selection or by some selective advantage linked to the RHD deletion that could overcome its association with HDN. In the latter case, a possible heterozygote advantage through protection against Toxoplasma gondii has been suggested [21, 46, 47]. However, this kind of association has not been clearly demonstrated, and analyzing balancing selection is a complex issue [45, 46, 47, 48, 49]. 
Thus, not only demographic processes but maybe also different selective processes may have been related to the origin of the diverse frequencies of RhD negative/RhD positive in populations. Basques have been characterized by their attractive and unsolved history, probably determined by isolation and drift. Hence, reduced effective population size in a historically isolated population may have decreased the efficiency of the purifying selection associated with HDN, and some events may have generated specific selective pressures and set a different equilibrium state for the RHD deletion allele [23, 45, 46, 49].

Finally, our data confirmed an association between the derived RhC allele and the RhD positive variant, as well as extreme values of FST for the Rh C/c variant when comparing sub-Saharan and non-sub-Saharan populations, consistent with the unexpected result from Perry et al. [23]. In that study, the high population differentiation was suggested to result from a possible local positive selection process around the RHCE region. However, they did not obtain solid evidence of signals of selection. Moreover, as reported by Gardner et al. [50], extreme values of FST in individual markers do not imply population-specific selection, and the selection signal should extend to the surrounding regions. Furthermore, the scarce information about the function of the RH genes makes it difficult to provide hypotheses for the adaptive value of RH. It is pivotal to increase the available information with further genetic studies, especially those based on the functions of the RHD/RHCE system products and their correlation with genotypes, to better understand the evolution and population genetics of the Rh system in humans.