Introduction

Genetic studies investigating uniparental and fine-scale autosomal variation in Estonia [1] and in its neighbouring populations in NEE [2,3,4,5,6] observed that the regional genetic structure correlates closely with geography. In addition, recent ancient DNA studies have begun to uncover the settlement history of NEE, which is distinct from that of central and southern parts of the continent [7,8,9].

The four most common chrY haplogroups (hgs) with incidence above 5% (R1a-M198, N3-TAT, I-M170, R1b-M343) constitute over 90% of the chrY pool in NEE [3, 10,11,12]. Several studies have analysed these hgs in a wide phylogeographic context.

Besides the four most common hgs, several paternal lineages belonging on the basic level to hgs E2, J2, G and Q, each with a frequency of up to 5%, complement the pool of Y-chromosomes in NEE [3, 4, 13,14,15]. In Europe, hgs E2a, J2 and G are common in the southern Mediterranean populations and form 20–30% of their chrY lineages. In NEE, the frequency of hg E2a’d is ~2–3%, while hgs J2 and G reach ~1–2% and ~1% of the total pool of chrY lineages, respectively [3, 5, 6, 15, 16]. Hg Q has a frequency of 1–3% in most European populations with the highest incidence in Sweden [3, 4]. Hg Q is otherwise widely spread in Siberian populations and is among the major founding male lineages in the peopling of the Americas [17, 18]. These rare hgs, which make up less than 10% of NEE male lineages, are mostly left unexplained and are often regarded as recent scattered entries into populations. Small sample sizes and low phylogenetic resolution have not allowed separation of rare lineages beyond the major hg labels. The sequencing of complete Y-chromosomes provides a way to resolve the inner structure of lineages on the phylogenetic tree regardless of their prevalence in populations [13, 14, 19, 20]. Sequencing a considerable number of well dispersed samples from NEE reveals the distribution of rare lineages on the entire phylogenetic tree and provides sufficiently granular data to estimate their split times. This builds the necessary geographic and chronological context for surveying patterns of uncovered lineage clusters stemming from a single node and hallmarking local expansions. The coalescence ages of ancestral internal nodes and phylogenetically well-defined clusters nested within disclose the geography and timeframe of local expansions as well as possible gene flow involving ancestral carriers of rare male lineages in Estonia, Sweden and their neighbouring populations.

Here we aim to analyse the previously understudied rare chrY lineages with a focus on Estonia and Sweden together with their NEE neighbours and Germany to account for the historic influence of the Baltic Germans. Additional populations are included to widen the geographic context. We combined full sequences of Y-chromosomes from populations inhabiting Estonia, Sweden, Finland, Latvia, Lithuania, Poland, Germany, Ukraine and the Russian Federation to build updated phylogenetic trees for haplogroups rare in NEE. In order to mitigate sampling bias that might influence any conclusions drawn from such a rare substratum present among the populations, we tested the representativeness of our two largest cohorts sampled from the Estonian and Swedish populations by comparing their frequency compositions with sample sets independently obtained from the same two populations.

Materials and methods

Samples

We screened the occurrence of rare hgs in a sample of 1160 chrY sequences from male donors (selected randomly by county of birth) from the population-based Estonian Biobank [21]. The Estonian chrY sequences are part of the whole genome sequencing (WGS) data set autosomally first described in Mitt et al. [22] for constructing a population-specific imputation panel. Only chrY sequences of the haplogroups rare in NEE (N = 64) are included in the current study. Next, in scientific collaboration with the commercial genetic testing company Gene by Gene (Houston, Texas, USA), we screened the collection of customers who had provided informed consent for their data to be used in scientific inquiry. This resulted in a total of 2018 male donors with self-reported ancestry from Sweden, Finland, Latvia, Lithuania, Poland, Germany, Ukraine and the Russian Federation. If the database contained more than 500 samples from a respective country, individuals with identical self-reported paternal and maternal origin were preferably selected. In case of smaller available sample sets, all samples with self-reported paternal origin from the respective country were selected. From the resulting set of 2018 samples, we detected 222 Y-chromosomes belonging to the rare NEE haplogroups and these samples were included in the current study. We collected an additional 135 chrY sequences from published sources, resulting in the final set of 421 chrY sequences (Supplementary Table S1) used for reconstruction of phylogenetic trees for rare hgs E2 (131 samples), J2 (136 samples), Q (83 samples) and G (71 samples) (Figs. 1, 2 and Supplementary Figs. S1–S7).

Fig. 1: Schematic phylogenetic trees of hg E2a and J2b.

The calibrated trees were constructed using BEAST v.1.7.5 software package. Internal nodes, sub-clade names and population names (numbers show the number of samples) are indicated. Internal nodes with posterior probabilities <0.73 are not shown. Samples from Estonia and Sweden are marked in blue and orange, respectively. a A schematic phylogenetic tree of hg E2a is based on 132 high-coverage chrY sequences. Neighbour-clade E2b and its sublineages are marked in grey. Detailed tree can be found in Supplementary Materials (Supplementary Fig. S5). Age estimates can be found in Supplementary Table S8. All the subclade (node) defining mutations and marker names are presented in Supplementary Table S4. b A schematic phylogenetic tree of hg J2b is based on 136 high-coverage chrY sequences. Neighbour-clade J2a and its sublineages are marked in grey. Detailed tree can be found in Supplementary Materials (Supplementary Fig. S6). Age estimates can be found in Supplementary Table S9. All the sub-clade (node) defining mutations and marker names are presented in Supplementary Table S5.

Fig. 2: Detailed phylogenetic tree of hg Q.

A detailed phylogenetic tree of hg Q-M242 is based on 84 high-coverage chrY sequences. Two hg R1a sequences were used for an outgroup. The detailed calibrated tree was constructed using BEAST v.1.7.5 software package. Internal nodes, sub-clade names and population names are indicated. Internal nodes with posterior probabilities <0.73 are not shown. Age estimates can be found in Supplementary Table S10. All the subclade (node) defining mutations and marker names are presented in Supplementary Table S6. Samples from Estonia and Sweden are marked in blue and orange, respectively.

To test for possible sampling bias in the two largest sequencing cohorts, we screened the haplogroup frequencies of two independent datasets – a total of 505 chrY sequences available from the SweGen project (samples specifically selected to be representative of the historic Swedish population [23]) and a randomly selected non-overlapping set of 7949 genotyped Estonian male donors from the Estonian Biobank.

Sequencing, mapping and genotyping

ChrY sequences from the Estonian Biobank and the SweGen project were generated with Illumina Inc. (Illumina, San Diego, CA, USA) using HiSeq instruments (PCR-free protocol) and targeted 30x genome-wide coverage. The personal genetic testing company dataset was generated using the proprietary BigY Illumina-based targeted chrY capture sequencing service (https://learn.familytreedna.com/wp-content/uploads/2014/08/BIG_Y_WhitePager.pdf).

We used the same processing pipeline for all Illumina data. Fastq files were mapped with BWA-MEM (v0.7.12) [24] on the human reference hs37d5 (http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence). Read duplicates were removed with Picard (v2.12.0) (http://broadinstitute.github.io/picard/) and remaining unique reads realigned around known indels, followed by base quality score recalibration (BQSR) using GATK (v3.8) [25]. Variant calling was performed with GATK tool HaplotypeCaller in haploid mode. All-sites VCF files were filtered with bcftools (v1.9) [26]. The Illumina data and previously filtered data from Complete Genomics (Supplementary Table S1) were merged with CombineVariants from GATK (v3.8) [25]. We extracted the effective overlap between the two datasets by masking out all positions with 5% or higher proportion of missing genotypes in either Illumina or Complete Genomics datasets. We additionally excluded regions with poor mappability as described previously [13] resulting in a total of 9.7 Mb of analysed sequence. Within this sequence, the resulting numbers of variant positions used for phylogenetic reconstruction in each haplogroup are given in Supplementary Tables S4–S7.
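
The overlap-masking rule described above can be illustrated with a minimal sketch. It assumes genotype calls are held as per-position lists in which None marks a missing call; the function names and this data layout are hypothetical and do not reproduce the bcftools and GATK commands that were actually run.

```python
from typing import Dict, List, Optional

Call = Optional[str]  # None marks a missing genotype call (assumed encoding)


def missing_fraction(calls: List[Call]) -> float:
    """Share of samples without a genotype call at one position."""
    return sum(c is None for c in calls) / len(calls)


def effective_overlap(
    illumina: Dict[int, List[Call]],
    complete_genomics: Dict[int, List[Call]],
    threshold: float = 0.05,
) -> List[int]:
    """Return positions retained in both datasets.

    A position is masked when either dataset has a proportion of
    missing genotypes equal to or above the threshold (5% in the text).
    """
    shared = illumina.keys() & complete_genomics.keys()
    return sorted(
        pos
        for pos in shared
        if missing_fraction(illumina[pos]) < threshold
        and missing_fraction(complete_genomics[pos]) < threshold
    )
```

In this sketch a position with exactly 5% missing calls is masked, matching the "5% or higher" wording.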

Haplogroup assignment

We assigned chrY haplogroups using yHaplo [27] for the Illumina capture and WGS data. We used SNAPPY [28] for chrY haplogroup assignment of the genome-wide array genotyping data.

Comparisons with an independent Estonian cohort

To validate the representativeness of sequenced Estonian chrY samples (N = 1160), we compared the hg frequencies of this cohort against a ~7 times larger cohort of 7949 Estonian male samples genotyped with the Illumina Infinium Global Screening Array v2 (Illumina, San Diego, CA, USA) containing 6638 Y-specific single nucleotide variants (SNVs). To do this, we first assessed the accuracy of haplogroup assignments obtained from this particular set of SNVs. We sub-sampled the 6638 array-specific Y-SNVs from the Estonian WGS data and used SNAPPY software to determine the haplogroups from the extracted set of SNVs. We compared the results against those from the software yHaplo [27], which utilises the full set of SNPs in the WGS samples. The results are identical on the highest level of the major branches and only differ slightly at the finest resolution due to the lower number of array-genotyped SNPs available to SNAPPY for detecting the haplogroups. This shows that hg assignments based on the 6638 array-specific Y-positions are accurate enough to be compared to hg assignments based on full sequencing. The comparison of the hg frequencies of the WGS-based and array-based Estonian datasets was performed using a Wilcoxon signed rank test with continuity correction. We only used array-based data for comparing haplogroup frequencies between two independent cohorts. For the phylogeny reconstruction and phylogeographic analysis full sequencing data were used.
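
As an illustration of the frequency comparison, the following is a minimal pure-Python sketch of a two-sided Wilcoxon signed rank test using the normal approximation with continuity correction. The software used for the test is not stated in the text, so this implementation, its function name and its handling of zero differences (dropped) are assumptions; inputs are paired per-haplogroup frequencies from two cohorts.

```python
import math


def wilcoxon_signed_rank(x, y):
    """Two-sided Wilcoxon signed rank test for paired values.

    Uses the normal approximation with continuity correction and a
    tie-corrected variance. Pairs with zero difference are dropped.
    Returns (V, p) where V is the sum of ranks of positive differences.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")

    # Rank absolute differences; tied values share their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        t = j - i + 1
        tie_term += t ** 3 - t
        i = j + 1

    v = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24 - tie_term / 48)
    correction = 0.5 if v > mean else -0.5 if v < mean else 0.0
    z = (v - mean - correction) / sd
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return v, min(p, 1.0)
```

Called with two equally long lists of haplogroup frequencies (one entry per haplogroup, same order in both cohorts), a large p value indicates no detectable shift between the cohorts.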

Phylogeny reconstruction of rare paternal haplogroups

We reconstructed phylogenies and estimated the coalescent times with the software package BEAST v.1.7.5 [29]. We used a Bayesian skyline coalescent tree prior, the general time reversible (GTR) substitution model with gamma-distributed rates and a relaxed lognormal clock. The run was performed with the piecewise-constant coalescent model. The mutation rate used was 0.74 × 10⁻⁹ (95% CI: 0.63–0.95 × 10⁻⁹) per base pair per year [13]. The results were visualised and checked for effective sample size above 200 in Tracer v.1.4. Coalescence time estimates were computed with normally distributed age priors with 10% standard deviation from previously published phylogeny [13] and are listed in Supplementary Table S3. Lineages from hg R and I were used as outgroups for hgs Q and G, respectively. Each run had a chain length of thirty million steps, logged every 3000 steps, with 10% discarded as burn-in. Two parallel runs with different random number seeds were combined with LogCombiner.
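
The scale implied by these settings can be checked with simple arithmetic. The sketch below is a strict-clock back-of-envelope calculation and not the relaxed-clock inference performed in BEAST; it combines the stated mutation rate with the 9.7 Mb of analysed sequence, and counts the posterior samples left after burn-in. Function names are hypothetical.

```python
MUTATION_RATE = 0.74e-9   # mutations per base pair per year (from the text)
SEQUENCE_LENGTH = 9.7e6   # analysed base pairs (from the text)


def years_per_mutation(rate=MUTATION_RATE, length=SEQUENCE_LENGTH):
    """Expected number of years for one mutation to accumulate on a lineage."""
    return 1.0 / (rate * length)


def naive_age(mutations, rate=MUTATION_RATE, length=SEQUENCE_LENGTH):
    """Strict-clock age, in years, of a node separated by `mutations` SNVs."""
    return mutations * years_per_mutation(rate, length)


def retained_samples(chain_length, log_every, burn_in_percent, runs):
    """Approximate number of logged states kept after burn-in.

    Simplification: ignores whether the initial state is also logged.
    """
    per_run = chain_length // log_every
    kept = per_run - per_run * burn_in_percent // 100
    return kept * runs
```

Under these assumptions one mutation corresponds to roughly 139 years, and two runs of thirty million steps logged every 3000 steps with 10% burn-in leave about 18,000 sampled states.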

The manually annotated phylogenetic trees, mutation lists and coalescence age estimates are available in Supplementary Figs. S1–S4 and Supplementary Tables S4–S11. This study’s updated nomenclature follows the criteria set in Karmin et al. [13].

Bayesian phylogeographic analysis

To illustrate the potential direction of influx of the primarily Estonian subclades in hgs E2a1-CTS1273 and J2b2-L283, we performed Bayesian phylogeographic analyses in continuous space. For this we used available geographic coordinates for 59 sequences belonging to hg E2a1-CTS1273 and for 41 sequences belonging to hg J2b2-L283. This method was originally developed and successfully used to reveal the ancestral location and spatial dynamics of viruses in continuous space [30, 31]. We conducted the analysis according to the publication exploring the history of Y-chromosomal hg J1 [32] in BEAST v1.10.4 [33] using BEAGLE library v3.1.2 [34] for accelerated likelihood evaluation. This data-driven method uses molecular sequence data and geographic coordinates of the samples to infer phylogeography in a continuous landscape while simultaneously reconstructing the evolutionary history in time. It draws the confidence area of ancestral locations where the root and internal nodes originated together with the directions and the speed of the diffusion (Fig. 3). The uncertainties of the maximum clade credibility tree node locations were visualised with SpreaD3 v0.9.7.1rc software [35]. This inference approach accounts for the coalescent, phylogenetic, molecular clock, location, and other uncertainties within a single framework. Additional details are provided in Supplementary Note 1.

Fig. 3: Phylogeographic spread maps of hgs J2b2-L283 and E2a1-CTS1273 in Europe.

Maps indicate the phylogeographic spread of a J2b2-L283 around 6 kya, 4 kya, 2 kya and in the present, and b E2a1-CTS1273 around 5–6 kya, 4 kya, 2 kya and in the present. Shaded in pink are the 80% HPD areas of the node locations inferred by Bayesian continuous phylogeographic analysis with BEAST v1.10.4 software. White circles indicate the median locations of the nodes, while black lines indicate the branches of the maximum clade credibility tree.

Results

Phylogeny of rare lineages in NEE

The studied 1160 high-coverage sequences of Y-chromosomes from Estonia disclose 64 samples carrying male lineages rare in NEE (frequency of each under 3%), amounting to ~6% of the total paternal lineage pool in Estonia. The most frequent minor lineage in Estonia belongs to hg E2 (2.5%), followed by hgs J2 (1.9%) and hg G (0.9%), whereas hg Q is the rarest (0.3%) (Supplementary Table S2). Our second largest sample set consists of a total of 746 males from Sweden and discloses 78 samples with rare NEE chrY lineages. The most common minor haplogroup in the Swedish cohort is hg Q (4.6%); followed by hgs G (3%), E2 (1.7%) and J2 (1.2%) (Supplementary Table S2).
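
The Estonian percentages can be reproduced from sample counts. The sketch below uses the counts of hg E2 (29), hg J2 (22) and hg Q (3) carriers reported elsewhere in this article; the hg G count is not stated explicitly and is derived here as the remainder of the 64 rare-lineage samples, which is an assumption.

```python
ESTONIAN_TOTAL = 1160  # sequenced Estonian Y-chromosomes (from the text)
RARE_TOTAL = 64        # samples carrying rare lineages (from the text)

# Counts reported in the text for three haplogroups.
counts = {"E2": 29, "J2": 22, "Q": 3}
# Assumption: the remaining rare-lineage samples all belong to hg G.
counts["G"] = RARE_TOTAL - sum(counts.values())


def percent(count, total):
    """Frequency as a percentage rounded to one decimal place."""
    return round(100 * count / total, 1)


frequencies = {hg: percent(n, ESTONIAN_TOTAL) for hg, n in counts.items()}
rare_share = percent(RARE_TOTAL, ESTONIAN_TOTAL)
```

The resulting values match the reported 2.5%, 1.9%, 0.9% and 0.3%, and the combined share of 5.5% corresponds to the rounded ~6% given above.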

To verify the robustness of our frequency estimates, we compared hg frequencies of our Swedish sample set and the SweGen cohort (N = 505) [23]. The Wilcoxon signed rank test showed no statistically significant differences between the two, either considering all hgs (p value = 0.4689) or minor hgs with major hgs collapsed (p value = 0.6602). Similarly, a comparison of hg frequencies between the Estonian sample set and an independent non-overlapping set of 7949 genotyped male samples from the Estonian Biobank yielded no statistically significant differences in their hg composition, either considering all haplogroups (Wilcoxon signed rank test p value = 0.4896) or rare hgs with major hgs collapsed (Wilcoxon signed rank test p value = 0.9219).

Hg E originated in Africa with its sublineage E1 distributed solely on the African continent, whereas the neighbour-lineage hg E2 displays a notably wider distribution. Subclade E2-V13 is common (~10–20%) among south-eastern European populations [4, 6, 14, 16], falling to 10% in Anatolia and the Middle East [36] and declining towards northern Europe to 1–2% in Scandinavia [4].

Here we reconstruct the phylogeny of hg E2a’b’c’d-M35. Its subclade E2a-M78 is largely confined to Europe with a coalescence time of ~14 kya (95% CI: 10,432–18,566) (Fig. 1a, Supplementary Table S8). Within this subclade, the L618 marker unites almost all European samples that split ~13 kya (95% CI: 9,682–17,360) from the neighbouring clade E2a2-V22. The latter consists primarily of samples from the Middle East with deeper diversification times (Supplementary Fig. S5). The absolute majority (25/29) of hg E samples from Estonia belong to subclade E2a1-V13 (Supplementary Table S2). The bulk of Estonian samples form clearly distinguishable clusters: lineage E2a1-S7461 contains an Estonian founding cluster that splits from the neighbour lineages with Swedish and Middle Eastern origin ~4 kya (95% CI: 3,146–5,752, Supplementary Table S8) and has a radiation time of ~2 kya (95% CI: 1,398–2,999; Supplementary Table S8). A similar pattern can be seen in hg E2a1-B409, which has lineages from Germany and Sweden and an exclusively Estonian cluster defined by marker Z37869 with a radiation time of ~2 kya (95% CI: 1,150–2,428; Supplementary Table S8).

Hg J is one of the most common haplogroups in Western Asia and in regions surrounding the Mediterranean Sea and thus was initially connected to the dispersion of male farmers from the Fertile Crescent. Phylogenetic studies of hg J have shown surviving ancient sublineages with radiation signs in the Bronze Age [37, 38]. Additionally, hg J2a and an unresolved hg J have been discovered in ancient DNA from hunter-gatherer samples excavated in the Caucasus [39] and Karelia [40]. In southern Europe, the most common hg J subclade is J2-M172, which, however, becomes rare throughout the northern latitudes [4, 16].

Here we reconstruct the phylogenetic tree of hg J2-M172 (Fig. 1b and Supplementary Fig. S6) with 136 individuals. A substantial part of NEE individuals belong to sublineages within hg J2b2-L283 (Fig. 1b) which splits from its neighbouring clade at ~16 kya (95% CI: 11,860–20,018; Supplementary Table S9). Hg J2b2-L283 itself split ~7 kya (95% CI: 5,000–8,912) into two major sublineages J2b2-Z2505 and J2b2-YP29. The latter is an exclusively Estonian cluster encompassing over half of all hg J samples from Estonia (12 of 22) with an expansion time of ~2 kya (95% CI: 1,446–3,027) (Fig. 1b, Supplementary Fig. S6, Supplementary Table S9).

The other major hg J subbranch – J2a-M410 – contains samples from a broad Eurasian background which are distributed in subclades mostly coalescing during the early post-Last Glacial Maximum – a much deeper time estimate than in the neighbouring hg J2b-M12 phylogeny (Supplementary Fig. S6). The lack of information on detailed geographic or ethnic origin hinders any further conclusions regarding the single-origin clusters from the Russian Federation (Supplementary Fig. S6). Based on published research, lineages of hg J2a-M67 are among the most common (~20%) paternal haplogroups of the North Caucasus region [41], whereas in ethnic Russians this haplogroup amounts to less than 2% [5, 6].

Hg Q is frequent in Siberian populations and is carried by over 85% of male Native Americans [16,17,18, 42]. In Europe, the occurrence of hg Q is uneven and the general frequency is low (~0.42%) [42], but hg Q is somewhat more frequent in the populations of Sweden and Norway [3, 4]. It is the most numerous minor haplogroup in both of our Swedish sample sets with frequencies of 2.6% and 4.6% (Supplementary Table S2). In the datasets of Karlsson et al. [4] and Lappalainen et al. [3] the frequency of hg Q fluctuates between 1% and 5% in different regions of Sweden. On the updated phylogenetic tree, Swedish samples fall into two main clusters that separated from each other around the peak of the Last Glacial Maximum ~20 kya (Fig. 2). About a third of the Swedish hg Q samples are defined by marker L804. Hg Q1a-L804 coalesces ~16 kya (95% CI: 12,456–19,874; Supplementary Table S10) with haplogroup Q1a-M3, which today describes the overwhelming majority of Native American Y-chromosomes [42]. The rapid diversification among Swedes in the L804-defined clade began ~3 kya (95% CI: 1,961–3,917; Supplementary Table S10).

Haplogroup G-M201 is common in the Caucasus and the Middle East. Hg G is one of the most prevalent male lineages in Sardinia and Corsica, but displays low frequencies elsewhere in Europe [4, 14, 15]. Hg G splits into two basal lineages – hgs G1 and G2, of which the former occurs infrequently in Western and Central Asia and is almost absent in Europe [15]. Almost all hg G samples from NEE belong to hg G2-P287 that ~22 kya (95% CI: 17,620–26,973) split into two main subclades – G2a-P15 and G2b-M377 (Supplementary Fig. S7, Supplementary Table S11). The bulk of sampled European individuals belong to subclade G2a2-P303 (Supplementary Table S2). Downstream, in hg G2a2-Z727, the absolute majority of Swedish hg G samples forms localised clusters with a variety of coalescence times (Supplementary Fig. S7, Supplementary Table S11).

Discussion

In the case of Estonia, our sequenced samples were collected across the country avoiding large settlements with recorded extensive migration history. Considering a census size of roughly 1 million, rare lineages amount to roughly 30,000 men evenly sampled across the country and thus cannot be exclusively ascribed to any random influx of recent migrants.

From the screened sample of 506 Finnish males we did not detect any rare NEE lineages as almost all Finnish samples belong to hgs common among neighbouring populations – a probable reflection of either differing migration history or of demographic bottleneck(s) that have affected the Finnish population [43, 44].

Hg E sublineages have been associated with Neolithic demic diffusion into Europe [16], but current ancient DNA data have shown this haplogroup to be uncommon among the first agriculturalists in Europe [40]. In the resolved phylogenetic tree, the primarily Middle Eastern neighbouring clade with deeply diverged lineages supports a possible Levantine source of the European hg E2a1-V13. However, the split time predates the Neolithic transition in Europe and matches better with the age of the Villabruna hunter-gatherer cluster that displays earliest autosomal affinities to the Middle Eastern populations detected in ancient samples from Europe [45]. The coalescence age of the primarily European clades of hgs E2a1-V13 and J2b2-Z2505 underpins mid-Holocene as the starting point of chrY variation growth in Western Europe (Fig. 1) and indicates a possible influx of male lineages from the Levant or the Caucasus.

The coalescence ages of Estonia-specific clades J2b2-YP29, E2a1-Z37869 and E2a1-Y28220 broadly correspond to the Late Bronze Age and Iron Age period in Northern Europe (Supplementary Fig. S8). Additional sampling could affect the coalescence age of these clusters. However, the geographical spread across all Estonian counties and current age estimates suggest that these expansions are not the result of any migratory events from the recent recorded (last ~800 years) history of this region. To infer the potential directions of influx of the clades J2b2-YP29, E2a1-Z37869, and E2a1-Y28220, we conducted continuous Bayesian phylogeographic analysis of parent hgs J2b2-L283 and E2a1-CTS1273. The estimated diffusion rate is 0.27 (95% HPD: 0.1992–0.3478) kilometres/year for hg J2b2-L283 and 0.231 (95% HPD: 0.175–0.295) kilometres/year for hg E2a1-CTS1273. The 80% HPD of the putative geographic centre of diffusion for hg J2b2 covers an area centred on present-day Poland, partially covering central and southeastern Europe, and spreading further north and south (Fig. 3a). The area for the hg E2a1 ancestral location similarly covers central and eastern Europe with a focus on Poland (Fig. 3b), but the focal point appears to be more condensed.
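
To give the diffusion rates a sense of scale, the sketch below converts them into the distance covered along a lineage over a chosen time span. This is plain rate-times-time arithmetic on the reported estimates, assuming a constant rate; it describes path length along the tree branches rather than net straight-line displacement, and the function name is hypothetical.

```python
# Reported diffusion rates in km/year: (estimate, lower 95% HPD, upper 95% HPD).
RATES = {
    "J2b2-L283": (0.27, 0.1992, 0.3478),
    "E2a1-CTS1273": (0.231, 0.175, 0.295),
}


def path_length_summary(years):
    """Distance in km covered in `years` at each reported rate.

    Assumes a constant diffusion rate over the whole period, which is
    a simplification of the continuous-space model.
    """
    return {
        hg: tuple(round(rate * years, 1) for rate in rates)
        for hg, rates in RATES.items()
    }
```

For example, over 1,000 years the point estimates correspond to about 270 km for hg J2b2-L283 and about 231 km for hg E2a1-CTS1273.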

From a conservative standpoint, all three subclades most probably arrived in present-day Estonia from the direction of central Europe. However, based on currently available data, it is not possible to say whether the evident local expansions initially began in Estonia or whether the carriers were already sufficiently diversified on arrival.

Within hg Q, clusters defined by L804 and Y4838 capture almost all of Swedish hg Q diversity, marking these lineages as an inherent, albeit scarce, part of the pool of male lineages in Sweden. The scarcity of internal nodes on the branches leading to the two now predominantly Swedish clusters hinders any discussion regarding a potential direction of influx or ancestral centre of diffusion. Due to the glacial coverage, the split between lineages Q1a-L804 and the Native American Q1a-M3 could not have happened in Scandinavia. Ancient DNA research confirms the presence of hg Q in the remains of hunter-gatherers (~8 kya) from Latvia and the Lower Volga Region in Russia [46]. Today, European Q1 lineages are restricted to NEE with occasional findings in other populations (a single L804-derived English chrY sample in Grugni et al. [47]). Precursors to current European hg Q1 sublineages could have been widely present in North Eurasia during the Last Glacial Maximum and followed a primarily northern (Siberian) route of dispersal into Europe. The presumptively common ancient gene pool is reflected in the autosomal European affinity of the 24,000-year-old Mal’ta sample from the vicinity of Lake Baikal [48]. Alternatively, the prevalence of hg Q in Sweden could testify to a more recent Siberian influence deduced both from modern and ancient DNA analysis in northeastern Europe [2, 8]. Studies have demonstrated minor eastern affinities in the autosomes and in the maternal lineages of the modern Saami, but small sample sizes have not revealed Saami male lineages belonging to hg Q [2]. Further sampling across Northern Eurasia might provide additional insights about these peculiar North Eurasian hg Q lineages. Two of the three Estonian hg Q samples form a subset of the Swedish Y4838-defined cluster. It is most parsimonious to assume that the paternal ancestors of the two Y4838-derived individuals arrived in Estonia around 1–2 kya from Scandinavia.

Hg G2a has become firmly associated with the early Neolithic farmers of Europe [40, 46, 49]. Most European hg G2a inner lineages started to diverge around 5–7 kya (Supplementary Fig. S7, Supplementary Table S11) – within the timeframe of the European agricultural transition. In Sweden, it is the second most frequent minor chrY haplogroup. The majority of Swedish carriers demonstrate a strong expansion signal approximated to ~1 kya (nodes 52 and 60 in Supplementary Fig. S7), whereas Estonian samples are not part of the Swedish hg G2a diversity.

In conclusion, we demonstrate that in NEE, rare paternal lineages are not just single lineages scattered across different subclades in the phylogeny. We identified several population-specific clusters among less common haplogroups, which testify to radiation events that have occurred in various timeframes and can be used to tentatively suggest possible influx directions.

This study demonstrates the power of large-scale re-sequencing of Y-chromosomes to explore and compare the male demographic history of single populations. The current survey of rare lineages paves the way for future research involving large datasets of re-sequenced genomes with a focus on those maternal and paternal lineages that have left a major demographic impact on modern populations in NEE and elsewhere.