Language-tree divergence times support the Anatolian theory of Indo-European origin

Gray, Russell D.; Atkinson, Quentin D.

doi:10.1038/nature02029

Letter
Published: 27 November 2003

Language-tree divergence times support the Anatolian theory of Indo-European origin

Russell D. Gray¹ &
Quentin D. Atkinson¹

Nature volume 426, pages 435–439 (2003)Cite this article

10k Accesses
491 Citations
107 Altmetric
Metrics details

Abstract

Languages, like genes, provide vital clues about human history^1,2. The origin of the Indo-European language family is “the most intensively studied, yet still most recalcitrant, problem of historical linguistics”³. Numerous genetic studies of Indo-European origins have also produced inconclusive results^4,5,6. Here we analyse linguistic data using computational methods derived from evolutionary biology. We test two theories of Indo-European origin: the ‘Kurgan expansion’ and the ‘Anatolian farming’ hypotheses. The Kurgan theory centres on possible archaeological evidence for an expansion into Europe and the Near East by Kurgan horsemen beginning in the sixth millennium BP^7,8. In contrast, the Anatolian theory claims that Indo-European languages expanded with the spread of agriculture from Anatolia around 8,000–9,500 years bp⁹. In striking agreement with the Anatolian hypothesis, our analysis of a matrix of 87 languages with 2,449 lexical items produced an estimated age range for the initial Indo-European divergence of between 7,800 and 9,800 years bp. These results were robust to changes in coding procedures, calibration points, rooting of the trees and priors in the bayesian analysis.

You have full access to this article via your institution.

Download PDF

Phylogenomics and the rise of the angiosperms

Article Open access 24 April 2024

Hybrid speciation driven by multilocus introgression of ecological traits

Article Open access 17 April 2024

Network of large pedigrees reveals social practices of Avar communities

Article Open access 24 April 2024

Main

Historical linguists traditionally use the ‘comparative method’ to construct language family trees from discrete lexical, morphological and phonological data. Unfortunately, although the comparative method can provide a relative chronology, it cannot provide absolute date estimates. An alternative method of analysis is glottochronology. This derivative of lexicostatistics is a distance-based approach to language-tree construction that enables absolute dates to be estimated¹⁰. Glottochronology uses the percentage of shared ‘cognates’ between languages to calculate divergence times by assuming a constant rate of lexical replacement or ‘glottoclock’. Cognates are words inferred to have a common historical origin because of systematic sound correspondences and clear similarities in form and meaning. Despite some initial enthusiasm, the method has been heavily criticized and is now largely discredited^11,12. Criticisms of glottochronology, and distance-based methods in general, tend to fall into four main categories: first, by summarizing cognate data into percentage scores, much of the information in the discrete character data is lost, greatly reducing the power of the method to reconstruct evolutionary history accurately¹³; second, the clustering methods used tend to produce inaccurate trees when lineages evolve at different rates, grouping together languages that evolve slowly rather than languages that share a recent common ancestor^12,14; third, substantial borrowing of lexical items between languages makes tree-based methods inappropriate; and fourth, the assumption of a strict glottoclock rarely holds, making date estimates unreliable¹¹. For these reasons, historical linguists have generally abandoned efforts to estimate absolute ages. Dixon¹⁵ epitomizes this view with his assertion that, on the basis of linguistic data, the age of Indo-European “could be anything—4,000 years bp or 40,000 years bp are both perfectly possible (as is any date in between)”.

Recent advances in computational phylogenetic methods, however, provide possible solutions to the four main problems faced by glottochronology. First, the problem of information loss that comes from converting discrete characters into distances can be overcome by analysing the discrete characters themselves to find the optimal tree(s). Second, the accuracy of tree topology and branch-length estimation can be improved by using models of evolution. Maximum-likelihood methods generally outperform distance and parsimony approaches in situations where there are unequal rates of change¹⁴. Moreover, uncertainty in the estimation of tree topology, branch lengths and parameters of the evolutionary model can be estimated using bayesian Markov chain Monte Carlo¹⁶ (MCMC) methods in which the frequency distribution of the sample approximates the posterior probability distribution of the trees¹⁷. All subsequent analyses can then incorporate this uncertainty. Third, lexical items that are obvious borrowings can be removed from the analysis, and computational methods such as split decomposition¹⁸, which do not force the data to fit a tree model, can be used to check for non-tree-like signals in the data. Finally, the assumption of a strict clock can be relaxed by using rate-smoothing algorithms to model rate variation across the tree. The penalized-likelihood¹⁹ model allows rate variation between lineages while incorporating a ‘roughness penalty’ that penalizes changes in rate from branch to branch. This smoothes inferred rate variation across the tree so that the age of any node can be estimated even under conditions of rate heterogeneity.

We applied likelihood models of lexical evolution, bayesian inference of phylogeny and rate-smoothing algorithms to a matrix of 87 Indo-European languages with 2,449 cognate sets coded as discrete binary characters. This coding was based on the Indo-European database of Dyen et al.²⁰, with the addition of three extinct languages. Examining subsets of languages using split decomposition revealed a strong tree-like signal in the data, and a preliminary parsimony analysis produced a consistency index of 0.48 and a retention index of 0.76, well above what would be expected from biological data sets of a similar size²¹. The consensus tree from an initial analysis is shown in Fig. 1a. The topology of the tree is consistent with the traditional Indo-European language groups²². All of these groups are monophyletic and supported by high posterior probability values. Recent parsimony and compatibility analyses have also supported these groupings, as well as a Romano-Germano-Celtic supergroup, the early divergence of Greek and Armenian lineages²³, and the basal position of Tocharian²⁴. The consensus tree also reflects traditional uncertainties in the relationships between the major Indo-European language groups. For instance, historical linguists have not resolved the position of the Albanian group and our results clearly reflect this uncertainty (the posterior probability of the Albanian/Indo-Iranian group is only 0.36).

**Figure 1: Consensus tree and divergence-time estimates.**

One important advantage of the bayesian MCMC approach is that any inferences are not contingent on a specific tree topology. Trees are sampled in proportion to their posterior probability, providing a direct measure of uncertainty in the tree topology and branch-length estimates. By estimating divergence times across the MCMC sample distribution of trees, we can explicitly account for variability in the age estimates due to phylogenetic uncertainty, and hence calculate a confidence interval for the age of any node. We estimated divergence times by constraining the age of 14 nodes on each tree in accordance with historically attested events (see Supplementary Information). We then used penalized-likelihood rate smoothing to calculate divergence times without the assumption of rate constancy¹⁹. Another advantage of the bayesian framework is that prior knowledge of language relationships can be incorporated into the analysis. To ensure that the sample was consistent with well-established linguistic relationships, we filtered the 10,000-tree sample using a constraint tree (Fig. 1b). We used the resulting distribution of 3,500 estimates of basal divergence times to create a confidence interval for the age of the Indo-European language family (Fig. 1b).

A key part of any bayesian phylogenetic analysis is an assessment of the robustness of the inferences. One important potential cause of error is cognacy judgements. In the initial analysis, we included all cognate sets in the Dyen et al. database²⁰ in an effort to maximize phylogenetic signal. To assess the impact of different levels of stringency in the cognacy judgements, we repeated the analysis after removing all cognate sets identified by Dyen et al. as ‘doubtful’. ‘Doubtful cognates’ (for instance, possible chance similarities) could falsely increase similarities between languages and thus lead to an underestimate of the divergence times. Unrecognized borrowing between closely related languages would have a similar effect. Conversely, borrowing between distantly related languages will falsely inflate branch lengths at the base of the tree and thus increase divergence-time estimates. With the doubtful cognates removed, the conservative coding led to a similar estimate of Indo-European language relationships to that produced using the original coding. The relationships within each of the 11 main groups were unchanged. Only the placement of the weakly supported basal branches differed (Fig. 1c). More significantly, the divergence-time estimates increased, suggesting that the effects of chance similarities and unrecognized borrowings between closely related languages might have outweighed those of borrowings between distantly related languages. In other words, our initial analysis is likely to have underestimated the age of Indo-European.

The constraint tree used to filter the MCMC sample of trees also contained assumptions about Indo-European history that might have biased the results. We therefore repeated the analyses using a more relaxed set of constraints (Fig. 1d). This produced a divergence-time distribution and consensus tree almost identical to the original sample distribution (Fig. 1d).

Another potential bias lay in the initial coding procedure, which made no allowance for missing cognate information. The languages at the base of the tree (Hittite, Tocharian A and Tocharian B) may appear to lack cognates found in other languages because our knowledge of these extinct languages is limited to reconstructions from ancient texts. This uneven sampling might have increased basal branch lengths and thus inflated estimates of divergence times. We tested this possibility by recoding apparently absent cognates as uncertainties (absent or present) and re-running the analyses. Although divergence-time estimates decreased slightly, the effect was only small (Fig. 1e).

Finally, although there is considerable support for Hittite (an extinct Anatolian language) as the most appropriate root for Indo-European^22,23, rooting the tree with Hittite could be claimed to bias the analysis in favour of the Anatolian hypothesis. We thus re-ran the analysis using the consensus tree in Fig. 1 rooted with Balto-Slavic, Greek and Indo-Iranian as outgroups. This increased the estimated divergence time from 8,700 years bp to 9,600, 9,400 and 10,100 years bp, respectively.

The pattern and timing of expansion suggested by the four analyses in Fig. 1 is consistent with the Anatolian farming theory of Indo-European origin. Radiocarbon analysis of the earliest Neolithic sites across Europe suggests that agriculture arrived in Greece at some time during the ninth millennium bp and had reached as far as Scotland by 5,500 years bp²⁵. Figure 1 shows the Hittite lineage diverging from Proto-Indo-European around 8,700 years bp, perhaps reflecting the initial migration out of Anatolia. Tocharian, and the Greco-Armenian lineages are shown as distinct by 7,000 years bp, with all other major groups formed by 5,000 years bp. This scenario is consistent with recent genetic studies supporting a Neolithic, Near Eastern contribution to the European gene pool^4,6. The consensus tree also shows evidence of a period of rapid divergence giving rise to the Italic, Celtic, Balto-Slavic and perhaps Indo-Iranian families that is intriguingly close to the time suggested for a possible Kurgan expansion. Thus, as observed by Cavalli-Sforza et al.²⁶, these hypotheses need not be mutually exclusive.

Phylogenetic methods have revolutionized evolutionary biology over the past 20 years and are now starting to take hold in other areas of historical inference^{2,23,24,27,28,29}. The model-based bayesian framework used in this paper offers several advantages over previous applications of computational methods to language phylogenies. This approach allowed us to: identify sections in the language tree that were poorly supported; explicitly incorporate this uncertainty in tree typology and branch-length estimates in our analysis; test the possible effects of borrowing, chance similarities and bayesian priors on our analysis; and estimate divergence times without the assumption of a strict glottoclock. The challenge of making accurate inferences about human history is an extremely demanding one, requiring the integration of archaeological, genetic, cultural and linguistic data. The combination of computational phylogenetic methods and lexical data to test archaeological hypotheses is a step forward in this challenging and fascinating task.

Methods

Data and coding

Data were sourced from the comparative Indo-European database created by Dyen et al.²⁰. The database records word forms and cognacy judgements in 95 languages across the 200 items in the Swadesh word list. This list consists of items of basic vocabulary such as pronouns, numerals and body parts that are known to be relatively resistant to borrowing. For example, although English is a Germanic language, it has borrowed around 50% of its total lexicon from French and Latin. However, only about 5% of English entries in the Swadesh 200-word list are clear Romance language borrowings¹. Where borrowings were obvious, Dyen et al. did not score them as cognate and thus they were excluded from our analysis; 11 of the speech varieties that were not coded by Dyen et al. were also excluded. To facilitate reconstruction of some of the oldest language relationships, we added three extinct Indo-European languages, thought to fit near the base of the tree (Hittite, Tocharian A and Tocharian B). Word form and cognacy judgements for all three languages were made on the basis of multiple sources to ensure reliability. The presence or absence of words from each cognate set was coded as ‘1’ or ‘0’, respectively, to produce a binary matrix of 2,449 cognates in 87 languages.

Tree construction

Language trees were constructed using a ‘restriction site’ model of evolution that allows unequal character-state frequencies and gamma-distributed character-specific rate heterogeneity (MrBayes version 2.01; ref. 30). We used default ‘flat’ priors for the rate matrix, branch lengths, gamma shape parameter and site-specific rates. The results were found to be robust to changes in these priors. For example, repeating the analyses with an exponential branch-length prior produced a 95% confidence interval for the basal divergence time of between 7,100 and 9,200 years BP.

The program was run ten times using four concurrent Markov chains. Each run generated 1,300,000 trees from a random starting phylogeny. On the basis of an autocorrelation analysis, only every 10,000th tree was sampled to ensure that consecutive samples were independent. A ‘burn-in’ period of 300,000 trees for each run was used to avoid sampling trees before the run had reached convergence. Log-likelihood plots and an examination of the post-burn-in tree topologies showed that the runs had indeed reached convergence by this time. For each analysis a total of 1,000 trees were sampled and rooted with Hittite. The branch between Hittite and the rest of the tree was split at the root such that half its length was assigned to the Hittite branch and half to the remainder of the tree; divergence-time estimates were found to be robust to threefold alterations of this allocation.

Divergence-time estimates

Eleven nodes corresponding to the points of initial divergence in all of the major language subfamilies were given minimum and/or maximum ages on the basis of known historical information (see Supplementary Information). The ages of all terminal nodes on the tree, representing languages spoken today, were set to zero by default. Hittite and the Tocharic languages were constrained in accordance with estimated ages of the source texts. Relatively broad date ranges were chosen to avoid making disputable, a priori assumptions about Indo-European history. A likelihood ratio test with the extinct languages removed revealed that rates were significantly non-clock-like (χ² = 787.3, d.f. = 82, P < 0.001). Divergence-time estimates were thus made using the semi-parametric, penalized-likelihood model of rate variation implemented in R8s (version 1.50)¹⁹. The cross-validation procedure was applied to the majority-rule consensus tree (Fig. 1) to determine the optimal value of the rate-smoothing parameter. Step-by-step removal of each of the 14 age constraints on the consensus tree revealed that divergence-time estimates were robust to calibration errors. For 13 nodes, the reconstructed age was within 390 years of the original constraint range. Only the reconstructed age for Hittite showed an appreciable variation from the constraint range. This may be attributable to the effect of missing data associated with extinct languages. Reconstructed ages at the base of the tree ranged from 10,400 years bp with the removal of the Hittite age constraint, to 8,500 years bp with the removal of the Iranian group age constraint.

References

Pagel, M. in Time Depth in Historical Linguistics (eds Renfrew, C., McMahon, A. & Trask, L.) 189–207 (The McDonald Institute for Archaeological Research, Cambridge, UK, 2000)
Google Scholar
Gray, R. D. & Jordan, F. M. Language trees support the express-train sequence of Austronesian expansion. Nature 405, 1052–1055 (2000)
Article ADS CAS Google Scholar
Diamond, J. & Bellwood, P. Farmers and their languages: the first expansions. Science 300, 597–603 (2003)
Article ADS CAS Google Scholar
Richards, M. et al. Tracing European founder lineage in the Near Eastern mtDNA pool. Am. J. Hum. Genet. 67, 1251–1276 (2000)
Article CAS Google Scholar
Semoni, O. et al. The genetic legacy of Paleolithic Homo sapiens in extant Europeans: a Y chromosome perspective. Science 290, 1155–1159 (2000)
Article ADS Google Scholar
Chikhi, L., Nichols, R. A., Barbujani, G. & Beaumont, M. A. Y genetic data support the Neolithic Demic Diffusion Model. Proc. Natl Acad. Sci. USA 99, 11008–11013 (2002)
Article ADS CAS Google Scholar
Gimbutas, M. The beginning of the Bronze Age in Europe and the Indo-Europeans 3500–2500 B.C. J. Indo-Eur. Stud. 1, 163–214 (1973)
Google Scholar
Mallory, J. P. Search of the Indo-Europeans: Languages, Archaeology and Myth (Thames & Hudson, London, 1989)
Google Scholar
Renfrew, C. in Time Depth in Historical Linguistics (eds Renfrew, C., McMahon, A. & Trask, L.) 413–439 (The McDonald Institute for Archaeological Research, Cambridge, UK, 2000)
Google Scholar
Swadesh, M. Lexico-statistic dating of prehistoric ethnic contacts. Proc. Am. Phil. Soc. 96, 453–463 (1952)
Google Scholar
Bergsland, K. & Vogt, H. On the validity of glottochronology. Curr. Anthropol. 3, 115–153 (1962)
Article Google Scholar
Blust, R. in Time Depth in Historical Linguistics (eds Renfrew, C., McMahon, A. & Trask, L.) 311–332 (The McDonald Institute for Archaeological Research, Cambridge, UK, 2000)
Google Scholar
Steel, M. A., Hendy, M. D. & Penny, D. Loss of information in genetic distances. Nature 333, 494–495 (1988)
Article ADS Google Scholar
Swofford, D. L., Olsen, G. J., Waddell, P. J. & Hillis, D. M. in Molecular Systematics (eds Hillis, D., Moritz, C. & Mable, B. K.) 407–514 (Sinauer Associates, Inc, Sunderland, Massachusetts, 1996)
Google Scholar
Dixon, R. M. W. The Rise and Fall of Language (Cambridge Univ. Press, Cambridge, UK, 1997)
Book Google Scholar
Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H. & Teller, E. Equations of state calculations by fast computing machines. J. Chem. Phys. 21, 1087–1091 (1953)
Article ADS CAS Google Scholar
Huelsenbeck, J. P., Ronquist, F., Nielsen, R. & Bollback, J. P. Bayesian inference of phylogeny and its impact on evolutionary biology. Science 294, 2310–2314 (2001)
Article ADS CAS Google Scholar
Huson, D. H. SplitsTree: analyzing and visualizing evolutionary data. Bioinformatics 14, 68–73 (1998)
Article CAS Google Scholar
Sanderson, M. R8s, Analysis of Rates of Evolution,Version 1.50 (Univ. California, Davis, 2002)
Google Scholar
Dyen, I., Kruskal, J. B. & Black, P. FILE IE-DATA1. Available at 〈http://www.ntu.edu.au/education/langs/ielex/IE-DATA1〉 (1997).
Sanderson, M. J. & Donoghue, M. J. Patterns of variation in levels of homoplasy. Evolution 43, 1781–1795 (1989)
Article Google Scholar
Gamkrelidze, T. V. & Ivanov, V. V. Trends in Linguistics 80: Indo-European and the Indo-Europeans (Mouton de Gruyter, Berlin, 1995)
Book Google Scholar
Rexova, K., Frynta, D. & Zrzavy, J. Cladistic analysis of languages: Indo-European classification based on lexicostatistical data. Cladistics 19, 120–127 (2003)
Article Google Scholar
Ringe, D., Warnow, T. & Taylor, A. IndoEuropean and computational cladistics. Trans. Philol. Soc. 100, 59–129 (2002)
Article Google Scholar
Gkiasta, M., Russell, T., Shennan, S. & Steele, J. Neolithic transition in Europe: the radiocarbon record revisited. Antiquity 77, 45–62 (2003)
Article Google Scholar
Cavalli-Sforza, L. L., Menozzi, P. & Piazza, A. The History and Geography of Human Genes (Princeton Univ. Press, Princeton, 1994)
MATH Google Scholar
Holden, C. J. Bantu language trees reflect the spread of farming across sub-Saharan Africa: a maximum-parsimony analysis. Proc. R. Soc. Lond. B 269, 793–799 (2002)
Article Google Scholar
Barbrook, A. C., Howe, C. J., Blake, N. & Robinson, P. The phylogeny of The Canterbury Tales. Nature 394, 839 (1998)
Article ADS CAS Google Scholar
McMahon, A. & McMahon, R. Finding families: Quantitative methods in language classification. Trans. Philol. Soc. 101, 7–55 (2003)
Article Google Scholar
Huelsenbeck, J. P. & Ronquist, F. MRBAYES: Bayesian inference of phylogeny. Bioinformatics 17, 754–755 (2001)
Article CAS Google Scholar

Download references

Acknowledgements

We thank S. Allan, L. Campbell, L. Chikhi, M. Corballis, N. Gavey, S. Greenhill, J. Hamm, J. Huelsenbeck, G. Nichols, A. Rodrigo, F. Ronquist, M. Sanderson and S. Shennan for useful advice and/or comments on the manuscript.

Author information

Authors and Affiliations

Department of Psychology, University of Auckland, Private Bag 92019, 1020, Auckland, New Zealand
Russell D. Gray & Quentin D. Atkinson

Authors

Russell D. Gray
View author publications
You can also search for this author in PubMed Google Scholar
Quentin D. Atkinson
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Russell D. Gray.

Ethics declarations

Competing interests

The authors declare that they have no competing financial interests.

Supplementary information

Supplementary Table (DOC 35 kb)

Rights and permissions

Reprints and permissions

About this article

Cite this article

Gray, R., Atkinson, Q. Language-tree divergence times support the Anatolian theory of Indo-European origin. Nature 426, 435–439 (2003). https://doi.org/10.1038/nature02029

Download citation

Received: 18 July 2003
Accepted: 22 August 2003
Issue Date: 27 November 2003
DOI: https://doi.org/10.1038/nature02029

This article is cited by

Inferring language dispersal patterns with velocity field estimation
- Sizhe Yang
- Xiaoru Sun
- Menghan Zhang
Nature Communications (2024)
Machine culture
- Levin Brinkmann
- Fabian Baumann
- Iyad Rahwan
Nature Human Behaviour (2023)
Reliability models in cultural phylogenetics
- Rafael Ventura
Biology & Philosophy (2023)
Valence-dependent mutation in lexical evolution
- Joshua Conrad Jackson
- Kristen Lindquist
- Joseph Watts
Nature Human Behaviour (2022)
An evolutionary view of institutional complexity
- Victor Zitian Chen
- John Cantwell
Journal of Evolutionary Economics (2022)

Comments

By submitting a comment you agree to abide by our Terms and Community Guidelines. If you find something abusive or that does not comply with our terms or guidelines please flag it as inappropriate.

Language-tree divergence times support the Anatolian theory of Indo-European origin

Abstract

Similar content being viewed by others

Phylogenomics and the rise of the angiosperms

Hybrid speciation driven by multilocus introgression of ecological traits

Network of large pedigrees reveals social practices of Avar communities

Main

Methods

Data and coding

Tree construction

Divergence-time estimates

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Ethics declarations

Competing interests

Supplementary information

Supplementary Table (DOC 35 kb)

Rights and permissions

About this article

Cite this article

This article is cited by

Inferring language dispersal patterns with velocity field estimation

Machine culture

Reliability models in cultural phylogenetics

Valence-dependent mutation in lexical evolution

An evolutionary view of institutional complexity

Comments

Trees of life and of language

Search

Quick links

Abstract

Similar content being viewed by others

Main

Methods

Data and coding

Tree construction

Divergence-time estimates

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Ethics declarations

Competing interests

Supplementary information

Rights and permissions

About this article

Cite this article

Share this article

This article is cited by

Comments

Search

Quick links