Language trees with sampled ancestors support a hybrid model for the origin of Indo-European languages

The origins of the Indo-European language family are hotly disputed. Bayesian phylogenetic analyses of core vocabulary have produced conflicting results, with some supporting a farming expansion out of Anatolia ~9000 years before present (yr B.P.), while others support a spread with horse-based pastoralism out of the Pontic-Caspian Steppe ~6000 yr B.P. Here we present an extensive database of Indo-European core vocabulary that eliminates past inconsistencies in cognate coding. Ancestry-enabled phylogenetic analysis of this dataset indicates that few ancient languages are direct ancestors of modern clades and produces a root age of ~8120 yr B.P. for the family. Although this date is not consistent with the Steppe hypothesis, it does not rule out an initial homeland south of the Caucasus, with a subsequent branch northward onto the steppe and then across Europe. We reconcile this hybrid hypothesis with recently published ancient DNA evidence from the steppe and the northern Fertile Crescent.

Editor’s summary
Languages of the Indo-European family are spoken by almost half of the world’s population, but their origins and patterns of spread are disputed. Heggarty et al. present a database of 109 modern and 52 time-calibrated historical Indo-European languages, which they analyzed with models of Bayesian phylogenetic inference. Their results suggest an emergence of Indo-European languages around 8000 years before present. This is a deeper root date than previously thought, and it fits with an initial origin south of the Caucasus followed by a branch northward into the Steppe region. These findings lead to a “hybrid hypothesis” that reconciles current linguistic and ancient DNA evidence from both the eastern Fertile Crescent (as a primary source) and the steppe (as a secondary homeland). —SNV

Indo-European languages emerged south of the Caucasus around 8100 years ago, followed by an expansion northward to the Steppe regions.

INTRODUCTION
Almost half the world’s population speaks a language of the Indo-European language family. It remains unclear, however, where this family’s common ancestral language (Proto-Indo-European) was initially spoken and when and why it spread through Eurasia. The “Steppe” hypothesis posits an expansion out of the Pontic-Caspian Steppe, no earlier than 6500 years before present (yr B.P.), and mostly with horse-based pastoralism from ~5000 yr B.P. An alternative “Anatolian” or “farming” hypothesis posits that Indo-European dispersed with agriculture out of parts of the Fertile Crescent, beginning as early as ~9500 to 8500 yr B.P. Ancient DNA (aDNA) is now bringing valuable new perspectives, but these remain only indirect interpretations of language prehistory. In this study, we tested between the time-depth predictions of the Anatolian and Steppe hypotheses, directly from language data. We report a new framework for the chronology and divergence sequence of Indo-European, using Bayesian phylogenetic methods applied to an extensive new dataset of core vocabulary across 161 Indo-European languages.

RATIONALE
Previous phylolinguistic analyses have produced conflicting results. We diagnosed and resolved the causes of this discrepancy, two in particular. First, the datasets used had limited language sampling and widespread coding inconsistency. Second, some analyses enforced the assumption that modern spoken languages derive directly from ancient written languages rather than from parallel spoken varieties. Together, these methodological problems distorted branch-length estimates and date inferences. We present a new dataset of cognacy (shared word origins) across Indo-European. This dataset eliminates past inconsistencies and provides a fuller and more balanced language sample, including 52 nonmodern languages for a denser set of time-calibration points. We applied ancestry-enabled Bayesian phylogenetic analysis to test rather than enforce direct ancestry assumptions.

RESULTS
Few ancient written languages are returned as direct ancestors of modern clades. We find a median root age for Indo-European of ~8120 yr B.P. (95% highest posterior density: 6740 to 9610 yr B.P.). Our chronology is robust across a range of alternative phylogenetic models and sensitivity analyses that vary data subsets and other parameters. Indo-European had already diverged rapidly into multiple major branches by ~7000 yr B.P., without a coherent non-Anatolian core. Indo-Iranic has no close relationship with Balto-Slavic, weakening the case for it having spread via the steppe.

CONCLUSION
Our results are not entirely consistent with either the Steppe hypothesis or the farming hypothesis. Recent aDNA evidence suggests that the Anatolian branch cannot be sourced to the steppe but rather to south of the Caucasus. For other branches, potential candidate expansion(s) out of the Yamnaya culture are detectable in aDNA, but some had only limited genetic impact. Our results reveal that these expansions from ~5000 yr B.P. onward also came too late for the language chronology of Indo-European divergence. They are consistent, however, with an ultimate homeland south of the Caucasus and a subsequent branch northward onto the steppe, as a secondary homeland for some branches of Indo-European entering Europe with the later Corded Ware–associated expansions. Language phylogenetics and aDNA thus combine to suggest that the resolution to the 200-year-old Indo-European enigma lies in a hybrid of the farming and Steppe hypotheses.

A DensiTree showing the probability distribution of tree topologies for the Indo-European language family. The time axis shows the estimated chronology of the family’s geographical expansion and divergence, calibrated on 52 nonmodern written languages. Annotations add chronological context relative to selected archaeological cultures and expansions of significant ancestry components in the aDNA record. CHG, Caucasus hunter-gatherers; EHG, Eastern (European) hunter-gatherers; BMAC, Bactria-Margiana Archaeological Complex.
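
The root age above is reported as a median with a 95% highest posterior density (HPD) interval. As a rough illustration of what that summary means, the following Python sketch computes the shortest interval containing 95% of a set of posterior samples. The function name `hpd_interval`, the sample size, and the gamma-distributed samples are all assumptions made for illustration; the samples are synthetic and are not the study's posterior, and this is not necessarily how the authors' software computes the interval.

```python
import math
import random
import statistics


def hpd_interval(samples, mass=0.95):
    """Return the shortest interval that contains `mass` of the samples.

    Hypothetical helper for illustration: a simple sliding-window
    estimate that is appropriate for a unimodal posterior.
    """
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    n = len(ordered)
    # Number of consecutive sorted samples the interval must cover.
    k = max(1, math.ceil(mass * n))
    best_lo, best_hi = ordered[0], ordered[k - 1]
    # Slide a window of k samples and keep the narrowest one.
    for i in range(1, n - k + 1):
        lo, hi = ordered[i], ordered[i + k - 1]
        if hi - lo < best_hi - best_lo:
            best_lo, best_hi = lo, hi
    return best_lo, best_hi


# Synthetic, right-skewed stand-in for posterior root-age samples (yr B.P.).
# These values are illustrative only and are NOT the study's results.
rng = random.Random(42)
synthetic_ages = [6000 + rng.gammavariate(4.0, 550.0) for _ in range(5000)]

low, high = hpd_interval(synthetic_ages, mass=0.95)
print(f"median: {statistics.median(synthetic_ages):.0f} yr B.P.")
print(f"95% HPD: {low:.0f} to {high:.0f} yr B.P.")
```

For a skewed posterior, the HPD interval is not symmetric around the median, which is why a reported interval such as 6740 to 9610 yr B.P. can extend further on one side of the median than on the other.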
