An experimental study comparing linguistic phylogenetic reconstruction methods

This paper reports a simulation study that evaluates the performance of different linguistic phylogeny reconstruction methods on model datasets for which the true trees are known. UPGMA was the least accurate, followed in ascending order of accuracy by neighbor joining, the method of Gray & Atkinson, and maximum parsimony. Weighting characters greatly improves the accuracy of maximum parsimony and maximum compatibility, provided that the characters with high weights exhibit low homoplasy.
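The effect of character weighting on maximum parsimony can be illustrated with a small sketch. It is not the study's implementation: it assumes binary characters, rooted binary trees written as nested tuples, and Fitch's algorithm for counting state changes. The names `fitch_changes`, `weighted_parsimony`, `TRUE_TREE`, `RIVAL_TREE` and the toy data are all hypothetical.

```python
# Hypothetical toy example: weighted maximum parsimony scoring with Fitch's algorithm.
# Trees are nested 2-tuples; leaves are language names (strings).

def fitch_changes(tree, states):
    """Minimum number of state changes for one character on a rooted binary tree."""
    changes = 0

    def visit(node):
        nonlocal changes
        if isinstance(node, str):          # leaf: its observed state
            return {states[node]}
        left, right = node
        a, b = visit(left), visit(right)
        common = a & b
        if common:                         # children agree: no change needed here
            return common
        changes += 1                       # children disagree: one change on this node
        return a | b

    visit(tree)
    return changes


def weighted_parsimony(tree, characters, weights):
    """Sum of per-character change counts, each multiplied by its weight."""
    return sum(w * fitch_changes(tree, c) for c, w in zip(characters, weights))


# Assumed "true" tree and one competing tree on four languages.
TRUE_TREE = (("A", "B"), ("C", "D"))
RIVAL_TREE = (("A", "C"), ("B", "D"))

# Character 0 fits the true tree (no homoplasy on it); characters 1 and 2
# are homoplastic on the true tree and happen to favor the rival tree.
CHARACTERS = [
    {"A": 0, "B": 0, "C": 1, "D": 1},
    {"A": 0, "B": 1, "C": 0, "D": 1},
    {"A": 0, "B": 1, "C": 0, "D": 1},
]

for weights in ([1, 1, 1], [3, 1, 1]):
    true_score = weighted_parsimony(TRUE_TREE, CHARACTERS, weights)
    rival_score = weighted_parsimony(RIVAL_TREE, CHARACTERS, weights)
    print(weights, "true tree:", true_score, "rival tree:", rival_score)
```

With equal weights the rival tree has the lower (better) score in this toy data, whereas giving the low-homoplasy character a higher weight makes the assumed true tree the better-scoring one. This mirrors the condition stated in the abstract: weighting helps when the highly weighted characters are the ones with low homoplasy.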
