Inferring phylogenetic graphs of natural languages using minimum message length

We extend phylogenetic (or evolutionary) trees to phylogenetic graphs. Unlike a phylogenetic tree, a phylogenetic graph can model evolution in which a child node inherits from more than one parent node. Minimum Message Length (MML) (Wallace and Boulton 1968; Wallace 2005) is an inductive inference method that scores a model by the length of a two-part message encoding the model and then the data given that model; the shorter the message, the better the model. We use MML to infer phylogenetic graphs, including the mutation probabilities along their arcs. We introduce the use of MML to infer phylogenetic graphs for artificial languages and for three European languages (English, French and Spanish). Our modelling assumes only copy and change operations on characters, and is restricted to words that have the same length in all of the natural languages considered.
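To make the copy/change model concrete, the following is a minimal sketch of how one might cost a single parent-to-child arc in bits and compare it with sending the child word with no parent at all. It is not the paper's coding scheme: the function names, the 26-letter alphabet, and the choice of a simple combinatorial code (state the number of changes, then their positions, then the replacement characters) are all assumptions made for illustration.

```python
from math import comb, log2

ALPHABET_SIZE = 26  # assumption: lowercase Latin letters only


def arc_message_length(parent: str, child: str,
                       alphabet_size: int = ALPHABET_SIZE) -> float:
    """Bits needed to send `child` given `parent` under copy/change only.

    Hypothetical coding scheme (not taken from the paper):
      1. state the number of changed positions k, uniformly over 0..n;
      2. state which k of the n positions changed;
      3. state each replacement character, one of (alphabet_size - 1).
    """
    if len(parent) != len(child):
        raise ValueError("copy/change model needs words of equal length")
    n = len(child)
    k = sum(p != c for p, c in zip(parent, child))  # number of changes
    bits_for_count = log2(n + 1)
    bits_for_positions = log2(comb(n, k))
    bits_for_new_chars = k * log2(alphabet_size - 1)
    return bits_for_count + bits_for_positions + bits_for_new_chars


def null_message_length(child: str,
                        alphabet_size: int = ALPHABET_SIZE) -> float:
    """Bits needed to send `child` with no parent: uniform code per character."""
    return len(child) * log2(alphabet_size)


def arc_is_worthwhile(parent: str, child: str) -> bool:
    """An arc is favoured when it shortens the message for the child word."""
    return arc_message_length(parent, child) < null_message_length(child)
```

Under this sketch, an arc between two words that share most characters yields a shorter message than the parentless encoding, while an arc between unrelated words costs more than it saves. A full graph inference would also have to pay for stating the graph structure and would sum the costs over all words, which this example omits.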

[1] Ray J. Solomonoff et al. A Formal Theory of Inductive Inference. Part I, 1964, Inf. Control.

[2] David L. Dowe et al. Inferring Phylogenetic Graphs for Natural Languages using MML, 2005.

[3] Bin Ma et al. Chain letters & evolutionary histories, 2003, Scientific American.

[4] Peter Grünwald et al. Invited review of the book Statistical and Inductive Inference by Minimum Message Length, 2006.

[5] David L. Dowe et al. Minimum message length and generalized Bayesian nets with asymmetric languages, 2005.

[6] C. S. Wallace et al. An Information Measure for Classification, 1968, Comput. J.

[7] Lloyd Allison et al. Minimum message length encoding, evolutionary trees and multiple-alignment, 1992, Proceedings of the Twenty-Fifth Hawaii International Conference on System Sciences.

[8] David L. Dowe et al. MML Inference of Oblique Decision Trees, 2004, Australian Conference on Artificial Intelligence.

[9] David L. Dowe et al. Message Length as an Effective Ockham's Razor in Decision Tree Induction, 2001, International Conference on Artificial Intelligence and Statistics.

[10] Abraham Lempel et al. A universal algorithm for sequential data compression, 1977, IEEE Trans. Inf. Theory.

[11] Vittorio Loreto et al. Language trees and zipping, 2002, Physical Review Letters.

[12] David L. Dowe et al. General Bayesian networks and asymmetric languages, 2003.

[13] Gregory J. Chaitin et al. On the Length of Programs for Computing Finite Binary Sequences, 1966, JACM.

[14] J. Rissanen et al. Modeling By Shortest Data Description, 1978, Autom.

[15] David L. Dowe et al. MML clustering of multi-state, Poisson, von Mises circular and Gaussian distributions, 2000, Stat. Comput.

[16] Ray J. Solomonoff et al. A Formal Theory of Inductive Inference. Part II, 1964, Inf. Control.

[17] C. S. Wallace et al. Estimation and Inference by Compact Coding, 1987.

[18] David L. Dowe et al. MML Inference of Decision Graphs with Multi-way Joins and Dynamic Attributes, 2003, Australian Conference on Artificial Intelligence.

[19] David L. Dowe et al. Refinements of MDL and MML Coding, 1999, Comput. J.

[20] David L. Dowe et al. MML Inference of Decision Graphs with Multi-way Joins and Dynamic Attributes, 2002, Australian Conference on Artificial Intelligence.

[21] David L. Dowe et al. Minimum Message Length and Kolmogorov Complexity, 1999, Comput. J.

[22] A. Kolmogorov. Three approaches to the quantitative definition of information, 1968.