Inferring Phylogenetic Graphs for Natural Languages using MML

Languages, like everything around us, evolve and change over time. This report aims to model the evolution that occurs between natural languages. We introduce the idea of inferring phylogenetic (or evolutionary) models for natural languages using the Minimum Message Length (MML) principle, an inductive inference method that measures the goodness of a model. Phylogenetic models show the evolutionary interrelationships among species or other entities. In a phylogenetic tree, each child node has exactly one parent node, so each child language may inherit from only one parent language. In the real world, such a situation is unlikely to occur. We therefore extend phylogenetic trees to phylogenetic graphs, in which a child node can inherit features from more than one parent node, to model the fact that a language can be influenced by more than one other language. We use MML to infer phylogenetic graphs (including mutation probabilities along arcs) for artificial languages as well as for some European languages (English, French, Spanish and German). The first part of our modelling assumes only copy and change operations on characters and is based on words that have the same length in all natural languages considered; the subsequent part uses string alignment techniques to model words of different lengths and allows copy, change, insert and delete operations on characters. All methods have been verified by testing them on artificial languages for which the evolutionary order is known, and the phylogenetic model inferred by MML reflects the correct evolutionary order.
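
As a rough illustration of the copy/change model for equal-length words, the sketch below computes how many bits it takes to transmit a child word given a parent word, and compares that with sending the child word with no parent at all. It is a minimal sketch and not the report's actual encoding: the function names, the 26-letter alphabet, the fixed change probability and the example words are all assumptions made for illustration. It covers only the data-given-model part of a two-part message, so the cost of stating the graph structure and the mutation probabilities themselves is omitted.

```python
import math


def null_length(word, alphabet_size=26):
    """Bits to send a word with no parent: each character is uniform
    over the alphabet (illustrative assumption)."""
    return len(word) * math.log2(alphabet_size)


def copy_change_length(parent, child, p_change, alphabet_size=26):
    """Bits to send `child` given `parent` under a copy/change model.

    Each character is copied with probability 1 - p_change, or changed
    with probability p_change to one of the other alphabet_size - 1
    characters, chosen uniformly (illustrative assumption).
    """
    if len(parent) != len(child):
        raise ValueError("copy/change model needs equal-length words")
    if not 0.0 < p_change < 1.0:
        raise ValueError("p_change must lie strictly between 0 and 1")
    bits = 0.0
    for a, b in zip(parent, child):
        if a == b:
            bits += -math.log2(1.0 - p_change)   # copy
        else:
            bits += -math.log2(p_change)         # signal a change
            bits += math.log2(alphabet_size - 1)  # say which character
    return bits


# Hypothetical word pair: one changed character out of four.
with_parent = copy_change_length("hand", "land", p_change=0.1)
without_parent = null_length("land")
print(f"given parent: {with_parent:.2f} bits, no parent: {without_parent:.2f} bits")
```

When the child word is cheaper to send given the parent than on its own, the arc from parent to child is supported by the data; a full MML comparison would also charge for stating the arc and its mutation probability.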
