A Compression-Based Method for Stemmatic Analysis

Stemmatology studies the relations among variants of a text that has been gradually altered through repeated, imperfect copying. Its applications are mainly in the humanities, especially textual criticism, but the methods can be used to study the evolution of any symbolic objects, including chain letters and computer viruses. We propose an algorithm for stemmatic analysis based on a minimum-information criterion and stochastic tree optimization. Our approach is related to phylogenetic reconstruction criteria such as maximum parsimony and maximum likelihood, and builds upon algorithmic techniques developed for bioinformatics. Unlike many earlier methods, the proposed method does not require significant preprocessing of the data but operates directly on aligned text files. We demonstrate our method on a real-world experiment involving all 52 known variants of the legend of St. Henry of Finland, and provide the first computer-generated family tree of the legend. The obtained tree of the variants is supported to a large extent by results obtained with more traditional methods, and identifies a number of previously unrecognized relations.
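To make the idea of a minimum-information criterion concrete, the following is a minimal sketch, not the method of this paper. It assumes that the information needed to describe one variant given another can be approximated with an off-the-shelf compressor (here Python's `zlib`) as C(source + target) - C(source), and that a candidate tree is scored by summing this quantity over its parent-to-child edges. The function names `compressed_size`, `conditional_cost`, and `tree_cost`, and the toy texts, are hypothetical and chosen only for illustration.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed data (a crude complexity proxy)."""
    return len(zlib.compress(data, 9))


def conditional_cost(source: str, target: str) -> int:
    """Approximate the information needed to write `target` given `source`.

    Assumption for this sketch: C(target | source) ~ C(source + target) - C(source).
    """
    s = source.encode("utf-8")
    t = target.encode("utf-8")
    return compressed_size(s + t) - compressed_size(s)


def tree_cost(edges, texts) -> int:
    """Sum the conditional costs over (parent, child) edges of a candidate tree.

    `edges` is a list of (parent_name, child_name) pairs and `texts` maps
    names to strings. Both are hypothetical inputs for this illustration.
    """
    return sum(conditional_cost(texts[p], texts[c]) for p, c in edges)


# Toy data: B is a near-copy of A, C is unrelated to A.
texts = {
    "A": "Once upon a time a bishop named Henry travelled from England "
         "to Finland to preach to the people there.",
    "B": "Once upon a time a bishop named Henry journeyed from England "
         "to Finland to preach to the people there.",
    "C": "Quick zebras vex the lazy dog while jumbled wizards box fjords "
         "at midnight under a gray sky.",
}

cost_copy = conditional_cost(texts["A"], texts["B"])
cost_unrelated = conditional_cost(texts["A"], texts["C"])
print(cost_copy < cost_unrelated)  # a near-copy should be cheaper to describe
```

Under these assumptions, a stochastic search would propose changes to the tree and prefer candidates with a lower `tree_cost`. The search step itself is omitted here.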
