Tree Reconstruction from Multi-State Characters

In evolutionary biology, a character is a function χ from a set X of present-day species into a finite set of states. Suppose the species in X have evolved according to a bifurcating tree T. Biologists would like to use characters to infer this tree. Assume that χ is the result of an evolutionary process on T that has not involved reverse or parallel transitions; such characters are called homoplasy-free. In this case, χ provides direct combinatorial information about the underlying evolutionary tree T for X. We consider the question of how many homoplasy-free characters are required so that T can be correctly reconstructed. We first establish lower bounds showing that, when the number of states is bounded, the number of homoplasy-free characters required to reconstruct T grows (at least) linearly with the size of X. In contrast, our main result shows that, when the state space is sufficiently large, every bifurcating tree can be uniquely determined by just five homoplasy-free characters. We briefly describe the relevance of this result for some new types of genomic data, and for the amalgamation of evolutionary trees.
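
To make the notion of a homoplasy-free character concrete, the following minimal sketch tests whether a given character could have evolved on a given tree without reverse or parallel transitions. It assumes the standard combinatorial formulation, which the abstract does not spell out: a character is homoplasy-free on T exactly when the minimal subtrees connecting the species in each state are pairwise vertex-disjoint. The adjacency-dictionary representation and the names spanning_subtree and is_homoplasy_free are illustrative choices, not taken from the paper.

```python
from collections import defaultdict


def spanning_subtree(adj, targets):
    """Return the vertices of the minimal subtree of `adj` connecting `targets`.

    `adj` maps each vertex of an unrooted tree to a list of its neighbours.
    Vertices of degree <= 1 that are not targets are pruned repeatedly.
    """
    targets = set(targets)
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    alive = set(adj)
    stack = [v for v in adj if degree[v] <= 1 and v not in targets]
    while stack:
        v = stack.pop()
        if v not in alive:
            continue  # already pruned via another neighbour
        alive.discard(v)
        for u in adj[v]:
            if u in alive:
                degree[u] -= 1
                if degree[u] <= 1 and u not in targets:
                    stack.append(u)
    return alive


def is_homoplasy_free(adj, character):
    """Check the assumed convexity condition for `character` on the tree `adj`.

    `character` maps each species (leaf) to its state.
    """
    by_state = defaultdict(set)
    for species, state in character.items():
        by_state[state].add(species)
    used = set()  # vertices already claimed by some state's subtree
    for leaves in by_state.values():
        subtree = spanning_subtree(adj, leaves)
        if subtree & used:
            return False  # two state subtrees share a vertex
        used |= subtree
    return True


# Hypothetical four-species tree with interior vertices u and v,
# grouping {a, b} against {c, d}.
quartet = {
    "a": ["u"], "b": ["u"], "c": ["v"], "d": ["v"],
    "u": ["a", "b", "v"], "v": ["c", "d", "u"],
}
print(is_homoplasy_free(quartet, {"a": 0, "b": 0, "c": 1, "d": 1}))  # True
print(is_homoplasy_free(quartet, {"a": 0, "c": 0, "b": 1, "d": 1}))  # False
```

In the second call, the subtree joining a and c and the subtree joining b and d both pass through u and v, so on this tree that character would require a parallel or reverse transition.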