Feature Selection in Taxonomies with Applications to Paleontology

Taxonomies over a set of features occur in many real-world domains. One example is paleontology, where the task is to determine the age of a fossil site from the taxa found in it. Because the fossil record is very noisy and has many gaps, the challenge is to consider taxa at a suitable level of aggregation: species, genus, family, and so on. For example, some species can be well suited as features for the age prediction task, while in other parts of the taxonomy it is better to use the genus level or even higher levels of the hierarchy. A default choice is to select a fixed level (typically species or genus), which misses the potential gain of choosing the proper level separately for different sets of species. Motivated by this application, we study the problem of selecting an antichain from a taxonomy that covers all leaves and helps to better predict a specified target variable. Our experiments on paleontological data show that choosing antichains leads to better predictions than fixing a specific level of the taxonomy beforehand.
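
To make the notion of an antichain that covers all leaves concrete, the following minimal Python sketch checks the condition on a small tree and maps the taxa found at a site onto the selected nodes. The toy taxonomy, the child-to-parent encoding, and the function names are assumptions made for illustration only; this is not the selection algorithm studied in the paper.

```python
# Hypothetical toy taxonomy, encoded as child -> parent (None marks the root).
PARENT = {
    "Mammalia": None,
    "Equidae": "Mammalia",
    "Bovidae": "Mammalia",
    "Hipparion": "Equidae",
    "Equus": "Equidae",
    "Gazella": "Bovidae",
}


def ancestors(node, parent=PARENT):
    """Return the set containing the node itself and all of its ancestors."""
    out = set()
    while node is not None:
        out.add(node)
        node = parent[node]
    return out


def leaves(parent=PARENT):
    """Leaves are the nodes that never appear as a parent."""
    return set(parent) - set(parent.values())


def is_covering_antichain(selection, parent=PARENT):
    """True if every leaf has exactly one selected node on its path to the root.

    In a tree this is equivalent to: no selected node is an ancestor of
    another (antichain), and every leaf lies under some selected node (cover).
    """
    selection = set(selection)
    return all(
        len(ancestors(leaf, parent) & selection) == 1 for leaf in leaves(parent)
    )


def aggregate(site_taxa, selection, parent=PARENT):
    """Map the leaf taxa found at a site to the selected nodes covering them."""
    selection = set(selection)
    return {n for taxon in site_taxa for n in ancestors(taxon, parent) & selection}
```

Here the selection `{"Equidae", "Gazella"}` mixes levels: one branch is aggregated one level up, while the other keeps its leaf. Fixing a single level corresponds to the special case where all selected nodes sit at the same depth.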
