From black and white to full color: extending redescription mining outside the Boolean world

Redescription mining is a powerful data analysis tool used to find multiple descriptions of the same entities. Consider geographical regions as an example: they can be characterized by the fauna that inhabits them on the one hand and by their meteorological conditions on the other. Finding such redescriptors, a task known as niche-finding, is of great importance in biology. Current redescription mining methods cannot handle data other than Boolean. This restricts the range of possible applications or makes discretization a prerequisite, entailing a possibly harmful loss of information. In niche-finding, the fauna can be naturally represented using Boolean presence/absence data, but the weather cannot. In this paper, we extend redescription mining to categorical and real-valued data with possibly missing values, using a surprisingly simple and efficient approach. We provide an extensive experimental evaluation to study the behavior of the proposed algorithm. Furthermore, we show the statistical significance of our results using recent innovations in randomization methods. © 2012 Wiley Periodicals, Inc. Statistical Analysis and Data Mining, 2012. (Part of this work was done when the author was with HIIT.)

[1]  Jorge Soberón,et al.  Niches and distributional areas: Concepts, methods, and assumptions , 2009, Proceedings of the National Academy of Sciences.

[2]  Heikki Mannila,et al.  Randomization methods for assessing data analysis results on real-valued matrices , 2009 .

[3]  T. Dawson,et al.  Predicting the impacts of climate change on the distribution of species: are bioclimate envelope models useful? , 2003 .

[4]  Deept Kumar,et al.  Turning CARTwheels: an alternating algorithm for mining redescriptions , 2003, KDD.

[5]  Arno Knobbe,et al.  Exceptional Model Mining , 2008, ECML/PKDD.

[6]  J. L. Parra,et al.  Very high resolution interpolated climate surfaces for global land areas , 2005 .

[7]  Mohammed J. Zaki Scalable Algorithms for Association Mining , 2000, IEEE Trans. Knowl. Data Eng..

[8]  J. Grinnell The Niche-Relationships of the California Thrasher , 1917 .

[9]  Lior Rokach,et al.  Data Mining And Knowledge Discovery Handbook , 2005 .

[10]  Jan Zima,et al.  The Atlas of European Mammals , 1999 .

[11]  A. Peterson,et al.  INTERPRETATION OF MODELS OF FUNDAMENTAL ECOLOGICAL NICHES AND SPECIES' DISTRIBUTIONAL AREAS , 2005 .

[12]  Stefan Rüping,et al.  On subgroup discovery in numerical domains , 2009, Data Mining and Knowledge Discovery.

[13]  Grigorios Tsoumakas,et al.  Mining Multi-label Data , 2010, Data Mining and Knowledge Discovery Handbook.

[14]  Matthijs van Leeuwen,et al.  Maximal exceptions with minimal descriptions , 2010, Data Mining and Knowledge Discovery.

[15]  Naren Ramakrishnan,et al.  Redescription Mining: Structure Theory and Algorithms , 2005, AAAI.

[16]  Raúl E. Valdés-Pérez,et al.  Differentiating 451 languages in terms of their segment inventories , 2002 .

[17]  Usama M. Fayyad,et al.  Multi-Interval Discretization of Continuous-Valued Attributes for Classification Learning , 1993, IJCAI.

[18]  Naren Ramakrishnan,et al.  Reasoning about sets using redescription mining , 2005, KDD '05.

[19]  Deept Kumar,et al.  Redescription Mining: Algorithms and Applications in Bioinformatics , 2007 .

[20]  Gemma C. Garriga,et al.  Cross-Mining Binary and Numerical Attributes , 2007, Seventh IEEE International Conference on Data Mining (ICDM 2007).

[21]  Pauli Miettinen,et al.  Finding Subgroups having Several Descriptions: Algorithms for Redescription Mining , 2008, SDM.

[22]  Johannes Fürnkranz,et al.  Guest Editorial: Global modeling using local patterns , 2010, Data Mining and Knowledge Discovery.

[23]  Ramakrishnan Srikant,et al.  Mining quantitative association rules in large relational tables , 1996, SIGMOD '96.

[24]  Annie Morin,et al.  Subgroup Discovery in Data Sets with Multi-dimensional Responses: A Method and a Case Study in Traumatology , 2009, AIME.

[25]  Toshihide Ibaraki,et al.  An Implementation of Logical Analysis of Data , 2000, IEEE Trans. Knowl. Data Eng..

[26]  Geoffrey I. Webb,et al.  Supervised Descriptive Rule Discovery: A Unifying Survey of Contrast Set, Emerging Pattern and Subgroup Mining , 2009, J. Mach. Learn. Res..