Données déséquilibrées, entropie décentrée et indice d'implication

This paper is concerned with the induction of classification trees for imbalanced data, i.e. the case where some categories of the target variable are much rarer than others. More specifically, we address two aspects: on the one hand, defining tree-growing criteria that efficiently exploit the imbalanced nature of the data, and on the other hand, the relevance of the conclusion to assign to the leaves of the tree. We recently addressed these issues from two independent angles: the first relied on off-centered entropies, the second on measures of implication intensity derived from implicative statistical analysis (ASI). Here we compare the two approaches and establish the similarities between them.

Gilbert Ritschard | Djamel A. Zighed | Simon Marcellin
