Strong Compound-Risk Factors: Efficient Discovery Through Emerging Patterns and Contrast Sets

Odds ratio (OR), relative risk (RR, also called risk ratio), and absolute risk reduction (ARR, also called risk difference) are biostatistical measures widely used to identify significant risk factors in dichotomous groups of subjects. In the past, they have mostly been used to assess simple risk factors. In this paper, we introduce the concept of compound-risk factors to broaden the applicability of these statistical tests to the assessment of factor interplay. We observe that compound-risk factors with a high risk ratio or a large risk difference have a one-to-one correspondence to strong emerging patterns or strong contrast sets, two types of patterns that have been extensively studied in the data mining field. This relationship was previously unknown to researchers, and efficient algorithms for discovering strong compound-risk factors have been lacking. We propose a theoretical framework and a new algorithm that unify the discovery of compound-risk factors that have a strong OR, risk ratio, or risk difference. Our method guarantees that all patterns meeting a given test threshold can be discovered efficiently. Our contribution is thus the first of its kind to link risk ratios and ORs to pattern mining algorithms, making it possible to find compound-risk factors in large-scale data sets. In addition, we show that using compound-risk factors can improve the classification accuracy of probabilistic learning algorithms on several disease data sets, because these compound-risk factors capture the interdependency between important data attributes.
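
As a rough illustration of the three measures named above, the following minimal sketch computes the risk ratio, risk difference, and odds ratio of a compound factor (a set of attributes that must all be present) from a 2x2 contingency table. It uses the standard textbook definitions and is not the discovery algorithm proposed in the paper. The function name `risk_measures`, the attribute names, and the toy records are all hypothetical and chosen only for this example.

```python
from fractions import Fraction


def risk_measures(records, factor):
    """Compute textbook risk measures for a compound factor.

    records: list of (set_of_attributes, has_disease) pairs (hypothetical format).
    factor:  set of attributes; a subject is "exposed" only if it has all of them.
    Assumes the 2x2 table has no zero cell where a division is needed.
    """
    a = b = c = d = 0  # a: exposed+diseased, b: exposed+healthy,
    #                    c: unexposed+diseased, d: unexposed+healthy
    for attributes, diseased in records:
        exposed = factor <= attributes  # subset test: all factor attributes present
        if exposed and diseased:
            a += 1
        elif exposed:
            b += 1
        elif diseased:
            c += 1
        else:
            d += 1

    risk_exposed = Fraction(a, a + b)
    risk_unexposed = Fraction(c, c + d)
    return {
        "risk_ratio": risk_exposed / risk_unexposed,
        "risk_difference": risk_exposed - risk_unexposed,
        "odds_ratio": Fraction(a * d, b * c),
    }


# Toy data invented for illustration only.
records = [
    ({"smoker", "obese"}, True),
    ({"smoker", "obese"}, True),
    ({"smoker", "obese"}, True),
    ({"smoker", "obese"}, False),
    ({"smoker"}, True),
    ({"smoker"}, False),
    ({"obese"}, False),
    ({"obese"}, False),
    (set(), False),
    (set(), False),
]

measures = risk_measures(records, {"smoker", "obese"})
for name, value in measures.items():
    print(f"{name}: {value} (~{float(value):.2f})")
```

Exact fractions are used here only to keep the arithmetic easy to check by hand; a practical implementation would also need to handle zero cells, which this sketch deliberately does not.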
