Searching for Patterns in Imbalanced Data - Methods and Alternatives with Case Studies in Life Sciences

The prime motivation for pattern discovery and machine learning research has been the collection and warehousing of large amounts of data in domains such as the life sciences and industrial processes. Among the distinctive problems that have arisen are situations where the data is imbalanced. The class imbalance problem refers to situations where the majority of cases belong to one class and only a small minority belong to the other, which in many cases is equally or even more important. A number of approaches to this problem have been studied in the past. In this talk we give an overview of some existing methods and present novel applications based on identifying the inherent characteristics of one class versus the other. We present the results of several studies on real data from life science applications.
