Mixed-sampling approach to unbalanced data distributions: a case study involving Leukemia's document profiling

The types of leukemia and their relationships to the literature are introduced, and on this basis a leukemia data set for classification is constructed from original data sources such as the Cancer Gene Census, PubMed, and gene2pubmed. The resulting data set is imbalanced and serves as the research object. After a review of current classification methods for imbalanced data sets, the problems that sampling raises for such data are analyzed, and a mixed-sampling method is proposed to classify the leukemia data set. The multi-class leukemia problem is transformed into a set of two-class problems, and the area under the receiver operating characteristic (ROC) curve (AUC) is used to evaluate the mixed-sampling method. Experiments are then performed to assess the classification efficiency and stability of eight classification methods, and their results are compared. The mixed-sampling method is found to achieve the best performance among them. Finally, the paper is concluded with an outlook on future work.
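
The abstract does not spell out the sampling procedure, so the following is only a minimal sketch of the general idea, under stated assumptions: "mixed sampling" is taken to mean random undersampling of the majority class combined with random oversampling (by replication) of the minority class, the multi-class labels are reduced to two-class problems in a one-vs-rest fashion, and AUC is computed as the pairwise ranking probability. The function names (`one_vs_rest`, `mixed_sample`, `auc`) and the midpoint balancing target are hypothetical choices for illustration, not the authors' implementation.

```python
import random
from collections import Counter


def one_vs_rest(labels, positive):
    """Reduce multi-class labels to binary: 1 for `positive`, 0 for all others."""
    return [1 if y == positive else 0 for y in labels]


def mixed_sample(X, y, rng):
    """Rebalance a binary data set (assumes both classes are present).

    Assumption: the majority class is undersampled without replacement and the
    minority class is oversampled by replication, both toward the midpoint
    of the two class sizes.
    """
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    target = (len(minority) + len(majority)) // 2

    kept_majority = rng.sample(majority, target)  # undersampling
    extra = [rng.choice(minority) for _ in range(target - len(minority))]  # oversampling

    idx = kept_majority + minority + extra
    rng.shuffle(idx)
    return [X[i] for i in idx], [y[i] for i in idx]


def auc(scores, y):
    """AUC as the probability that a random positive example is scored
    above a random negative one; ties count as one half."""
    pos = [s for s, label in zip(scores, y) if label == 1]
    neg = [s for s, label in zip(scores, y) if label == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

In use, each class would in turn be treated as the positive class via `one_vs_rest`, the training portion rebalanced with `mixed_sample`, a classifier fitted on the rebalanced data, and its scores on untouched held-out data passed to `auc`. Resampling only the training portion keeps the evaluation on the original class distribution.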
