The Monte Carlo feature selection and interdependency discovery is unbiased

We show that the Monte Carlo feature selection algorithm for supervised classification proposed by Draminski et al. (2008) is not biased towards features with many categories (levels or values). The algorithm, later extended to discover interdependencies between features, is surprisingly simple and has been used successfully on many biological data sets and on transactional data of commercial origin. Although it has never revealed any bias of the type mentioned, its alleged unbiasedness required closer scrutiny, which is provided here. Admittedly, the algorithm does reveal some bias coming from another source, but it is negligible. Hence our final claim is that the algorithm is practically unbiased and the results it provides can be considered fully reliable.

Keywords: supervised classification, feature selection, feature interactions, high-dimensional problems, applications to genomic and proteomic data.

[1]  Yi Li,et al.  Bayesian automatic relevance determination algorithms for classifying gene expression data , 2002, Bioinformatics.

[2]  Achim Zeileis,et al.  Bias in random forest variable importance measures: Illustrations, sources and a solution , 2007, BMC Bioinformatics.

[3]  K. Hornik,et al.  Unbiased Recursive Partitioning: A Conditional Inference Framework , 2006 .

[4]  Jan Komorowski,et al.  Monte Carlo feature selection for supervised classification , 2008, Bioinformatics. doi:10.1093/bioinformatics/btm486

[5]  Ramón Díaz-Uriarte,et al.  Gene selection and classification of microarray data using random forest , 2006, BMC Bioinformatics.

[6]  Marcin Kierczak.  From Physicochemical Features to Interdependency Networks: A Monte Carlo Approach to Modeling HIV-1 Resistome and Post-translational Modifications , 2009 .

[7]  Pedro Larrañaga,et al.  A review of feature selection techniques in bioinformatics , 2007, Bioinform..

[8]  Sandrine Dudoit,et al.  Classification in microarray experiments , 2003 .

[9]  Sabine Van Huffel,et al.  Bagging Linear Sparse Bayesian Learning Models for Variable Selection in Cancer Diagnosis , 2007, IEEE Transactions on Information Technology in Biomedicine.

[10]  Achim Zeileis,et al.  Conditional variable importance for random forests , 2008, BMC Bioinformatics.

[11]  Sohail Asghar,et al.  A REVIEW OF FEATURE SELECTION TECHNIQUES IN STRUCTURE LEARNING , 2013 .

[12]  Xiaohui Liu,et al.  Combining multiple classifiers for wrapper feature selection , 2008, Int. J. Data Min. Model. Manag..

[13]  Jan Komorowski,et al.  A Rough Set-Based Model of HIV-1 Reverse Transcriptase Resistome , 2009, Bioinformatics and biology insights.

[14]  R. Tibshirani,et al.  Diagnosis of multiple cancer types by shrunken centroids of gene expression , 2002, Proceedings of the National Academy of Sciences of the United States of America.

[15]  Jan Komorowski,et al.  Monte Carlo Feature Selection and Interdependency Discovery in Supervised Classification , 2010, Advances in Machine Learning II.

[16]  Trevor Hastie,et al.  Class Prediction by Nearest Shrunken Centroids, with Applications to DNA Microarrays , 2003 .

[17]  Kellie J. Archer,et al.  Empirical characterization of random forest variable importance measures , 2008, Comput. Stat. Data Anal..

[18]  Michał Dramiński,et al.  Computational Analysis of Molecular Interaction Networks Underlying Change of HIV-1 Resistance to Selected Reverse Transcriptase Inhibitors , 2010, Bioinformatics and biology insights.