Comparison of feature selection techniques in classifying stroke documents

The volume of digital biomedical literature keeps growing, which makes it difficult for researchers to manage and retrieve the information they need from the Internet. Applying text classification to biomedical literature is one way to address this problem, but the high dimensionality of the data is a common issue in text classification. The aim of this research is therefore to compare techniques for selecting relevant features when classifying biomedical text abstracts. The research focuses on Pearson’s Correlation and Information Gain as feature selection techniques for reducing the high dimensionality of the data. To this end, we conduct and evaluate several experiments using a dataset of 100 abstracts of stroke documents retrieved from the PubMed database. The dataset undergoes text pre-processing, a crucial step before the feature selection phase, in which Information Gain and Pearson’s Correlation are applied. A Support Vector Machine classifier is used to evaluate and compare the effectiveness of the two feature selection techniques. On this dataset, Information Gain outperformed Pearson’s Correlation by 3.3%. This research aims to extract meaningful features from a subset of stroke documents that can be used in various applications, especially in diagnosing stroke.
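
As an illustration of the two feature selection criteria compared here, the following minimal sketch scores each term of a small labelled corpus by Information Gain and by the absolute value of Pearson's correlation with the class label. It is not the authors' implementation: the toy corpus, the binary term-presence representation, and the helper names (`information_gain`, `pearson`, `rank_features`) are assumptions made for illustration, and the pre-processing and Support Vector Machine steps are omitted.

```python
import math
from collections import Counter

# Hypothetical toy corpus (NOT the paper's PubMed data): 1 = stroke, 0 = other.
DOCS = [
    ("ischemic stroke patients received thrombolysis", 1),
    ("stroke risk rises with hypertension", 1),
    ("acute stroke imaging in patients", 1),
    ("diabetes patients received insulin therapy", 0),
    ("asthma therapy in children", 0),
    ("hypertension therapy and diet", 0),
]


def entropy(labels):
    """Shannon entropy (in bits) of a list of discrete labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(x, y):
    """IG(Y; X) = H(Y) - sum_v P(X = v) * H(Y | X = v) for a discrete feature x."""
    n = len(y)
    conditional = 0.0
    for value in set(x):
        subset = [yi for xi, yi in zip(x, y) if xi == value]
        conditional += len(subset) / n * entropy(subset)
    return entropy(y) - conditional


def pearson(x, y):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def rank_features(docs, k):
    """Return the top-k terms under each criterion, plus the raw scores."""
    tokens = [set(text.split()) for text, _ in docs]
    y = [label for _, label in docs]
    vocab = sorted(set().union(*tokens))
    ig_scores, pc_scores = {}, {}
    for term in vocab:
        # Binary presence/absence of the term in each document (an assumption).
        x = [1 if term in doc else 0 for doc in tokens]
        ig_scores[term] = information_gain(x, y)
        pc_scores[term] = abs(pearson(x, y))

    def top(scores):
        # Highest score first; ties broken alphabetically for determinism.
        return sorted(scores, key=lambda t: (-scores[t], t))[:k]

    return top(ig_scores), top(pc_scores), ig_scores, pc_scores


if __name__ == "__main__":
    top_ig, top_pc, _, _ = rank_features(DOCS, k=2)
    print("Top terms by Information Gain:", top_ig)
    print("Top terms by |Pearson r|:", top_pc)
```

In a full pipeline, the terms retained by each criterion would be used to build the reduced feature vectors on which the classifier is trained and evaluated, so that the two selections can be compared on the same data.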
