Effect of incremental feature enrichment on healthcare text classification system: A machine learning paradigm

BACKGROUND AND OBJECTIVE Healthcare tweets are particularly challenging to classify because of their sparse layout and limited character count. In contrast to previous methods based on the "bag of words" (BOW) model alone, this study defines an enrichment protocol and examines how semantically different feature representations, namely BOW (feature F0), term frequency-inverse document frequency (TF-IDF, feature F1), and latent semantic indexing (LSI, feature F2), improve overall performance when applied sequentially with a classifier. METHODS To study this enrichment concept, our machine learning (ML) model is tested on two diverse data sets: (i) D1: a disease data set of tweets related to conjunctivitis, diarrhea, stomach ache, cough, and nausea, and (ii) D2: the WebKB4 data set, using three kinds of classifiers: (a) C1: support vector machine with radial basis function kernel (SVMR), (b) C2: multi-layer perceptron (MLP), and (c) C3: random forest (RF). A partition protocol (K10) was adopted, together with several performance metrics, to evaluate the ML system. RESULTS With the combination F1, C1, D1, K10, ML accuracy was 94%, while with F2, C1, D1, K10, ML accuracy was 97%. Under incremental feature enrichment from F0 to F2 with the K10 protocol, feature F1 improved over F0 by 4.98% on the disease data set, while F2 improved over F0 by 11.78% on the WebKB4 data set. We demonstrated generalization over memorization in our ML design. The system was tested for stability and reliability. CONCLUSIONS We conclude that semantically different aspects of feature selection, when applied sequentially, lead to an improvement in ML accuracy for healthcare data sets. We validated the system on non-healthcare data sets.
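The first enrichment step described in the abstract (F0 to F1) and the partition protocol can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation: the toy tweets, the function names, the smoothed IDF formula, and the L2 normalization are all assumptions made for illustration, and K10 is assumed here to mean a 10-fold partition, which the abstract does not spell out. The LSI step (F2) is omitted because it requires a truncated singular value decomposition of the TF-IDF matrix, which is normally delegated to a numerical library.

```python
import math
from collections import Counter


def bow_features(docs):
    """F0: raw term counts per document over a shared, sorted vocabulary."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    counts = [Counter(toks) for toks in tokenized]
    matrix = [[c[term] for term in vocab] for c in counts]
    return vocab, matrix


def tfidf_features(matrix):
    """F1: reweight BOW counts by a smoothed IDF, then L2-normalize each row.

    The smoothing (adding 1 to both counts and to the log) is one common
    convention, assumed here; the paper's exact weighting is not given.
    """
    n_docs = len(matrix)
    n_terms = len(matrix[0]) if matrix else 0
    doc_freq = [sum(1 for row in matrix if row[j] > 0) for j in range(n_terms)]
    idf = [math.log((1 + n_docs) / (1 + doc_freq[j])) + 1.0 for j in range(n_terms)]
    result = []
    for row in matrix:
        weighted = [row[j] * idf[j] for j in range(n_terms)]
        norm = math.sqrt(sum(v * v for v in weighted))
        result.append([v / norm if norm else 0.0 for v in weighted])
    return result


def kfold_indices(n_samples, k=10):
    """Yield (train_idx, test_idx) pairs for a contiguous k-fold partition."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        test_idx = list(range(start, stop))
        train_idx = [i for i in range(n_samples) if i < start or i >= stop]
        yield train_idx, test_idx
        start = stop


# Hypothetical toy tweets, invented purely for illustration.
tweets = [
    "bad cough and nausea today",
    "stomach ache after lunch",
    "cough cough cannot sleep",
]
vocab, f0 = bow_features(tweets)   # F0: bag of words
f1 = tfidf_features(f0)            # F1: TF-IDF enrichment of F0
folds = list(kfold_indices(20, k=10))
```

In this sketch, a term that occurs in fewer documents (such as "nausea") receives a larger weight in F1 than an equally frequent term that occurs in more documents (such as "cough"), which is the kind of reweighting that BOW alone cannot express. In a full pipeline, each fold's training indices would be used to fit the feature transform and the classifier, and the held-out indices would be used for evaluation.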
