Classification of Interviews - A Case Study on Cancer Patients

With the rapid expansion of Web 2.0, documents of many kinds abound online, so methods that annotate and organize them in meaningful ways are needed to expedite search. A considerable amount of research on document classification has already been conducted; this paper introduces the classification of interviews with cancer patients by type of cancer, based on features collected from the corpus. We have developed a corpus of 727 interviews collected from a web archive of medical articles. TF-IDF features of unigrams, bigrams, trigrams and emotion words, together with SentiWordNet and cosine similarity features, were used to train and test the classification systems. We employed three classifiers, k-NN, Decision Tree and Naive Bayes, to classify the documents into different classes of cancer. The experiments achieved a maximum accuracy of 99.31% on the 73 documents of the test set.
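
To make the feature pipeline concrete, the following is a minimal sketch, not the authors' implementation, of TF-IDF weighting over word unigrams, bigrams and trigrams, followed by cosine similarity and a k-NN vote. The function names, the smoothed IDF formula, and the four toy "interviews" with their labels are all assumptions made for illustration; the emotion-word and SentiWordNet features and the Decision Tree and Naive Bayes classifiers are left out.

```python
import math
from collections import Counter


def ngrams(text, max_n=3):
    """Return all word n-grams (n = 1..max_n) of a lowercased text."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]


def fit_idf(docs):
    """Smoothed inverse document frequency for every n-gram in the training docs."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(ngrams(doc)))
    return {term: math.log((1 + n_docs) / (1 + df)) + 1.0
            for term, df in doc_freq.items()}


def tfidf(doc, idf):
    """Sparse TF-IDF vector; n-grams unseen during training are ignored."""
    term_freq = Counter(t for t in ngrams(doc) if t in idf)
    return {t: count * idf[t] for t, count in term_freq.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v[t] for t, weight in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def knn_predict(doc, train_vecs, train_labels, idf, k=1):
    """Label a document by majority vote among its k most similar training docs."""
    query = tfidf(doc, idf)
    ranked = sorted(zip(train_vecs, train_labels),
                    key=lambda pair: cosine(query, pair[0]),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy data invented for this sketch; it is not drawn from the paper's corpus.
train_docs = [
    "the mammogram showed a lump in my breast",
    "after the breast biopsy i had a mastectomy",
    "a chest scan found a tumour in my lung",
    "i coughed for months before the lung biopsy",
]
train_labels = ["breast", "breast", "lung", "lung"]

idf = fit_idf(train_docs)
train_vecs = [tfidf(d, idf) for d in train_docs]
prediction = knn_predict("the doctor found a lump in my breast",
                         train_vecs, train_labels, idf, k=1)
print(prediction)
```

On a real corpus the value of k, the tokenization, and the IDF smoothing would each need tuning; the sketch only shows how the n-gram TF-IDF vectors and the cosine measure fit together.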