Improving the text classification using clustering and a novel HMM to reduce the dimensionality

In text classification problems, the representation of a document has a strong impact on the performance of learning systems. The high dimensionality of classical structured representations can make computation burdensome, given the large size of real-world data. Consequently, the amount of information handled needs to be reduced to improve the classification process. In this paper, we propose a method that reduces the dimensionality of a classical text representation by using a clustering technique to group documents and a previously developed Hidden Markov Model to represent them. We tested the proposed dimensionality reduction technique with the k-NN and SVM classifiers on the OHSUMED and TREC benchmark text corpora. The experimental results compare favorably with commonly used techniques such as InfoGain, and the statistical tests performed demonstrate the suitability of the proposed technique for the preprocessing step of a text classification task.
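For context on the InfoGain baseline mentioned above, the following is a minimal sketch of information-gain feature selection, not the clustering and HMM method proposed in the paper. It assumes each document is a set of tokens and scores the binary feature "term occurs in the document". The function names (`entropy`, `information_gain`, `select_top_terms`) and the toy documents are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(docs, labels, term):
    """Information gain of the binary feature 'term occurs in the document'.

    docs is a list of token sets (an assumption of this sketch),
    labels is the parallel list of class labels.
    """
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without_term = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    # Class entropy after splitting on the term, weighted by split size.
    conditional = (len(with_term) / n) * entropy(with_term) \
        + (len(without_term) / n) * entropy(without_term)
    return entropy(labels) - conditional


def select_top_terms(docs, labels, k):
    """Keep the k terms with the highest information gain."""
    vocabulary = sorted(set().union(*docs))
    ranked = sorted(
        vocabulary,
        key=lambda t: information_gain(docs, labels, t),
        reverse=True,
    )
    return ranked[:k]


# Toy data, invented for illustration only.
docs = [
    {"heart", "artery"},
    {"heart", "valve"},
    {"gene", "protein"},
    {"gene", "valve"},
]
labels = ["cardio", "cardio", "genomics", "genomics"]

top_terms = select_top_terms(docs, labels, k=2)
```

In this toy example, "heart" and "gene" each separate the two classes perfectly, while "valve" appears once in each class and carries no information about the label. Reducing the vocabulary to the top-ranked terms is the kind of term-selection baseline that the proposed technique is compared against.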
