Explorative Analysis of Compact Representation of Imbalanced Text Documents for Effective Classification

The volume of text data on the internet grows day by day, posing many challenges for text classification systems. The enormous amount of text data leads to high computational cost because of its volume and high dimensionality. Text representation is one such challenge; it strongly affects the performance of a text classification system. Hence, there is a great need to reduce both the volume of text and its dimensionality in order to improve text classification. In this paper, we present an empirical study of representing text in a compact form in two ways. The first method applies sample reduction first, followed by dimensionality reduction using a feature selection method. The second method applies feature reduction based on a feature transformation method, followed by sample reduction using a clustering technique. The data are represented both with and without symbolic representation in order to compare the effectiveness of the compact form. A symbolic classifier and an SVM classifier are employed to classify the text documents. The efficiency of the proposed model is evaluated using performance measures on two benchmark datasets, Reuters-21578 and TDT2.