Summarization as Feature Selection for Arabic Text Classification

Text classification (TC), also called text categorization, is the task of assigning a document to one or more predefined classes or categories. A common problem in TC is the high number of terms, or features, in the documents to be classified (the curse of dimensionality). This problem can be addressed by selecting the most important terms. In this study, automatic text summarization is used for feature selection, since summarization identifies the set of sentences that are most important for the overall understanding of a document. We assess the effectiveness of summarization techniques for text classification. For comparison, a second feature selection technique, Term Frequency (TF), is applied to the same data set in its full-text form, i.e., before summarization. A Support Vector Machine is used to classify our Arabic data set. Classifier performance is evaluated in terms of classification accuracy, precision, recall, and execution time. Finally, the results of classifying full documents are compared with those of classifying summarized documents.

Keywords: Text Categorization; Text Summarization; Support Vector Machine; Feature Selection.
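
The two feature selection routes described above can be illustrated with a minimal sketch. The abstract does not specify which summarization method is used, so the frequency-based sentence scoring below is an assumption chosen for simplicity, and the function names (`tokenize`, `split_sentences`, `summarize`, `top_tf_terms`) are hypothetical. The sketch covers only the feature selection step; the Support Vector Machine classifier and its evaluation are not shown.

```python
import re
from collections import Counter


def tokenize(text):
    # \w is Unicode-aware in Python 3, so Arabic letters match as well.
    # Assumption: no stemming, stop-word removal, or diacritic normalization.
    return re.findall(r"\w+", text.lower())


def split_sentences(text):
    # Naive splitter: break after ., !, ? or the Arabic question mark (U+061F).
    parts = re.split(r"(?<=[.!?\u061F])\s+", text.strip())
    return [p for p in parts if p]


def summarize(text, n_sentences=2):
    # Extractive summary (assumed method): score each sentence by the
    # average document-level frequency of its terms, keep the top
    # n_sentences, and return them in their original order.
    sentences = split_sentences(text)
    tf = Counter(tokenize(text))

    def score(sentence):
        tokens = tokenize(sentence)
        return sum(tf[t] for t in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:n_sentences])
    return " ".join(sentences[i] for i in keep)


def top_tf_terms(text, k=5):
    # Term Frequency selection on the full text: the k most frequent terms.
    return [term for term, _ in Counter(tokenize(text)).most_common(k)]
```

In this sketch, features for the summarization route would be taken from `summarize(document)` instead of the full document, while the TF route keeps the full text and retains only the terms returned by `top_tf_terms`. Either reduced representation could then be vectorized and passed to a classifier.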