Paper information - A simple KNN algorithm for text categorization

A simple KNN algorithm for text categorization

Text categorization (also called text classification) is the process of identifying the class to which a text document belongs. This paper proposes to use a simple KNN algorithm with non-weighted features for text categorization. We propose to use a feature selection method that finds the relevant features for the learning task at hand using feature interaction (based on word interdependencies). This will allow us to reduce considerably the number of selected features from which to learn, making our KNN algorithm applicable in contexts where both the volume of documents and the size of the vocabulary are high, as with the World Wide Web. Therefore, the KNN algorithm that we propose becomes efficient for classifying text documents in that context (in terms of its predictability and interpretability), as is demonstrated. Its simplicity (with respect to its implementation and fine-tuning) becomes its main asset for in-the-field applications.
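To make the idea of KNN over non-weighted features concrete, the following is a minimal sketch and not the authors' implementation. It assumes that each document is reduced to the set of selected vocabulary words it contains (presence or absence only), that similarity is the Jaccard overlap between two such sets, and that the class is chosen by majority vote among the k nearest training documents. The similarity measure, the function names `to_features` and `knn_classify`, the toy vocabulary, and the toy training data are all illustrative assumptions; the abstract does not specify them, and the feature selection step based on word interdependencies is not reproduced here.

```python
from collections import Counter


def to_features(text, vocabulary):
    """Return the set of selected vocabulary words present in the text.

    Features are binary (present or absent), with no term weighting.
    """
    return {word for word in text.lower().split() if word in vocabulary}


def knn_classify(query, training, vocabulary, k=3):
    """Label a query document by majority vote among its k nearest neighbours.

    Similarity is Jaccard overlap between feature sets (an assumption made
    for this sketch, not a detail taken from the paper).
    """
    query_features = to_features(query, vocabulary)
    scored = []
    for text, label in training:
        doc_features = to_features(text, vocabulary)
        union = query_features | doc_features
        similarity = len(query_features & doc_features) / len(union) if union else 0.0
        scored.append((similarity, label))
    # Most similar training documents first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy data: a small pre-selected vocabulary and labelled documents.
vocabulary = {"cheap", "buy", "meeting", "project", "agenda"}
training = [
    ("cheap pills buy now", "spam"),
    ("buy cheap watches", "spam"),
    ("meeting agenda for monday", "ham"),
    ("project meeting notes", "ham"),
]

print(knn_classify("buy cheap pills today", training, vocabulary, k=3))
print(knn_classify("agenda for the project meeting", training, vocabulary, k=3))
```

Because every comparison touches only the selected features, a smaller vocabulary directly shrinks the cost of each similarity computation, which is the motivation the abstract gives for aggressive feature selection.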

Guy W. Mineau | Pascal Soucy

[1] Thorsten Joachims, et al. A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization, 1997, ICML.

[2] Georgios Paliouras, et al. Learning to Filter Spam E-Mail: A Comparison of a Naive Bayesian and a Memory-Based Approach, 2000, ArXiv.

[3] Guy W. Mineau, et al. A Simple Feature Selection Method for Text Classification, 2001, IJCAI.