Paper information - Towards modernised and Web-specific stoplists for Web document analysis

Towards modernised and Web-specific stoplists for Web document analysis

Research areas such as text classification and document clustering underpin many issues in Web intelligence. A fundamental tool in document clustering is a list of 'stop' words (a stoplist), which identifies frequent words that are unlikely to assist in classification and are hence removed during pre-processing. Current stoplists are outdated in light of fluctuations in word usage and contain no 'Web-specific' stop words, which calls into question their applicability to Web-based tasks. We explore this by developing two new word-entropy-based stoplists: one derived from random Web pages, and one derived from the BankSearch dataset. We evaluate these against other stoplists using accuracies obtained from unsupervised clustering experiments. We find that existing stoplists perform well, but are sometimes outperformed by our new stoplists, especially on hard classification tasks.
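To make the idea of a word-entropy-based stoplist concrete, the following is a minimal sketch and not the authors' implementation. It assumes one common definition of word entropy: for each word, the Shannon entropy of the distribution of its occurrences over the documents in a corpus. Under that assumption, a word spread evenly across many documents scores high and is a stop-word candidate, while a word concentrated in a few documents scores low. The function names `word_entropies` and `build_stoplist`, the whitespace tokenisation, and the toy corpus are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def word_entropies(documents):
    """Map each word to the entropy (in bits) of its spread over documents.

    Assumed definition: H(w) = -sum_d p(d|w) * log2 p(d|w), where p(d|w) is
    the share of all occurrences of w that fall in document d.
    """
    # Naive tokenisation: lowercase and split on whitespace.
    per_doc_counts = [Counter(doc.lower().split()) for doc in documents]

    # Total occurrences of each word over the whole corpus.
    totals = Counter()
    for counts in per_doc_counts:
        totals.update(counts)

    entropies = {}
    for word, total in totals.items():
        h = 0.0
        for counts in per_doc_counts:
            c = counts.get(word, 0)
            if c:
                p = c / total
                h -= p * math.log2(p)
        entropies[word] = h
    return entropies


def build_stoplist(documents, size):
    """Return the `size` highest-entropy words (ties broken alphabetically)."""
    entropies = word_entropies(documents)
    ranked = sorted(entropies, key=lambda w: (-entropies[w], w))
    return ranked[:size]


# Toy corpus: "the" occurs evenly in every document, other words in one only.
docs = [
    "the cat sat on the mat",
    "the dog chased the ball",
    "the bank raised the rate",
]
entropies = word_entropies(docs)
stoplist = build_stoplist(docs, 1)
```

In this toy corpus, "the" is the only word that appears in more than one document, so it is the only word with non-zero entropy and heads the ranking. A real stoplist would be built from a far larger corpus, and the cut-off `size` is a tuning choice that the abstract does not specify.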

David W. Corne | Mark P. Sinka
