Paper information - Little words can make a big difference for text classification

Little words can make a big difference for text classification

Most information retrieval systems use stopword lists and stemming algorithms. However, we have found that recognizing singular and plural nouns, verb forms, negation, and prepositions can produce dramatically different text classification results. We present results from text classification experiments that compare relevancy signatures, which use local linguistic context, with corresponding indexing terms that do not. In two different domains, relevancy signatures produced better results than the simple indexing terms. These experiments suggest that stopword lists and stemming algorithms may remove or conflate many words that could be used to create more effective indexing terms.
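The conflation effect described in the abstract can be illustrated with a minimal sketch. This is not the relevancy-signature method or any system from the paper. The stopword list, the `naive_stem` rule, and the example sentences are all hypothetical, chosen only to show how two sentences with different meanings can collapse to the same indexing terms once negation, verb forms, and plural endings are discarded.

```python
# Toy illustration only: a hypothetical stopword list and suffix rule,
# not the stopword list or stemmer used in any particular IR system.
STOPWORDS = {"a", "an", "the", "was", "were", "no", "not", "by", "in", "on", "of"}


def naive_stem(word: str) -> str:
    """Strip a trailing plural 's' from words longer than three letters."""
    if word.endswith("s") and len(word) > 3:
        return word[:-1]
    return word


def index_terms(text: str) -> list[str]:
    """Lowercase, drop stopwords, and stem: the 'simple indexing term' view."""
    tokens = text.lower().split()
    return [naive_stem(t) for t in tokens if t not in STOPWORDS]


def context_terms(text: str) -> list[str]:
    """Keep every token as written, preserving number, negation, and verb form."""
    return text.lower().split()


negated = "no bombs were found"
affirmed = "a bomb was found"

# After stopword removal and stemming, both sentences yield the same terms.
print(index_terms(negated))
print(index_terms(affirmed))

# With the little words kept, the two sentences remain distinguishable.
print(context_terms(negated))
print(context_terms(affirmed))
```

In this sketch, the first two calls return identical term lists even though one sentence is negated and plural and the other is not, which is the kind of information loss the abstract attributes to stopword lists and stemming.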

Ellen Riloff
