How effective is suffixing?

s and titles from the Cranfield collection (225 queries and 1400 documents) comprised the major test collection for this study. The Medlars collection (30 queries and 1033 documents) and the CACM collection (64 queries and 3204 documents) were used to provide information about the variation of stemming performance across different subject areas and test collections.

In addition to the standard recall/precision measures, with SMART system averaging (Salton, 1971), several measures more suited to an interactive retrieval environment were adopted. The interactive environment returns lists of the top-ranked documents and lets users scan the titles of a group of documents a screenful at a time, so the ranking of individual documents within a screenful matters less than the total number of relevant titles within that screen. Furthermore, the number of relevant documents in the first few screens is far more important to the user than the number of relevant documents in the last screenfuls.

Three measures were selected that evaluate performance at given rank cutoff points, such as those corresponding to a screenful of document titles. The first measure, the E measure (Van Rijsbergen, 1979), is a weighted combination of recall and precision that evaluates a set of retrieved documents at a given cutoff, ignoring the ranking within that set. The measure may take weights of 0.5, 1.0, and 2.0, which correspond, respectively, to attaching half as much importance to recall as to precision, equal importance to both, and twice as much importance to recall as to precision. A lower E value indicates more effective performance. The second measure is the total number of relevant documents retrieved by a given cutoff. Cutoffs of 10 and 30 documents were used, with 10 reflecting the minimum number of documents a user might be expected to scan, and 30 being an assumed upper limit on what a user would scan before modifying the query.

TABLE 2. Retrieval performance for Cranfield 225.
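As an illustration of the first two cutoff-based measures, the following minimal sketch computes the number of relevant documents retrieved by a cutoff and the E measure for a single query. It assumes the usual formulation of Van Rijsbergen's measure, E = 1 - (1 + b^2)PR / (b^2 P + R), where P is precision, R is recall, and b is the weight (0.5, 1.0, or 2.0); the function names and the sample document identifiers are hypothetical and do not come from the study.

```python
def relevant_at_cutoff(ranked_ids, relevant_ids, cutoff):
    """Count relevant documents among the top `cutoff` ranked documents.

    Ranking inside the cutoff is ignored, as in a screenful of titles.
    """
    return sum(1 for doc_id in ranked_ids[:cutoff] if doc_id in relevant_ids)


def e_measure(ranked_ids, relevant_ids, cutoff, beta=1.0):
    """E measure at a rank cutoff (assumed standard formulation).

    beta = 0.5 weights recall half as much as precision, beta = 1.0 weights
    them equally, beta = 2.0 weights recall twice as much. Lower is better.
    """
    retrieved = ranked_ids[:cutoff]
    hits = relevant_at_cutoff(ranked_ids, relevant_ids, cutoff)
    if hits == 0 or not retrieved or not relevant_ids:
        return 1.0  # worst possible value: nothing relevant retrieved
    precision = hits / len(retrieved)
    recall = hits / len(relevant_ids)
    b2 = beta * beta
    return 1.0 - (1.0 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical ranked output for one query and its relevance judgments.
ranked = ["d3", "d7", "d1", "d9", "d4"]
relevant = {"d1", "d2", "d4", "d8"}

for beta in (0.5, 1.0, 2.0):
    print(beta, round(e_measure(ranked, relevant, cutoff=5, beta=beta), 4))
```

In this toy query recall (0.5) exceeds precision (0.4), so the E value falls as the weight shifts toward recall.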
The third measure applicable to the interactive environment is the number of queries that retrieve no relevant documents by the given cutoff. This measure is important because many query modification techniques, such as relevance feedback, work well only when relevant documents are present in the retrieved set. All three measures were used in Croft (1983) as complements to the standard recall/precision evaluation.
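The third measure can be sketched in the same spirit. The snippet below counts the queries whose top-ranked documents, up to a cutoff, contain no relevant document. The function name `count_failed_queries` and the sample runs are hypothetical, chosen only for illustration; the cutoffs of 10 and 30 are the ones named in the text.

```python
def count_failed_queries(runs, cutoff):
    """Count queries that retrieve no relevant document within `cutoff`.

    `runs` is a list of (ranked_ids, relevant_ids) pairs, one per query.
    """
    failed = 0
    for ranked_ids, relevant_ids in runs:
        top = ranked_ids[:cutoff]
        if not any(doc_id in relevant_ids for doc_id in top):
            failed += 1
    return failed


# Hypothetical results for three queries (ranked output, relevant set).
sample_runs = [
    (["d1", "d2", "d3"], {"d3"}),        # relevant document at rank 3
    (["d4", "d5", "d6"], {"d9"}),        # no relevant document retrieved
    (["d7", "d8", "d9"], {"d7", "d9"}),  # relevant document at rank 1
]

for cutoff in (10, 30):
    print(cutoff, count_failed_queries(sample_runs, cutoff))
```

A query counted here gives a technique such as relevance feedback nothing to work from, which is why the count complements the averaged measures.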

[1] Julie Beth Lovins, et al. Development of a stemming algorithm, 1968, Mech. Transl. Comput. Linguistics.

[2] A. W. Pratt, et al. Identification and transformation of terminal morphemes in medical English, 1969, Methods of information in medicine.

[3] Gerard Salton, et al. The SMART Retrieval System—Experiments in Automatic Document Processing, 1971.

[4] Gerard Salton, et al. The SMART Retrieval System, 1971.

[5] Stephen F. Weiss, et al. Word segmentation by letter successor varieties, 1974, Inf. Storage Retr.

[6] A. W. Pratt, et al. Identification and transformation of terminal morphemes in medical English part II, 1969, Methods of information in medicine.

[7] Kevin P. Jones, et al. Towards everyday language information retrieval systems via minicomputers, 1979, J. Am. Soc. Inf. Sci.

[8] Martin F. Porter, et al. An algorithm for suffix stripping, 1997, Program.

[9] Peter Willett, et al. An evaluation of some conflation algorithms for information retrieval, 1981.

[10] Michael McGill, et al. Introduction to Modern Information Retrieval, 1983.

[11] John E. Ulmschneider, et al. A practical stemming algorithm for online search assistance, 1983.

[12] William B. Frakes. Term Conflation for Information Retrieval, 1984, SIGIR.

[13] Donna K. Harman, et al. An experimental study of factors important in document ranking, 1986, SIGIR '86.

[14] Donna K. Harman, et al. IRX: an information retrieval system for experimentation and user applications, 1988, SIGF.

[15] Peter Willett, et al. Improving subject retrieval in online catalogues: S. Walker, R.M. Jones. (British Library Research Paper 24). British Library, London (1987) xi + 193 pp. £10. ISBN 0-7123-3129-8. (Distributed by Longwood Publishing Group, Wolfeboro, NH, USA.), 1988.