A Comparison of Group and Individual Performance Among Subject Experts and Untrained Workers at the Document Retrieval Task

Useful retrieval depends on the ability to predict which documents a user will find helpful in answer to a query. Our interest is the common case in which no information is provided about the user other than the query, and the query is in natural language. In this setting it is well accepted that a human can make useful predictions, in the form of judgments, about what will likely prove useful to another human. We present data showing that when the predictions of a group of humans are averaged, the result is a better predictor. If performance is measured as precision, the group performance increases with the size of the group and approaches a limit of approximately 50% improvement over average individual performance on our data. Superior performance by groups raises the question of how. The groups we studied were subject experts, and a natural question was whether the superior performance resulted from the pooling of their subject knowledge. In order to answer this question we also studied a group of untrained individuals. To our surprise we found that while untrained individuals had a somewhat inferior performance compared to trained individuals, the group of untrained individuals together performed better than any single trained individual and almost at the level of the trained group.

Introduction

While improved general algorithms for answering natural language queries have been a goal of IR research, relatively simple key-term weighting schemes have remained among the top performers (Norvig, 1994; Salton, 1991), and the difficulty of finding more effective methods has been frequently commented on (Croft, 1993; Lewis & Jones, 1996; Norvig, 1994; Salton, 1991; Sembok & van Rijsbergen, 1990). Because humans are the only competent practitioners of natural language, they both provide the judgments by which all methods of retrieval must be rated and also hold the key to improved retrieval performance. Over the past several years, as a by-product of work on a retrieval system that we use for MEDLINE record processing and retrieval, we have accumulated two test sets of MEDLINE documents in the area of molecular biology with multiple human judgments of relevance made by people knowledgeable in the field. In a recent study of these data (Wilbur, 1996) we found that a panel of judges whose votes are weighted equally is able to predict, better than any individual, which documents are useful in answer to a query. In that work it was argued that, because the panel out-performed the individuals, the basis for the panel's superiority could not be common sense or common linguistic abilities, which we virtually all possess uniformly. Such abilities could not place the group significantly ahead of the individual. We hypothesized that the group's superior performance must be a consequence of detailed subject knowledge in the area of the documents involved, knowledge that is possessed by the members of the group collectively but not uniformly by any one member. This hypothesis seemed to point in a plausible direction for progress, namely: build detailed subject knowledge into the system if you want to improve retrieval.

Here we report the results of a further study which contradicts the knowledge hypothesis just described. In this study a panel of judges without training or background in molecular biology performed the same judgment tasks as the previous panel, whose members were highly trained in molecular biology. A comparison of the two panels shows that the untrained panel performs better than any one of the members of the trained panel and almost at the level of the trained panel as a whole. The conclusion thus seems unavoidable that the knowledge necessary to improve current retrieval methods need not include detailed subject knowledge. Further, it raises the question whether even highly trained individuals use de-

Received February 3, 1997; revised April 16, 1997; accepted May 20, 1997.
© 1998 John Wiley & Sons, Inc.
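The equally weighted panel vote and the precision measure discussed above can be written compactly. The following is a sketch in our own notation, not the paper's: let $r_i(d) \in \{0,1\}$ denote judge $i$'s relevance judgment of document $d$ for a given query, with $n$ judges on the panel.

$$s(d) \;=\; \frac{1}{n}\sum_{i=1}^{n} r_i(d), \qquad
P@k \;=\; \frac{1}{k}\,\bigl|\{\, d \in D_k : d \text{ is relevant} \,\}\bigr|,$$

where documents are ranked by the pooled score $s(d)$ and $D_k$ is the set of the top $k$ documents under that ranking. On this reading, the claim that group performance approaches roughly a 50% improvement over average individual performance says that $P@k$ computed from the pooled ranking exceeds the mean of the $P@k$ values obtained from the individual judges' rankings by about that factor as $n$ grows.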