Comparing frequency of word occurrences in abstracts and texts using two stop word lists

Retrieval tests have assumed that the abstract is a true surrogate of the entire text. However, the frequency of terms in abstracts has never been compared with that of the articles they represent. Even though many sources are now available in full text, retrieval often still relies on the abstract. A total of 1,138 articles with their abstracts were downloaded from the Journal of the American Medical Association, the New England Journal of Medicine, the British Medical Journal, and the Lancet. Content-bearing words were extracted from the articles and their abstracts using two stop word lists, one long and one short, and the frequency of each word was counted in both sources. Each article and its abstract were tested using a chi-squared test to determine whether the words in the abstract occurred as frequently as would be expected. Between 96% and 98% of the abstracts tested were not significantly different from random samples of the articles they represented. In these four journals, the abstracts are lexical, as well as intellectual, surrogates for the articles they represent.
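
The per-article comparison described above can be illustrated with a minimal Python sketch. It is not the study's actual procedure, whose details are not given here. The stop word list, the tokenizer, and the function names are hypothetical. The sketch also assumes that expected counts come from the article's relative word frequencies scaled to the abstract's length, and that abstract words absent from the article are ignored.

```python
import re
from collections import Counter

# Hypothetical short stop word list; the study's two lists are not reproduced here.
STOP_WORDS = {"the", "of", "and", "a", "in", "to", "is", "was", "were", "with", "for"}


def content_words(text, stop_words=STOP_WORDS):
    """Count content-bearing words: lowercase alphabetic tokens minus stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in stop_words)


def chi_squared_abstract_vs_article(abstract, article, stop_words=STOP_WORDS):
    """Return (statistic, degrees_of_freedom) for a goodness-of-fit comparison.

    Assumption: the abstract is treated as a sample of the article, so the
    expected count of each word is the abstract's token total multiplied by
    the word's relative frequency in the article.
    """
    article_counts = content_words(article, stop_words)
    abstract_counts = content_words(abstract, stop_words)

    # Assumption: words that occur only in the abstract are ignored, because
    # their expected count under the article's distribution would be zero.
    observed = {word: abstract_counts.get(word, 0) for word in article_counts}
    abstract_total = sum(observed.values())
    article_total = sum(article_counts.values())
    if abstract_total == 0 or article_total == 0:
        raise ValueError("no shared content-bearing words to compare")

    statistic = 0.0
    for word, count in article_counts.items():
        expected = abstract_total * count / article_total
        statistic += (observed[word] - expected) ** 2 / expected

    return statistic, len(article_counts) - 1


if __name__ == "__main__":
    article = "aspirin reduces pain aspirin reduces fever"
    abstract = "aspirin reduces pain"
    stat, df = chi_squared_abstract_vs_article(abstract, article)
    print(f"chi-squared = {stat:.3f}, df = {df}")
```

The resulting statistic would then be compared with the chi-squared distribution for the given degrees of freedom, for example with `scipy.stats.chi2.sf(stat, df)` if SciPy is available. An abstract whose p-value exceeds the chosen significance level would count as not significantly different from a random sample of its article.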