On the role of words and phrases in automatic text analysis

One of the most crucial operations in automatic information retrieval is the assignment to written texts and documents of appropriate identifiers capable of representing information content for search and retrieval purposes. This operation, known as automatic indexing, normally consists of assigning to the documents either single terms, more specific entities such as phrases, or more general entities such as term classes. A model known as discrimination value analysis is introduced, which assigns an appropriate role in the indexing operation to terms, term phrases, and thesaurus classes. The model is used to determine effectiveness criteria for the content identifiers and to generate useful indexing policies. Experimental evidence is given to validate the theory.