Linguistic features to predict query difficulty - a case study on previous TREC campaigns

Query difficulty can be linked to a number of causes. Some of these causes relate to the query expression itself, and can therefore be detected through a linguistic analysis of the query text. Using 16 different linguistic features, automatically computed on TREC queries, we looked for significant correlations between these features and the average recall and precision scores obtained by systems. Each of these features can be viewed as a clue to a linguistically specific characteristic, either morphological, syntactical or semantic. Two of these features (syntactic links span and polysemy value) are shown to have a significant impact on either recall or precision scores for previous ad hoc TREC campaigns. Although the correlation values are not very high, they indicate a promising link between some linguistic characteristics and query difficulty.

1. CONTEXT

This study has been conducted in the context of the ARIEL research project, in which we investigate the impact of linguistic processing in IR systems. The ultimate objective is to build an adaptive IR system, in which several natural language processing (NLP) techniques are available, but are selectively used for a given query, depending on the predicted efficiency of each technique.

2. OBJECTIVE

Although linguistics and NLP have been viewed as natural solutions for IR, the overall efficiency of the techniques used in IR systems is doubtful at best. From fine-grained morphological analysis to query expansion based on semantic word classes, linguistically sound techniques and resources have often proven to be only as efficient as other, cruder techniques [5] [8]. In this paper, we consider linguistics as a way to predict query difficulty rather than as a means to model IR.

3. RELATED WORK

A closely related approach is the analysis performed by [7] on the CLEF topics. Their intent was to discover whether some query features could be correlated with system performance, and thus indicate a kind of bias in this evaluation campaign, and further to build a fusion-based IR engine. The linguistic features they used to describe each topic mostly concerned syntactic aspects and word forms, and were calculated by hand. They used a correlation measure between these features and the average precision, but the only significant result was a correlation of 0.4 between the number of proper nouns and average precision. Further studies led the authors to identify named entities as a useful feature, and they were able to propose a fusion-based model that improved overall precision after a classification of topics according to the number of named entities. The precision increase using this feature varied from 0 to 10% across several tasks (mono- and multi-lingual). Our study deals with more linguistic features, especially in order to address syntactic complexity. In addition, we only used automatic analysis methods based on NLP techniques.

Focusing on documents instead of queries, [6] also used linguistic features, in order to characterize documents in IR collections. His main point was to study the notion of relevance, and to test whether it could be related to stylistic features, and whether the genre of a document could be useful for relevant document selection. [3] also used documents in order to predict query difficulty, through a clarity score that depends on both the query and the target collection.
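For reference, the clarity score of [3] compares a language model estimated from the query with the collection language model, using a relative entropy measure. The following is a minimal sketch of that comparison only, leaving aside the query-model estimation and smoothing details of [3]:

```python
import math

def clarity_score(query_model, collection_model):
    """Relative entropy (KL divergence) between a query language
    model and the collection language model:
        sum over w of P(w|Q) * log2(P(w|Q) / P(w|Coll)).
    Both arguments map words to probabilities; words missing from
    the collection model are skipped here for simplicity."""
    score = 0.0
    for word, p_q in query_model.items():
        p_c = collection_model.get(word)
        if p_q > 0 and p_c:
            score += p_q * math.log2(p_q / p_c)
    return score
```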
Both of the previous studies therefore need exhaustive information on the collection, while we decided to focus on queries only, in order to address a wider range of IR situations. In [2], several classes of topic failures were drawn manually, but no elements were given on how to assign a topic to a category automatically.

4. METHOD

We selected the following data: TREC 3, 5, 6 and 7 results for the ad hoc task, corresponding to a total of 200 queries (50 per year). Each query in these collections was automatically analysed and described with 16 variables, each corresponding to a specific linguistic feature. We considered the title part of the query, as its length and format are the closest to a real user's query. Because the TREC web site makes participants' runs available (i.e. the lists of documents retrieved for each query), it was possible to compute the recall and precision scores for each run and each query (using the trec_eval utility). We then computed the average recall and precision values over runs for each query. Finally, we computed the correlation between these scores and the linguistic feature variables, and tested the correlation values for statistical significance.

As a first result, while simple features such as the number or length of the words in a query, or the presence of certain parts of speech, have no clear effect on query difficulty, more sophisticated variables led to interesting results. Globally, the syntactic complexity of a query has a negative impact on precision scores, and the semantic ambiguity of the query words has a negative impact on recall scores. Somewhat less significantly, the morphological complexity of words also has a negative effect on recall.

4.1. Linguistic Features

The use of linguistic features to study a document is a well-known technique. It has been used thoroughly in several NLP tasks, ranging from classification to genre analysis. The principle is quite simple: the text (i.e. the query, in our case) is first analysed using generic parsing techniques (e.g. part-of-speech tagging, chunking and parsing). Based on the tagged text data, simple programs compute the corresponding information. We used:

- TreeTagger for part-of-speech tagging and lemmatisation: this tool assigns a single morphosyntactic category to each word in the input text, based on a general lexicon and a language model (TreeTagger, by H. Schmid, is available at www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/);
- Syntex [4] for shallow parsing (syntactic link detection): this analyser identifies syntactic relations between the words in a sentence, based on grammatical rules.

In addition, we used the following resources:

- the WordNet 1.6 semantic network, to compute semantic ambiguity: this database provides, among other information, the possible meanings of a given word;
- the CELEX database, for derivational morphology: this resource gives the morphological decomposition of a given word.

In accordance with the final objective, which is an automatic classification of queries, all the features considered are computed without any human intervention, and are as such prone to processing errors.
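As an illustration of how such features can be derived automatically, here is a minimal sketch computing two simple features and an average polysemy value for a title query. It uses NLTK's tokeniser and WordNet interface as freely available stand-ins for the tools above (not the actual TreeTagger/Syntex/WordNet 1.6 pipeline, so exact values would differ); the feature names follow Table 1 below, and POLYSEMY is a hypothetical label for the semantic ambiguity feature:

```python
import nltk                            # requires the 'punkt' tokenizer model
from nltk.corpus import wordnet as wn  # requires the 'wordnet' corpus

def query_features(title):
    """Compute NBWORDS, LENGTH and an average polysemy value
    for a title query."""
    words = [t for t in nltk.word_tokenize(title) if t.isalpha()]
    if not words:
        return {}
    nbwords = len(words)
    length = sum(len(w) for w in words) / nbwords  # average word length
    # Average number of WordNet senses per word; words unknown to
    # WordNet are counted as monosemous here.
    polysemy = sum(max(len(wn.synsets(w)), 1) for w in words) / nbwords
    return {"NBWORDS": nbwords, "LENGTH": length, "POLYSEMY": polysemy}

print(query_features("airport security measures"))
```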
The 16 linguistic features we computed are listed in Table 1, categorized in three different classes according to their level of linguistic analysis:

Table 1: List of linguistic features

Morphological features:
  NBWORDS    number (#) of words
  LENGTH     average word length
  MORPH      average # of morphemes per word
  SUFFIX     average # of suffixed tokens
  PN         average # of proper nouns
  ACRO       average # of acronyms
  NUM        average # of numeral values (dates, quantities, etc.)
  UNKNOWN    average # of unknown tokens

Syntactical features:
  CONJ       average # of conjunctions
  PREP       average # of prepositions
  PP         average # of personal pronouns
  SYNTDEPTH  average syntactic depth
  SYNTDIST   average syntactic links span
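Once the feature values and the per-query average recall and precision scores are available, the correlation analysis described in Section 4 amounts to the following sketch, using scipy's Pearson correlation test; the numeric values here are hypothetical placeholders, not TREC data:

```python
from scipy.stats import pearsonr

# One value per query (hypothetical placeholders): a linguistic
# feature, e.g. SYNTDIST, and the average precision over all runs.
syntdist = [1.0, 1.4, 2.1, 1.2, 1.8]
avg_precision = [0.35, 0.30, 0.18, 0.33, 0.22]

# Pearson correlation coefficient and the two-sided p-value testing
# the null hypothesis of no linear correlation.
r, p_value = pearsonr(syntdist, avg_precision)
print(f"r = {r:.2f}, p = {p_value:.3f}")
```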