Abstract

It is generally agreed that natural language database query systems need to be tailored for each application in which they are used. This process, often called customization, involves, among other things, specifying the domain vocabulary and the grammatical properties of that vocabulary. A previous paper discussed techniques for automatically generating domain vocabularies from large text collections. This paper continues that line of research, focusing on the problem of generating multi-word vocabulary terms (specifically pairs). It also discusses some of the statistical issues associated with word co-occurrences likely to be of use in a natural language interface for the application in question. Most importantly, it attempts to provide a more objective evaluation of the selection procedures used. Absent substantial experimentation with subjects using a working query system, all evaluation is necessarily subjective. As a surrogate for such experimentation, this paper relies on pre-existing dictionaries as indicators of domain relevance.
[1] Stephen P. Harter, et al. A probabilistic approach to automatic keyword indexing. 1974.
[2] Gerard Salton, et al. Automatic text processing. 1988.
[3] Gerard Salton, et al. Automatic text structuring and retrieval-experiments in automatic encyclopedia searching. SIGIR '91, 1991.
[4] H. P. Edmundson, et al. Automatic abstracting and indexing—survey and recommendations. CACM, 1961.
[5] Frank A. Smadja, et al. Lexical Co-occurrence: The Missing Link. 1989.
[6] S. Siegel, et al. Nonparametric Statistics for the Behavioral Sciences. The SAGE Encyclopedia of Research Design, 2022.
[7] Paul Procter, et al. Longman Dictionary of Contemporary English. 1978.
[8] Fred J. Damerau. Evaluating computer-generated domain-oriented vocabularies. Inf. Process. Manag., 1990.
[9] Fred J. Damerau. Problems and some solutions in customization of natural language database front ends. TOIS, 1985.