An analysis of ill-formed input in natural language queries to document retrieval systems

Abstract: We analyzed natural language document retrieval queries from the Thomas Cooper Library at the University of South Carolina to investigate the frequency of various types of ill-formed input, such as spelling errors, co-occurrence violations, conjunctions, ellipsis, and missing or incorrect punctuation. The primary aim was to determine whether there is a significant need to study ill-formed input in detail. We found that most of the queries were sentence fragments and that many of them contained some type of ill-formed input. Conjunctions caused the most problems, followed by punctuation errors. Spelling errors occurred in a small number of the queries. The remaining types considered, ellipsis and co-occurrence violations, were not found in the queries.
