Xtract: An overview

Lexical collocations have particular statistical distributions. We have developed a set of statistical techniques for retrieving and identifying collocations from large textual corpora. These techniques can identify collocations of arbitrary length as well as flexible collocations, and they have been implemented in a lexicographic tool, Xtract, which acquires collocations automatically with high retrieval performance. Xtract works in three stages. The first stage uses a statistical technique to identify word pairs involved in a syntactic relation; the two words can appear in the text in any order and can be separated by an arbitrary number of other words. The second stage uses a technique that extracts n-word collocations (or n-grams) in a much simpler way than related methods; these collocations can involve closed-class words such as particles and prepositions. The third stage is applied to the output of stage one: it parses the sentences containing a given word pair in order to identify the proper syntactic relation between the two words. As a side effect, the third stage filters out a number of irrelevant candidate collocations and thus produces higher-quality output. In this paper we present an overview of Xtract and describe several uses for the tool and the knowledge it retrieves, such as language generation and machine translation.
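To make the first stage more concrete, the following is a minimal Python sketch of one way to find word pairs that co-occur in any order with other words in between. It is not Xtract's actual procedure: the window of 5 words, the z-score threshold `k`, and the function names `pair_counts` and `strong_pairs` are all assumptions made for illustration, and the sentences are toy data.

```python
from collections import Counter
from statistics import mean, pstdev


def pair_counts(sentences, window=5):
    """Count unordered word pairs that occur within `window` words of each other.

    The window size is an assumption of this sketch, not a value taken
    from the abstract.
    """
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, left in enumerate(tokens):
            # Look at most `window` tokens to the right of position i.
            for j in range(i + 1, min(i + 1 + window, len(tokens))):
                right = tokens[j]
                if right != left:
                    # Sorting the pair makes the count order-independent.
                    counts[tuple(sorted((left, right)))] += 1
    return counts


def strong_pairs(counts, word, k=1.0):
    """Return partners of `word` whose count lies at least k standard
    deviations above the mean count over all of `word`'s partners."""
    partners = {}
    for (a, b), c in counts.items():
        if word == a:
            partners[b] = c
        elif word == b:
            partners[a] = c
    if len(partners) < 2:
        return []
    mu = mean(partners.values())
    sigma = pstdev(partners.values())
    if sigma == 0:
        return []
    return sorted(w for w, c in partners.items() if (c - mu) / sigma >= k)


corpus = [
    "the stock market fell sharply",
    "the market for stock options grew",
    "stock prices rose as the market recovered",
    "the bank raised rates",
]
counts = pair_counts(corpus)
print(strong_pairs(counts, "stock"))  # ['market', 'the']
```

On this toy corpus the pair stock/market is found whether "market" follows or precedes "stock". The frequent function word "the" passes the same crude frequency test, which illustrates why raw co-occurrence counts alone leave irrelevant candidates that a later filtering step has to remove.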
