Is Knowledge-Free Induction of Multiword Unit Dictionary Headwords a Solved Problem?

We seek a knowledge-free method for inducing multiword units from text corpora for use as machine-readable dictionary headwords. We provide two major evaluations of nine existing collocation-finders and illustrate the continuing need for improvement. We use Latent Semantic Analysis to make modest gains in performance, but we show the significant challenges encountered in trying this approach.
