Getting One's First Million... Collocations

Many years of experience in creating a very large database of Russian collocations are summarized. The collocations described here are syntactically connected and semantically compatible pairs of content components (single words or multiword expressions). We begin with a synopsis of various applications of collocation databases (CDBs). Then we describe the main features of collocation components, the syntactic types of collocations, and other kinds of links between their components that broaden the applicability of the systems that contain them. All of this characterizes the CrossLexica system, created for Russian but with a universal structure suited to other languages. Statistics of CrossLexica are given and discussed. It now contains more than a million collocations and more than a million WordNet-like links.
