Automated Extraction of Lexical Meanings from Corpus: A Case Study of Potentialities and Limitations

Linguists often consult large corpora as a source of knowledge about the lexicon, morphology, or syntax. However, there are also several methods for automatically extracting semantic properties of language units from corpora. In this paper, we focus on the emerging potential of these methods as well as on their identified limitations. Evidence that can be collected from corpora is compared against existing models for the formalised description of lexical meanings. Two basic paradigms of lexical semantics extraction are briefly described, and their properties are analysed on the basis of several experiments performed on Polish corpora. Several potential applications of the methods, including a system supporting the expansion of a Polish wordnet, are discussed. Finally, we outline perspectives for further development.
