SuperMatrix: a General Tool for Lexical Semantic Knowledge Acquisition

The paper presents SuperMatrix, a system designed as a general tool supporting the automatic acquisition of lexical semantic relations from corpora. The construction of the system is discussed, and examples of different applications are given to show the potential of SuperMatrix. The core of the system is the construction of coincidence (co-occurrence) matrices from corpora written in any natural language; this is possible because the system works on UTF-8 encoded text and has a modular architecture. SuperMatrix follows the general scheme of distributional methods. Many different matrix transformations and similarity computation methods have been implemented, so the majority of existing measures of semantic relatedness are re-implemented in the system. The system also supports the evaluation of the extracted measures by means of tests originating from the idea of the WordNet-Based Synonymy Test. For Polish, SuperMatrix includes an implementation of a language of lexico-syntactic constraints, which provides a means of shallow syntactic processing. SuperMatrix also processes multiword expressions, both as the lexical units being described and as elements of the description. Processing can be distributed, since a number of matrix operations were implemented, and the system can handle very large matrices.
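The general scheme of distributional methods mentioned above (build a coincidence matrix, transform it, compute similarity between rows) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not taken from SuperMatrix itself: the toy corpus, the window-based context definition, the positive PMI weighting, the cosine measure, and all function names are hypothetical stand-ins for one possible combination of matrix transformation and similarity function.

```python
import math
from collections import Counter, defaultdict


def cooccurrence_matrix(sentences, window=2):
    """Sparse word-by-context matrix: counts[word][context] is the number of
    times `context` occurs within `window` tokens of `word`.
    (Window-based contexts are an assumption, not the system's own features.)"""
    counts = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[word][tokens[j]] += 1
    return counts


def ppmi(counts):
    """Matrix transformation: replace raw counts with positive pointwise
    mutual information, dropping cells whose PMI is not positive."""
    total = sum(sum(row.values()) for row in counts.values())
    row_sums = {w: sum(row.values()) for w, row in counts.items()}
    col_sums = Counter()
    for row in counts.values():
        col_sums.update(row)
    weighted = {}
    for w, row in counts.items():
        weighted[w] = {}
        for c, n in row.items():
            pmi = math.log2(n * total / (row_sums[w] * col_sums[c]))
            if pmi > 0:
                weighted[w][c] = pmi
    return weighted


def cosine(u, v):
    """Cosine similarity between two sparse row vectors stored as dicts."""
    dot = sum(x * v.get(k, 0.0) for k, x in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy corpus; Python strings are Unicode, so any language works.
corpus = [
    "the cat drinks milk".split(),
    "the dog drinks water".split(),
    "the cat chases the mouse".split(),
    "the dog chases the cat".split(),
]

counts = cooccurrence_matrix(corpus, window=2)
weighted = ppmi(counts)

# Rank all other words by their similarity to "cat".
ranking = sorted(
    ((cosine(weighted["cat"], weighted[w]), w) for w in weighted if w != "cat"),
    reverse=True,
)
print(ranking[:3])
```

Storing rows as dictionaries keeps the matrix sparse, which matters once the vocabulary grows. Swapping `ppmi` or `cosine` for another function is enough to try a different weighting or relatedness measure on the same counts.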
