Data analysis of conceptual similarities of Finnish verbs

Krista Lagus (krista.lagus@hut.fi)
Neural Networks Research Centre, Helsinki University of Technology
P.O.Box 9800, 02015 HUT, Finland

Anu Airola (anu.airola@helsinki.fi)
Department of General Linguistics, University of Helsinki
P.O.Box 9, 00014 University of Helsinki, Finland

Mathias Creutz (mathias.creutz@hut.fi)
Neural Networks Research Centre, Helsinki University of Technology
P.O.Box 9800, 02015 HUT, Finland

Abstract

The study of the conceptual representations that underlie the use of language is a problem motivated both from a cognitive research point of view and from that of constructing language models for various language processing tasks. In this work, we organized 600 Finnish verbs using the self-organizing map (SOM) algorithm. Three experiments were conducted using different features to encode the verbs: morphosyntactic properties, individual nouns, and noun categories in the context of the verb. In general, the morphosyntactic properties seem to draw attention to semantic roles, whereas nouns as features seem to highlight clusters formed on the basis of topics in the text.

Introduction

Observation of language use provides indirect evidence of the representations that humans utilize. The study of conceptual representations that underlie the use of language is important for applications such as speech recognition. Due to the redundancy in communication, by studying large amounts of data it may be possible to induce the conceptual, system-internal representations which provide a grounding for meanings of words. Whether this is possible, and if so, how, is an interesting and controversial question.

A central problem in learning a language or in estimating a language model¹ from data is how to generalize from particular observations to new, similar instances. Generalization requires knowledge of similarities between words, concepts and other units of language and thought, i.e., similarity representations.
¹ For an introduction to statistical language modeling, see Manning & Schütze (1999). Applications of language models include speech recognition, machine translation, and dialogue agents that converse with humans in order to perform tasks such as answering questions about train schedules and booking flights.

The hypothesis that the semantic similarity of two words correlates strongly with the similarity of their contexts has been widely discussed in linguistics and psychology (for recent treatments, see Levin, 1993 and Miller & Charles, 1991).

It has been proposed by Gärdenfors that a central part of our conceptual representations is grounded in various low-dimensional conceptual spaces. A conceptual space is defined as a set of quality dimensions with a geometrical structure (Gärdenfors, 2000). Examples of conceptual spaces close to our perceptual apparatus are colors and the pitch of sounds. For many higher-order concepts, a geometric interpretation can be found as well. For example, comparative relations such as ‘longer than’ can be represented as a geometric relation between two elementary length spaces. Gärdenfors proposes a subset of concepts called natural concepts:

"A natural concept is represented as a set of regions in a number of domains together with an assignment of salience weights to the domains and information about how the regions in different domains are correlated."

An inherent and important property of the proposed conceptual spaces is that they provide a meaning representation that is ordered and offers means for representing similarities, often in terms of some continuous-valued underlying qualities.

Gärdenfors gives examples of conceptual spaces that humans are likely to have. However, an open research question remains for both brain research and the study of language use: What are the possible conceptual dimensions that humans utilize?
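As a rough illustration of the geometric reading of conceptual spaces, the following Python sketch treats a space as a small set of numeric quality dimensions and compares two points with a salience-weighted Euclidean distance. The choice of metric, the dimension names, the per-dimension weights (standing in for salience weights on domains), and all numeric values are assumptions made for illustration; they are not taken from Gärdenfors (2000) or from this study.

```python
from math import sqrt

# Hypothetical quality dimensions of a toy "sound" space (assumed, illustrative).
DIMENSIONS = ("pitch", "loudness")

def weighted_distance(x, y, salience):
    """Salience-weighted Euclidean distance between two points in the space."""
    return sqrt(sum(w * (a - b) ** 2 for a, b, w in zip(x, y, salience)))

# Two invented sounds, described by (pitch, loudness) on arbitrary scales.
tone_a = (0.0, 0.0)
tone_b = (3.0, 4.0)

# Equal salience: both dimensions contribute to dissimilarity.
d_equal = weighted_distance(tone_a, tone_b, salience=(1.0, 1.0))

# Attending only to pitch: loudness differences are ignored.
d_pitch_only = weighted_distance(tone_a, tone_b, salience=(1.0, 0.0))
```

Under these assumptions, changing the salience weights changes which points count as similar, which is one possible way to read the role of salience in the definition of a natural concept.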
In this work we analyze the use of Finnish² verbs with the following goals in mind: (a) to uncover possible conceptual spaces, i.e., underlying, organizing semantic qualities or properties, and (b) to study semantic similarities of verbs in actual language use. In particular, we examine the kinds of semantic or conceptual ordering qualities that appear to affect the distribution of features in the immediate context of a verb, namely (1) morphosyntactic properties of nearby words, (2) the nearby nouns, and (3) unsupervised categories of nearby nouns. In effect, we rely on the redundancy in communication and assume that certain regularities observed in the distributions of verb contexts will contain significant information about the semantics of the verb as well.

² Most research on language is carried out using English data only, which creates an overly narrow or misleading picture of the modeling apparatus underlying language learning and use.
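The idea of describing a verb by the distribution of nouns in its immediate context can be sketched in a few lines of Python. This is a minimal illustration only: the toy corpus, the part-of-speech tags, the window size, and the use of cosine similarity are all assumptions, and the sketch compares raw context vectors directly rather than organizing them with a SOM as in the study.

```python
from collections import Counter
from math import sqrt

# Hypothetical toy corpus: each sentence is a list of (token, part-of-speech)
# pairs. Words and tags are invented for illustration; inflected surface forms
# are used as-is for simplicity.
CORPUS = [
    [("mies", "N"), ("syö", "V"), ("leipää", "N")],
    [("nainen", "N"), ("syö", "V"), ("puuroa", "N")],
    [("mies", "N"), ("juo", "V"), ("maitoa", "N")],
    [("nainen", "N"), ("juo", "V"), ("kahvia", "N")],
    [("bussi", "N"), ("ajaa", "V"), ("kaupunkiin", "N")],
]

def context_vectors(corpus, window=2):
    """Count the nouns occurring within `window` tokens of each verb."""
    vectors = {}
    for sentence in corpus:
        for i, (token, pos) in enumerate(sentence):
            if pos != "V":
                continue
            lo, hi = max(0, i - window), min(len(sentence), i + window + 1)
            counts = vectors.setdefault(token, Counter())
            for j in range(lo, hi):
                other, other_pos = sentence[j]
                if j != i and other_pos == "N":
                    counts[other] += 1
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

vectors = context_vectors(CORPUS)
sim_eat_drink = cosine(vectors["syö"], vectors["juo"])   # verbs sharing subject nouns
sim_eat_drive = cosine(vectors["syö"], vectors["ajaa"])  # verbs with disjoint contexts
```

In this toy setting the two verbs that share context nouns come out as more similar than the pair with disjoint contexts. The same kind of vectors could in principle be built from morphosyntactic features or from noun categories instead of individual nouns.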

[1] Hinrich Schütze, et al. Book Reviews: Foundations of Statistical Natural Language Processing, 1999, CL.

[2] Nick Chater, et al. Distributional Information: A Powerful Cue for Acquiring Syntactic Categories, 1998, Cogn. Sci.

[3] Beth Levin, et al. English Verb Classes and Alternations: A Preliminary Investigation, 1993.

[4] Jorma Laaksonen, et al. SOM_PAK: The Self-Organizing Map Program Package, 1996.

[5] Samuel Kaski, et al. Self organization of a massive document collection, 2000, IEEE Trans. Neural Networks Learn. Syst.

[6] Teuvo Kohonen, et al. Self-Organizing Maps, 2010.

[7] Maria Vilkuna, et al. Free word order in Finnish: its syntax and discourse functions, 1991.

[8] Maria Vilkuna, et al. Free word order in Finnish, 1989.

[9] Sabine Schulte im Walde. Clustering Verbs Semantically According to their Alternation Behaviour, 2000, COLING.

[10] Richard A. Harshman, et al. Indexing by Latent Semantic Analysis, 1990, J. Am. Soc. Inf. Sci.

[11] Esa Alhoniemi, et al. SOM Toolbox for Matlab 5, 2000.

[12] Hang Li, et al. Word Clustering and Disambiguation Based on Co-occurrence Data, 1998, COLING.

[13] Samuel Kaski, et al. Dimensionality reduction by random mapping: fast similarity computation for clustering, 1998, 1998 IEEE International Joint Conference on Neural Networks Proceedings. IEEE World Congress on Computational Intelligence (Cat. No.98CH36227).

[14] Giovanni Da San Martino. Self-Organizing Maps in Natural Language Processing, 2003.

[15] Stephen I. Gallant, et al. HNC's MatchPlus system, 1992, SIGF.

[16] Marc Light, et al. Morphological Cues for Lexical Semantics, 1996, ACL.

[17] Eugene Charniak, et al. Statistical language learning, 1997.

[18] Timo Honkela, et al. Self-Organizing Maps In Natural Language Processing, 1997.

[19] Curt Burgess, et al. Producing high-dimensional semantic spaces from lexical co-occurrence, 1996.

[20] Naftali Tishby, et al. Distributional Clustering of English Words, 1993, ACL.

[21] T. Kohonen, et al. Self-organizing semantic maps, 1989, Biological Cybernetics.

[22] Teuvo Kohonen, et al. Self-organized formation of topologically correct feature maps, 2004, Biological Cybernetics.

[23] Timo Honkela, et al. Self-Organizing Maps of Document Collections: A New Approach to Interactive Exploration, 1996, KDD.

[24] G. Miller, et al. Contextual correlates of semantic similarity, 1991.