Abstract: Semantic clustering is important to various fields in the modern information society. In this work we apply Independent Component Analysis (ICA) to extract the features of latent concepts. We use verb and object-noun information and formulate a concept as a linear combination of verbs. The proposed method is shown to be suitable for our framework, and it performs better than hierarchical clustering in latent semantic space at uncovering information that is not directly visible in the data.

Keywords: Independent Component Analysis, Clustering, Latent Concepts.

1. Introduction

Our purpose in this work is to extract features of concepts from verb and noun contexts. The extracted verb concepts benefit many applications in natural language processing, such as word sense disambiguation and automatic thesaurus building. To this end we propose applying Independent Component Analysis (ICA), a method widely used in signal processing, especially in blind source separation. Some contributions have also been made in the text processing area, for document analysis and for linguistic feature extraction. Briefly, in document analysis studies (such as [2] and [3]) a term-by-document matrix is treated as a linear mixture of a set of independent sources. The semantic spaces built in these works take the form of non-orthogonal term-occurrence histograms. In the linguistic feature extraction work of [4], the contextual information of words in a raw corpus is used to extract the linguistic features of those words. Beyond pursuing a different purpose, our contribution differs from these previous studies in (1) using a verb-noun pattern and (2) modeling the ICA from the "concept" point of view. As mentioned above, our model uses verb-noun information for a cognitive task concerning latent concepts and the membership of nouns in those concepts.
We formulate a concept as a linear combination of verbs/predicates. The ICA decomposes the verb-noun matrix into two parts: the first gives the weights of the verbs in the latent concepts, and the second gives the weights of the nouns in those concepts. See Fig. 2 for a more detailed illustration. In the following section we give a brief introduction to the ICA and then describe its adaptation to extract a concept's feature verbs from a verb-noun distributional pattern. Section 4 describes the experimental results, followed by the conclusion and suggestions for future work.
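The two-part decomposition described above can be illustrated with a minimal sketch. It assumes that verbs index the rows and nouns the columns of the co-occurrence matrix, and it uses a plain fixed-point ICA iteration with a tanh nonlinearity in the style of [6]. The function name fastica_concepts, the toy counts, the number of concepts, and the choice of nonlinearity are all illustrative assumptions, not the configuration used in the experiments.

```python
import numpy as np

def fastica_concepts(X, n_concepts, n_iter=200, seed=0):
    """Decompose a verb-by-noun matrix so that X_centered ~ A @ S.

    A (verbs x concepts): weight of each verb in each latent concept.
    S (concepts x nouns): weight of each noun in each latent concept.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n_nouns = X.shape[1]
    Xc = X - X.mean(axis=1, keepdims=True)   # centre each verb row over nouns

    # Whitening via SVD, keeping the n_concepts strongest directions.
    U, d, _ = np.linalg.svd(Xc, full_matrices=False)
    U, d = U[:, :n_concepts], d[:n_concepts]
    K = (U / d).T * np.sqrt(n_nouns)         # whitening matrix (concepts x verbs)
    Z = K @ Xc                               # whitened data: Z @ Z.T / n_nouns = I

    def sym_decorrelate(W):
        # W <- (W W^T)^(-1/2) W, which makes the rows of W orthonormal.
        s, E = np.linalg.eigh(W @ W.T)
        s = np.clip(s, 1e-12, None)          # guard against round-off
        return E @ np.diag(1.0 / np.sqrt(s)) @ E.T @ W

    W = sym_decorrelate(rng.standard_normal((n_concepts, n_concepts)))
    for _ in range(n_iter):
        G = np.tanh(W @ Z)
        # Fixed-point update: E[z g(w^T z)] - E[g'(w^T z)] w, for every row w.
        W_new = (G @ Z.T) / n_nouns - np.diag((1.0 - G**2).mean(axis=1)) @ W
        W_new = sym_decorrelate(W_new)
        converged = np.max(np.abs(np.abs(np.sum(W_new * W, axis=1)) - 1.0)) < 1e-8
        W = W_new
        if converged:
            break

    S = W @ Z                                # noun weights (concepts x nouns)
    A = np.linalg.pinv(W @ K)                # verb weights (verbs x concepts)
    return A, S

# Made-up toy counts (illustrative only): rows are verbs, columns are object nouns.
verbs = ["eat", "drink", "drive", "ride"]
nouns = ["apple", "bread", "water", "juice", "car", "bus", "bike", "horse"]
X = np.array([
    [9, 8, 0, 1, 0, 0, 0, 0],
    [0, 1, 9, 8, 0, 0, 0, 0],
    [0, 0, 0, 0, 9, 7, 1, 0],
    [0, 0, 0, 0, 1, 4, 8, 9],
])

A, S = fastica_concepts(X, n_concepts=2)

for c in range(A.shape[1]):
    verb_order = np.argsort(-np.abs(A[:, c]))
    noun_order = np.argsort(-np.abs(S[c]))
    print("concept", c,
          "verbs:", [verbs[i] for i in verb_order[:2]],
          "nouns:", [nouns[j] for j in noun_order[:3]])
```

In this sketch each column of A expresses one latent concept as a weighted combination of verbs, and the matching row of S gives the weight of every noun in that concept. As in any ICA, the sign and order of the recovered concepts are arbitrary, so the weights are ranked by absolute value.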
[1] Kenneth Ward Church, et al. Word Association Norms, Mutual Information, and Lexicography, 1989, ACL.
[2] Aapo Hyvärinen, et al. Emergence of Linguistic Features: Independent Component Analysis of Contexts, 2005.
[3] L. K. Hansen, et al. Independent Components in Text, 2000.
[4] Hinrich Schütze, et al. Book Reviews: Foundations of Statistical Natural Language Processing, 1999, CL.
[5] Richard A. Harshman, et al. Indexing by Latent Semantic Analysis, 1990, J. Am. Soc. Inf. Sci.
[6] Aapo Hyvärinen, et al. Fast and robust fixed-point algorithms for independent component analysis, 1999, IEEE Trans. Neural Networks.
[7] Ella Bingham, et al. ICA and SOM in text document analysis, 2002, SIGIR '02.