Discovering a Term Taxonomy from Term Similarities Using Principal Component Analysis

We show that eigenvector decomposition can be used to extract a term taxonomy from a given collection of text documents. Until now, methods based on eigenvector decomposition, such as latent semantic indexing (LSI) or principal component analysis (PCA), were known to be useful only for extracting symmetric relations between terms. We give a precise mathematical criterion for distinguishing between four kinds of relations for a given pair of terms in a given collection: unrelated (car – fruit), symmetrically related (car – automobile), asymmetrically related with the first term more specific than the second (banana – fruit), and asymmetrically related in the other direction (fruit – banana). We give theoretical evidence for the soundness of our criterion by showing that, in a simplified mathematical model, it yields the intuitively correct result. We applied our scheme to the reconstruction of a selected part of the Open Directory Project (ODP) hierarchy, with promising results.
