Automatic Construction of N-ary Tree Based Taxonomies

Hierarchies are an intuitive and effective paradigm for organizing data, and there has recently been considerable research on learning hierarchical organizations of data automatically. In this paper, we explore the problem of learning n-ary tree based hierarchies of categories with no user-defined parameters. We propose a framework that characterizes a "good" taxonomy and provide an algorithm to find it. The algorithm runs fully automatically, with no user input, and is significantly less greedy than existing algorithms in the literature. We evaluate our approach on multiple real-life datasets from diverse domains, such as text mining, hyperspectral analysis, and written character recognition. Our experimental results show that n-ary tree based taxonomies are not only more "natural", but that for many datasets the output space decompositions they induce also yield better classification accuracy than classification on binary tree based taxonomies.
