Hierarchical document categorization using associative networks

Associative networks are connectionist language models that can handle dynamic data. We used two associative networks to categorize random sets of related Wikipedia articles, with only their raw text as input. We then compared the resulting categorizations to a gold standard, the manual categorization by Wikipedia authors, and used a neural network as a baseline. We also obtained human ratings by having a group of judges rank the four categorization methods by correctness and by usefulness with regard to finding information. Based on these tests, we determined that associative networks produce results that are clearly better than the neural network baseline and come close to the gold standard in terms of usefulness and correctness. Furthermore, automated testing suggests that these results continue to hold for large datasets.
