An unsupervised hierarchical approach to document categorization

We propose a hierarchical approach to document categorization that requires no pre-configuration and maps the semantic document space to a predefined taxonomy. Using search engines to train a hierarchical classifier makes our approach more flexible than existing solutions, which rely on human-labeled data and are bound to a specific domain. We show that the structural information given by the taxonomy allows for a context-aware construction of search queries and leads to higher tagging accuracy. We test our approach on different benchmark datasets and evaluate its performance on the single- and multi-tag assignment tasks. The experimental results show that our solution is as accurate as supervised classifiers for web page classification and still performs well when categorizing domain-specific documents.
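To illustrate what a context-aware search query built from taxonomy structure might look like, the following minimal sketch qualifies a category label with its ancestors, so that an ambiguous label such as "Java" yields different queries under different branches. The abstract does not specify the actual query construction; the function name `build_query`, the quoting scheme, the choice to drop the root label, and the toy category paths are all assumptions made for illustration.

```python
from typing import Sequence


def build_query(path: Sequence[str], skip_root: bool = True) -> str:
    """Build a search query string from a root-to-node taxonomy path.

    Hypothetical scheme (not taken from the paper): the node's own label
    comes first, followed by its ancestors from nearest to farthest, each
    quoted as a phrase. The root label is dropped by default because it
    usually carries no topical information.
    """
    if not path:
        raise ValueError("path must contain at least one label")
    # Drop the root only when there is something left to query with.
    labels = list(path[1:]) if skip_root and len(path) > 1 else list(path)
    # Reverse so that the most specific label leads the query.
    return " ".join(f'"{label}"' for label in reversed(labels))


# Two toy categories that share the leaf label "Java" (illustrative only).
programming = ("Top", "Computers", "Programming", "Java")
island = ("Top", "Regional", "Asia", "Indonesia", "Java")

print(build_query(programming))
print(build_query(island))
```

In such a scheme, the documents returned for each query could serve as training examples for the corresponding taxonomy node, with the ancestor terms disambiguating labels that would otherwise collide.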
