Building a dynamic classifier for large text data collections

Because the web has no built-in navigation tools, people rely on external solutions to find information. The most popular of these are search engines and web directories. Search engines let users locate specific information about a particular topic, whereas web directories support exploration of a broader topic. In recent years, statistical machine learning methods have been applied successfully in search engines, while web directories have remained in a primitive state, which has led to their decline. Exploration, however, answers a different information need and should not be neglected: web directories should provide a user experience of the same quality as search engines. Their development with machine learning methods is hindered by the noisy nature of the web, which makes text classifiers unreliable when applied to web data. In this paper we propose Stochastic Prior Distribution Adjustment (SPDA), a variation of the Multinomial Naive Bayes (MNB) classifier that makes it better suited to classifying real-world data. By stochastically adjusting the class prior distributions, we achieve a better overall success rate and, more importantly, a significantly more even distribution of errors across classes, which makes the classifier equally reliable for all classes and therefore more usable.
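The abstract does not spell out the SPDA update rule, so the following is only an illustrative sketch of the general idea: a small MNB classifier whose class priors are nudged at random sampling steps. The function names (`train_mnb`, `predict`, `adjust_priors`), the update rule (raise the log-prior of the true class and lower that of the wrongly predicted class), and the `steps` and `rate` parameters are all assumptions made for illustration, not the algorithm proposed in the paper.

```python
import math
import random
from collections import Counter


def train_mnb(docs, labels, alpha=1.0):
    """Fit multinomial Naive Bayes with Laplace smoothing (alpha)."""
    classes = sorted(set(labels))
    vocab = sorted({w for d in docs for w in d})
    counts = {c: Counter() for c in classes}
    for d, y in zip(docs, labels):
        counts[y].update(d)
    n = len(labels)
    # Class priors estimated from label frequencies in the training set.
    log_prior = {c: math.log(labels.count(c) / n) for c in classes}
    log_lik = {}
    for c in classes:
        total = sum(counts[c].values()) + alpha * len(vocab)
        log_lik[c] = {w: math.log((counts[c][w] + alpha) / total) for w in vocab}
    return classes, log_prior, log_lik


def predict(doc, classes, log_prior, log_lik):
    """Return the class with the highest log-posterior; unseen words are skipped."""
    def score(c):
        return log_prior[c] + sum(log_lik[c][w] for w in doc if w in log_lik[c])
    return max(classes, key=score)


def adjust_priors(docs, labels, classes, log_prior, log_lik,
                  steps=200, rate=0.1, seed=0):
    """Hypothetical stochastic prior adjustment (an assumption, not the paper's rule).

    At each step one document is sampled at random. If it is misclassified,
    the log-prior of its true class is raised and that of the predicted class
    is lowered. The priors are renormalized at the end.
    """
    rng = random.Random(seed)
    prior = dict(log_prior)
    for _ in range(steps):
        i = rng.randrange(len(docs))
        guess = predict(docs[i], classes, prior, log_lik)
        if guess != labels[i]:
            prior[labels[i]] += rate
            prior[guess] -= rate
    # Renormalize so that the adjusted priors sum to one.
    log_z = math.log(sum(math.exp(v) for v in prior.values()))
    return {c: v - log_z for c, v in prior.items()}


# Toy data, invented purely for illustration.
docs = [["ball", "goal"], ["goal", "team"], ["ball", "team"], ["vote", "law"]]
labels = ["sport", "sport", "sport", "politics"]

classes, log_prior, log_lik = train_mnb(docs, labels)
adjusted = adjust_priors(docs, labels, classes, log_prior, log_lik)
print(predict(["vote", "law"], classes, adjusted, log_lik))
```

In this sketch only the priors change while the word likelihoods stay fixed, which mirrors the abstract's framing of SPDA as a variation of MNB rather than a new model. Whether the paper adjusts priors on held-out data, per class, or with a different step rule cannot be determined from the abstract alone.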
