Intelligent fusion of structural and citation-based evidence for text classification

This paper shows how different measures of similarity, derived from the citation information and the structural content (e.g., title, abstract) of the documents in a collection, can be fused to improve classification effectiveness. To discover the best fusion framework, we apply Genetic Programming (GP) techniques. Our experiments with the ACM Computing Classification Scheme, using documents from the ACM Digital Library, indicate that GP can discover similarity functions superior to those based solely on a single type of evidence. When applied through simple majority voting, the discovered similarity functions are more effective than both content-based and combination-based Support Vector Machine classifiers. We also compared GP with other fusion techniques, such as Genetic Algorithms (GA) and linear fusion. Empirical results show that GP discovered better similarity functions than these other techniques.
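To make the idea of fusing evidence concrete, the following is a minimal Python sketch, not the method evaluated in the paper. It assumes one content-based measure (cosine similarity over term weights) and one citation-based measure (bibliographic coupling, here normalised in Jaccard form), combines them in a hand-written expression standing in for a formula that GP might evolve, and classifies a document by simple majority voting among its most similar training documents. All function names, the fused formula, the neighbourhood size `k`, and the toy data are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Content evidence: cosine similarity of two sparse term-weight dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def bib_coupling(refs_a, refs_b):
    """Citation evidence: shared references, normalised by the union size."""
    union = refs_a | refs_b
    return len(refs_a & refs_b) / len(union) if union else 0.0


def fused_similarity(doc_a, doc_b):
    """Hypothetical fused measure; a GP run would search for such a formula."""
    content = cosine(doc_a["terms"], doc_b["terms"])
    citation = bib_coupling(doc_a["refs"], doc_b["refs"])
    return content + math.sqrt(citation) * content + citation


def classify(query, training, k=3):
    """Label the query by majority vote among its k most similar documents."""
    ranked = sorted(training, key=lambda d: fused_similarity(query, d), reverse=True)
    votes = Counter(d["label"] for d in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy data (invented for illustration only).
training = [
    {"terms": {"genetic": 1.0, "programming": 1.0}, "refs": {"koza93"}, "label": "AI"},
    {"terms": {"genetic": 1.0, "algorithm": 1.0}, "refs": {"koza93", "gordon88"}, "label": "AI"},
    {"terms": {"index": 1.0, "retrieval": 1.0}, "refs": {"salton88"}, "label": "IR"},
    {"terms": {"ranking": 1.0, "retrieval": 1.0}, "refs": {"salton88", "robertson94"}, "label": "IR"},
]
query = {"terms": {"genetic": 1.0, "fusion": 1.0}, "refs": {"koza93"}}

if __name__ == "__main__":
    print(classify(query, training))
```

In this sketch the fused formula is fixed by hand. In a GP setting, the expression inside `fused_similarity` would instead be a tree over the individual similarity measures and arithmetic operators, evolved against a fitness function measured on training data.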
