Intelligent GP fusion from multiple sources for text classification

This paper shows how citation-based information and structural content (e.g., title, abstract) can be combined to improve the classification of text documents into predefined categories. We evaluate eight similarity measures, five derived from the citation information of the collection and three derived from the structural content, and determine how they can be fused to improve classification effectiveness. To discover the best fusion framework, we apply Genetic Programming (GP) techniques. Our experiments with the ACM Computing Classification Scheme, using documents from the ACM Digital Library, indicate that GP can discover similarity functions superior to those based solely on a single type of evidence. Used with simple majority voting, the discovered similarity functions are more effective than both content-based and combination-based Support Vector Machine classifiers. We also ran experiments comparing GP with other fusion techniques, such as Genetic Algorithms (GA) and linear fusion. The empirical results show that GP discovered better similarity functions than GA or the other fusion techniques.
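To make the idea of a fused similarity function concrete, the following is a minimal sketch and not the authors' implementation. It assumes that a GP individual can be represented as a small expression tree over per-pair evidence scores, and that a document is labeled by majority vote among its most similar training documents. The evidence names (`cocitation`, `title_cosine`, `abstract_cosine`), the helper functions, the value of `k`, and all toy scores are hypothetical; the GP search itself (selection, crossover, mutation) is omitted.

```python
from collections import Counter
import operator

# Hypothetical function set for the fusion trees (an assumption, not from the paper).
OPS = {"+": operator.add, "*": operator.mul, "max": max}

def evaluate(tree, evidence):
    """Evaluate a fusion tree on the evidence scores of one document pair.

    A tree is either a leaf (the name of an evidence score) or a tuple
    (op, left, right) combining two subtrees.
    """
    if isinstance(tree, str):
        return evidence[tree]
    op, left, right = tree
    return OPS[op](evaluate(left, evidence), evaluate(right, evidence))

def classify(pair_evidence, labels, tree, k=3):
    """Label a document by majority vote among its k most similar training documents."""
    scored = sorted(
        ((evaluate(tree, ev), doc_id) for doc_id, ev in pair_evidence.items()),
        reverse=True,
    )
    votes = Counter(labels[doc_id] for _, doc_id in scored[:k])
    return votes.most_common(1)[0][0]

# One candidate similarity function, as GP might produce it:
# cocitation + (title_cosine * abstract_cosine)
tree = ("+", "cocitation", ("*", "title_cosine", "abstract_cosine"))

# Toy evidence between one unlabeled document and four training documents.
pair_evidence = {
    "d1": {"cocitation": 0.9, "title_cosine": 0.2, "abstract_cosine": 0.5},
    "d2": {"cocitation": 0.1, "title_cosine": 0.8, "abstract_cosine": 0.9},
    "d3": {"cocitation": 0.0, "title_cosine": 0.1, "abstract_cosine": 0.1},
    "d4": {"cocitation": 0.7, "title_cosine": 0.5, "abstract_cosine": 0.4},
}
labels = {"d1": "H.3", "d2": "H.3", "d3": "I.2", "d4": "I.2"}

predicted = classify(pair_evidence, labels, tree, k=3)
print(predicted)
```

In a full GP run, many such trees would be generated and scored by a fitness function on training data, with the fittest recombined over successive generations; here a single hand-written tree stands in for the discovered function.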
