Feed Distillation Using AdaBoost and Topic Maps

This paper reports on the experience gained from participating in the TREC 2007 Blog Track task ‘Feed Distillation’. For the submitted run, several classifiers are combined, following the idea of AdaBoost, to predict the relevance of a feed to a topic; the classifiers analyze title-, content- and splog-specific features. They use keywords retrieved from thesauri such as WordNet and Wortschatz, as well as from websites that provide a hierarchically organized ‘ontology’, such as the ‘Open Directory Project’ and the Yahoo Directory. The keywords are structured with Topic Maps according to ISO/IEC 13250:2000.
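
The combination step can be illustrated with a minimal sketch of standard discrete AdaBoost over a fixed pool of weak classifiers. This is not the implementation used for the run: the feature names (`title_hits`, `content_hits`, `splog_score`), the thresholds, and the toy feeds are all hypothetical and serve only to show how per-classifier votes might be weighted.

```python
import math

# Hypothetical weak classifiers: each maps a feed (a dict of features)
# to +1 (relevant) or -1 (not relevant). Thresholds are illustrative only.
def title_clf(feed):
    return 1 if feed["title_hits"] >= 1 else -1

def content_clf(feed):
    return 1 if feed["content_hits"] >= 3 else -1

def splog_clf(feed):
    return 1 if feed["splog_score"] < 0.5 else -1

def adaboost_combine(classifiers, samples, labels, rounds=3):
    """Discrete AdaBoost over a fixed classifier pool.

    Returns a list of (alpha, classifier) pairs, one per boosting round.
    """
    n = len(samples)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        # Pick the classifier with the lowest weighted training error.
        best, best_err = None, None
        for clf in classifiers:
            err = sum(w for w, x, y in zip(weights, samples, labels)
                      if clf(x) != y)
            if best_err is None or err < best_err:
                best, best_err = clf, err
        if best_err >= 0.5:
            break  # no classifier is better than chance on this weighting
        err = max(best_err, 1e-10)  # avoid division by zero
        alpha = 0.5 * math.log((1.0 - err) / err)
        ensemble.append((alpha, best))
        if best_err == 0:
            break  # a perfect classifier leaves nothing to reweight
        # Increase the weight of misclassified feeds, then renormalize.
        weights = [w * math.exp(-alpha * y * best(x))
                   for w, x, y in zip(weights, samples, labels)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, feed):
    """Sign of the alpha-weighted vote of the selected classifiers."""
    score = sum(alpha * clf(feed) for alpha, clf in ensemble)
    return 1 if score >= 0 else -1

# Toy training data (invented for illustration).
FEEDS = [
    {"title_hits": 1, "content_hits": 5, "splog_score": 0.1},
    {"title_hits": 0, "content_hits": 4, "splog_score": 0.2},
    {"title_hits": 1, "content_hits": 0, "splog_score": 0.9},
    {"title_hits": 0, "content_hits": 0, "splog_score": 0.3},
    {"title_hits": 1, "content_hits": 4, "splog_score": 0.8},
]
LABELS = [1, 1, -1, -1, -1]

if __name__ == "__main__":
    model = adaboost_combine([title_clf, content_clf, splog_clf], FEEDS, LABELS)
    for alpha, clf in model:
        print(f"{clf.__name__}: alpha={alpha:.3f}")
    print([predict(model, f) for f in FEEDS])
```

In this sketch the keyword lists derived from the thesauri and directories would feed into the hypothetical `title_hits` and `content_hits` counts; how the paper actually computes its features is not specified in the abstract.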
