Robust Models in Information Retrieval

Classification tasks in information retrieval deal with document collections of enormous size, which makes the ratio between the document set underlying the learning process and the set of unseen documents very small. With a ratio close to zero, the evaluation of a model-classifier-combination's generalization ability with leave-n-out-methods or cross-validation becomes unreliable: The generalization error of a complex model (with a more complex hypothesis structure) might underestimated compared to the generalization error of a simple model (with a less complex hypothesis structure). Given this situation, optimizing the bias-variance-tradeoff to select among these models will lead one astray. To address this problem we introduce the idea of robust models, where one intentionally restricts the hypothesis structure within the model formation process. We observe that -- despite the fact that such a robust model entails a higher test error -- its efficiency "in the wild" outperforms the model that would have been chosen normally, under the perspective of the best bias-variance-tradeoff. We present two case studies: (1) a categorization task, which demonstrates that robust models are more stable in retrieval situations when training data is scarce, and (2) a genre identification task, which underlines the practical relevance of robust models.

[1]  Trevor Hastie,et al.  The Elements of Statistical Learning , 2001 .

[2]  Sham M. Kakade,et al.  An Information Theoretic Framework for Multi-view Learning , 2008, COLT.

[3]  Adele E. Howe,et al.  Effects of web document evolution on genre classification , 2005, CIKM '05.

[4]  Massih-Reza Amini,et al.  The use of unlabeled data to improve supervised learning for text summarization , 2002, SIGIR '02.

[5]  Huan Liu,et al.  Feature subset selection bias for classification learning , 2006, ICML.

[6]  Thorsten Joachims,et al.  Learning to classify text using support vector machines - methods, theory and algorithms , 2002, The Kluwer international series in engineering and computer science.

[7]  Vladimir N. Vapnik,et al.  The Nature of Statistical Learning Theory , 2000, Statistics for Engineering and Information Science.

[8]  Rich Caruana,et al.  Getting the Most Out of Ensemble Selection , 2006, Sixth International Conference on Data Mining (ICDM'06).

[9]  Sung-Hyon Myaeng,et al.  Text genre classification with genre-revealing and subject-revealing features , 2002, SIGIR '02.

[10]  Arnold L. Rosenberg,et al.  Finding topic words for hierarchical summarization , 2001, SIGIR '01.

[11]  Tony R. Martinez,et al.  Bias and the probability of generalization , 1997, Proceedings Intelligent Information Systems. IIS'97.

[12]  Charles L. A. Clarke,et al.  Towards genre classification for IR in the workplace , 2006, IIiX.

[13]  S. Sathiya Keerthi,et al.  Large scale semi-supervised linear SVMs , 2006, SIGIR.

[14]  David Haussler,et al.  Learnability and the Vapnik-Chervonenkis dimension , 1989, JACM.

[15]  Marina Santini Common Criteria for Genre Classification : Annotation and Granularity , 2006 .

[16]  Sven Meyer Genre Classification of Web Pages User Study and Feasibility Analysis , 2004 .

[17]  V. Vapnik Estimation of Dependences Based on Empirical Data , 2006 .

[18]  Thomas G. Dietterich What is machine learning? , 2020, Archives of Disease in Childhood.

[19]  Jussi Karlgren,et al.  Recognizing Text Genres With Simple Metrics Using Discriminant Analysis , 1994, COLING.

[20]  Giorgio Valentini,et al.  Bias-Variance Analysis of Support Vector Machines for the Development of SVM-Based Ensemble Methods , 2004, J. Mach. Learn. Res..

[21]  W. Bruce Croft,et al.  Generating hierarchical summaries for web searches , 2003, SIGIR '03.

[22]  Cullen Schaffer Overfitting avoidance as bias , 2004, Machine Learning.

[23]  Benno Stein,et al.  Retrieval Models for Genre Classification , 2008, Scand. J. Inf. Syst..

[24]  Gideon S. Mann,et al.  Learning from labeled features using generalized expectation criteria , 2008, SIGIR '08.

[25]  Cullen Schaffer,et al.  A Conservation Law for Generalization Performance , 1994, ICML.

[26]  Kagan Tumer,et al.  Classifier ensembles: Select real-world applications , 2008, Inf. Fusion.

[27]  Yiming Yang,et al.  RCV1: A New Benchmark Collection for Text Categorization Research , 2004, J. Mach. Learn. Res..