Learning When Concepts Abound

Many learning tasks, such as large-scale text categorization and word prediction, can benefit from efficient training and classification when the number of classes, in addition to the number of instances and features, is large, that is, in the thousands and beyond. We investigate learning sparse class indices to address this challenge. An index is a mapping from features to classes. We compare index-learning methods against other techniques, including one-versus-rest and top-down classification using perceptrons and support vector machines. We find that index learning is highly advantageous for space and time efficiency, during both training and classification. Moreover, this approach yields similar and at times better accuracy. On problems with hundreds of thousands of instances and thousands of classes, the index is learned in minutes, while other methods can take hours or days. As we explain, the design of the learning update makes it convenient to constrain each feature to connect to only a small subset of the classes in the index. This constraint is crucial for scalability. Given an instance with l active (positive-valued) features, each connecting on average to d classes in the index (on the order of tens in our experiments), update and classification take O(dl log(dl)) time.
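
To make the cost of classification concrete, the following is a minimal sketch of scoring and ranking classes with a sparse index. It is an illustration under stated assumptions, not the method of the paper: the index is assumed to be a dictionary from each feature to a small dictionary of class weights, a class score is assumed to be the sum of feature value times connection weight, and the names score_classes, index, and instance are hypothetical. The learning update is not specified in the abstract and is not sketched here.

```python
from collections import defaultdict


def score_classes(index, instance):
    """Rank classes for one instance using a sparse feature-to-class index.

    index:    dict mapping feature -> {class: weight} (assumed layout)
    instance: dict mapping active feature -> positive value
    Returns a list of (class, score) pairs, highest score first.
    """
    scores = defaultdict(float)
    # Visit only the connections of the active features: about d * l entries
    # for l active features that each connect to about d classes.
    for feature, value in instance.items():
        for cls, weight in index.get(feature, {}).items():
            scores[cls] += value * weight
    # Sorting at most d * l candidate classes gives the O(dl log(dl)) bound.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy index with made-up features, classes, and weights.
index = {
    "ball": {"sports": 0.5, "toys": 0.25},
    "goal": {"sports": 0.5, "business": 0.125},
    "market": {"business": 0.75},
}
instance = {"ball": 1.0, "goal": 1.0}
ranked = score_classes(index, instance)
print(ranked)
```

Because each feature keeps only a few class connections, the work depends on d and l rather than on the total number of classes, which is the property the abstract identifies as crucial for scalability.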
