Active Learning with Strong and Weak Views: A Case Study on Wrapper Induction

Multi-view learners reduce the need for labeled data by exploiting disjoint subsets of features (views), each of which is sufficient for learning. Such algorithms assume that every view is a strong view (i.e., the target concept can be learned perfectly in each view). We extend the multi-view framework by introducing a novel algorithm, Aggressive Co-Testing, that exploits both strong and weak views; in a weak view, one can learn only a concept that is strictly more general or more specific than the target concept. Aggressive Co-Testing uses the weak views both to detect the most informative examples in the domain and to improve the accuracy of the predictions. In a case study on 33 wrapper induction tasks, our algorithm requires significantly fewer labeled examples than existing state-of-the-art approaches.
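
The query-selection idea described above can be illustrated with a minimal sketch. It is not the algorithm as specified in the paper: it assumes that each view is available as a callable returning a prediction, that a candidate query is any unlabeled example on which the strong views disagree, and that the weak view ranks those candidates by how many strong-view predictions it contradicts. The names `select_query`, `strong_views`, and `weak_view` are hypothetical.

```python
from typing import Callable, Hashable, Iterable, Optional, Sequence

# Hypothetical type: a view's current hypothesis, mapping an example to a label.
Predictor = Callable[[object], Hashable]


def select_query(
    unlabeled: Iterable[object],
    strong_views: Sequence[Predictor],
    weak_view: Optional[Predictor] = None,
) -> Optional[object]:
    """Pick the unlabeled example to send to the user for labeling.

    Assumption (not taken from the abstract): candidates are the examples
    on which the strong views disagree, and the weak view is used only to
    rank those candidates.
    """
    best, best_score = None, 0
    for x in unlabeled:
        strong_preds = [h(x) for h in strong_views]
        # Skip examples on which all strong views agree.
        if len(set(strong_preds)) < 2:
            continue
        score = 1  # base score for any disagreement among strong views
        if weak_view is not None:
            w = weak_view(x)
            # Reward examples where the weak view contradicts more strong views.
            score += sum(1 for p in strong_preds if p != w)
        if score > best_score:  # ties keep the earliest candidate
            best, best_score = x, score
    return best
```

With two strong views, an example on which the weak view contradicts both of them outranks one on which it sides with either, so the learner asks about the example where at least two of the three hypotheses must be wrong. If no example produces a disagreement among the strong views, the function returns `None`.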
