Robust Logistic Regression using Shift Parameters (Long Version)

Annotation errors can significantly hurt classifier performance, yet datasets are only growing noisier with the increased use of Amazon Mechanical Turk and techniques like distant supervision that automatically generate labels. In this paper, we present a robust extension of logistic regression that incorporates the possibility of mislabelling directly into the objective. Our model can be trained through nearly the same means as logistic regression, and retains its efficiency on high-dimensional datasets. Through named entity recognition experiments, we demonstrate that our approach can provide a significant improvement over the standard model when annotation errors are present.
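The abstract describes the approach only at a high level. As a rough illustration of the idea suggested by the title, the sketch below implements logistic regression augmented with a per-example shift parameter added to each training example's logit, with an L1 penalty so that only a few suspected mislabelled points receive a nonzero shift. The exact model form, penalty, alternating proximal-gradient optimisation, and all names here (fit_shifted_logreg, soft_threshold, the choice of lam) are illustrative assumptions, not details taken from the paper.

```python
# Illustrative sketch only (assumed formulation): logistic regression with a
# per-example shift gamma_i added to the logit and an L1 penalty on gamma, so
# that only examples the model suspects of being mislabelled get a nonzero shift.
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def soft_threshold(v, t):
    # Proximal operator of the L1 norm: shrink each entry toward zero by t.
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)


def fit_shifted_logreg(X, y, lam=0.5, n_iter=500):
    """Minimise  -sum_i [y_i log p_i + (1 - y_i) log(1 - p_i)] + lam * ||gamma||_1,
    where p_i = sigmoid(x_i . beta + gamma_i), by alternating gradient steps on
    beta with proximal-gradient steps on gamma."""
    n, d = X.shape
    beta = np.zeros(d)
    gamma = np.zeros(n)
    lr_beta, lr_gamma = 1.0 / n, 1.0   # conservative step sizes for this sketch
    for _ in range(n_iter):
        # Gradient step on the weights with the shifts held fixed.
        p = sigmoid(X @ beta + gamma)
        beta -= lr_beta * (X.T @ (p - y))
        # Proximal-gradient step on the shifts with the weights held fixed.
        p = sigmoid(X @ beta + gamma)
        gamma = soft_threshold(gamma - lr_gamma * (p - y), lr_gamma * lam)
    return beta, gamma


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 200, 5
    X = rng.normal(size=(n, d))
    true_beta = rng.normal(size=d)
    y = (rng.random(n) < sigmoid(X @ true_beta)).astype(float)
    y[:20] = 1.0 - y[:20]              # flip 10% of the labels to mimic annotation noise
    beta, gamma = fit_shifted_logreg(X, y)
    flagged = set(np.argsort(-np.abs(gamma))[:20].tolist())
    overlap = len(flagged & set(range(20)))
    print(f"{overlap} of the 20 flipped labels are among the 20 largest |gamma_i|")
```

In this kind of formulation, examples whose fitted |gamma_i| is large are the ones the model effectively explains away as mislabelled, so sorting by |gamma_i| gives a simple label-error ranking; the penalty strength lam would normally be tuned, e.g. by cross-validation.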
