Semi-parametric analysis of multi-rater data

Datasets labeled subjectively by several experts are becoming more common in tasks such as biological text annotation, where class definitions are necessarily somewhat subjective. Standard classification and regression models are not designed for multiple labels per instance, so a pre-processing step (normally assigning the majority class) is typically applied first. We propose Bayesian models for classification and ordinal regression that naturally incorporate multiple expert opinions in defining predictive distributions. The models use Gaussian process priors, which makes them highly flexible and particularly well suited to text-based problems, where the number of covariates can far exceed the number of data instances. We show that using all labels rather than just the majority improves performance on a recent biological dataset.
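
The contrast between the majority-class pre-processing step and retaining every expert label can be illustrated with a minimal Python sketch. It does not implement the proposed Gaussian process models; the ratings, the class names, and the helpers `majority_label` and `label_distribution` are hypothetical and serve only to show what information the majority vote discards.

```python
from collections import Counter

def majority_label(labels):
    """Collapse one item's expert labels to the single most common label."""
    # With an odd number of raters and two classes there are no ties.
    return Counter(labels).most_common(1)[0][0]

def label_distribution(labels, classes):
    """Keep all expert labels as empirical class proportions."""
    counts = Counter(labels)
    return {c: counts[c] / len(labels) for c in classes}

# Hypothetical ratings: three experts label each sentence.
classes = ["fact", "speculation"]
ratings = {
    "sentence_1": ["fact", "fact", "fact"],
    "sentence_2": ["fact", "speculation", "fact"],
    "sentence_3": ["speculation", "fact", "speculation"],
}

majority = {item: majority_label(labs) for item, labs in ratings.items()}
soft = {item: label_distribution(labs, classes) for item, labs in ratings.items()}

for item in ratings:
    print(item, majority[item], soft[item])
```

In this toy data, sentence_1 and sentence_2 receive the same majority label even though the experts agree unanimously on only the first; the per-class proportions preserve that difference, which is the kind of information a model that uses all labels can exploit.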
