We’re Not in Kansas Anymore: Detecting Domain Changes in Streams

Domain adaptation, the problem of adapting a natural language processing system trained in one domain to perform well in a different domain, has received significant attention. This paper addresses an important problem for deployed systems that has received little attention: detecting when such adaptation is needed by a system operating in the wild, i.e., one performing classification over a stream of unlabeled examples. Our method combines A-distance, a metric for detecting shifts in data streams, with classification margins to detect domain shifts. We empirically show effective domain shift detection on a variety of data sets and shift conditions.
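To make the idea concrete, here is a minimal sketch of margin-based shift detection. It is an illustration under stated assumptions, not the paper's implementation. It takes the A-distance between two samples to be twice the largest difference in empirical probability over a family of sets, and uses fixed histogram bins over the margin values as that family. The function names (`a_distance`, `detect_shift`), the bin edges, the window size, and the threshold are all hypothetical choices made for this example.

```python
from bisect import bisect_right


def a_distance(ref, cur, edges):
    """Empirical A-distance over histogram bins (assumed set family):
    2 * max over bins of |P_ref(bin) - P_cur(bin)|."""
    def hist(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            # bisect_right gives the index of the bin that contains v
            counts[bisect_right(edges, v)] += 1
        return [c / len(values) for c in counts]

    p, q = hist(ref), hist(cur)
    return 2 * max(abs(a - b) for a, b in zip(p, q))


def detect_shift(margins, window=100, edges=(0.5, 1.0, 1.5), threshold=0.5):
    """Compare a fixed reference window (the first `window` margins) with a
    sliding window over the rest of the stream. Return the index of the last
    example seen when the distance first exceeds `threshold`, or None.
    All default values are illustrative assumptions."""
    ref = margins[:window]
    for start in range(window, len(margins) - window + 1):
        cur = margins[start:start + window]
        if a_distance(ref, cur, edges) > threshold:
            return start + window - 1
    return None


# Synthetic stream: confident margins, then low margins after example 300.
stream = [2.0] * 300 + [0.2] * 300
print(detect_shift(stream))
```

Only the classifier's margins are needed, so the sketch requires no labels for the incoming stream. In practice the threshold and window size would have to be tuned to trade detection delay against false alarms.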
