Active Learning for Entity Filtering in Microblog Streams

Monitoring the reputation of entities such as companies or brands in microblog streams (e.g., Twitter) starts by selecting the mentions that actually refer to the entity of interest. Entity names are often ambiguous (e.g., "Jaguar" or "Ford"), and effective methods for filtering out non-relevant mentions often rely on background knowledge obtained from domain experts. Manual annotation by experts, however, is costly. We therefore approach entity filtering with active learning, with the aim of reducing the annotation load on experts. To this end, we use a strong passive baseline and analyze different sampling methods for selecting samples for annotation. We find that margin sampling, an informative type of sampling that considers the distance to the hyperplane used for class separation, can be used effectively for entity filtering and can significantly reduce the cost of annotating initial training data.
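To make the idea of margin sampling concrete, the following is a minimal sketch, not the authors' implementation. It assumes a linear classifier whose separating hyperplane is given by a weight vector `w` and bias `b`, and a pool of unlabeled mentions already encoded as feature vectors. The function names, the two-dimensional features, and the toy values are all hypothetical. The sketch ranks pool items by their unsigned distance to the hyperplane and returns the closest ones as candidates for expert annotation.

```python
import math
from typing import List, Sequence


def margin_distance(w: Sequence[float], b: float, x: Sequence[float]) -> float:
    """Unsigned distance from point x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return abs(score) / norm


def margin_sampling(
    w: Sequence[float],
    b: float,
    pool: Sequence[Sequence[float]],
    k: int = 1,
) -> List[int]:
    """Return the indices of the k pool items closest to the hyperplane.

    Items near the decision boundary are the ones the current classifier
    is least certain about, so they are proposed for annotation first.
    """
    ranked = sorted(range(len(pool)), key=lambda i: margin_distance(w, b, pool[i]))
    return ranked[:k]


# Hypothetical toy setup: a 2-D linear classifier and four unlabeled mentions.
w = [1.0, -1.0]
b = 0.0
pool = [
    [3.0, 0.0],  # far from the boundary
    [1.0, 0.9],  # very close to the boundary
    [0.0, 2.0],  # far, on the other side
    [2.0, 1.5],  # moderately close
]

selected = margin_sampling(w, b, pool, k=2)
print(selected)  # indices of the two most uncertain mentions
```

In an active learning loop, the selected items would be labeled by an expert, added to the training set, and the classifier retrained before the next round of selection.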
