Paper information - Mining StackOverflow to Filter Out Off-Topic IRC Discussion

Internet Relay Chat (IRC) is a tool commonly used by Open Source developers. Developers use IRC channels to discuss programming-related problems, but much of the discussion is irrelevant and off-topic. Essentially, if we treat IRC discussions like email messages and apply spam filtering, we can try to filter out the spam (the off-topic discussions) from the ham (the programming discussions). Yet we need labelled data, which unfortunately takes time to curate. To filter out off-topic discussions without costly curation, we need positive and negative data sources. Online discussion forums, such as Stack Overflow, are very effective for solving programming problems. Because Stack Overflow engages in open data, its data becomes a powerful source of labelled text about programming. This work shows that we can train classifiers using Stack Overflow posts as positive examples of on-topic programming discussion. YouTube video comments, notorious for their lack of quality, serve as a training set of off-topic discussion. By exploiting these datasets, accurate classifiers can be built, tested, and evaluated that require very little effort for end-users to deploy and exploit.

Abram Hindle | Shaiful Alam Chowdhury
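
The idea in the abstract (treat programming posts as "ham" and low-quality comments as "spam", then classify chat lines) can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say which classifiers or features were used, so a multinomial Naive Bayes over word counts is assumed here purely for illustration. The class name `NaiveBayes`, the helper `tokenize`, and the eight training sentences are all hypothetical stand-ins for real Stack Overflow posts and YouTube comments.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word-like tokens."""
    return re.findall(r"[a-z0-9_+#]+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-alpha smoothing (illustrative only)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.word_counts = {}        # label -> Counter of token frequencies
        self.doc_counts = Counter()  # label -> number of training documents
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            self.doc_counts[label] += 1
            counts = self.word_counts.setdefault(label, Counter())
            for tok in tokenize(text):
                counts[tok] += 1
                self.vocab.add(tok)
        return self

    def log_scores(self, text):
        total_docs = sum(self.doc_counts.values())
        tokens = tokenize(text)
        scores = {}
        for label, counts in self.word_counts.items():
            # Log prior plus smoothed log likelihood of each known token.
            score = math.log(self.doc_counts[label] / total_docs)
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            for tok in tokens:
                if tok in self.vocab:  # tokens never seen in training are skipped
                    score += math.log((counts[tok] + self.alpha) / denom)
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Made-up stand-ins for Stack Overflow posts (positive, on-topic examples).
on_topic = [
    "How do I fix a null pointer exception in my Java class",
    "Python list comprehension throws a syntax error on this line",
    "git rebase fails with a merge conflict in the makefile",
    "segfault when calling malloc inside a loop in C",
]
# Made-up stand-ins for YouTube comments (negative, off-topic examples).
off_topic = [
    "lol this video is so funny",
    "first comment love this song",
    "who is watching this in the morning lol",
    "best music video ever subscribe to my channel",
]

model = NaiveBayes().fit(
    on_topic + off_topic,
    ["on"] * len(on_topic) + ["off"] * len(off_topic),
)

# Hypothetical IRC lines; keep only those classified as on-topic.
irc_lines = [
    "why does my python function throw an exception",
    "lol love this song",
]
kept = [line for line in irc_lines if model.predict(line) == "on"]
print(kept)
```

With a realistic corpus, the two lists would be replaced by many thousands of posts and comments, and the same `fit` and `predict` calls would apply. No manual labelling of IRC logs is needed, because the labels come from the source of each text.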
