An empirical study on the risks of using off-the-shelf techniques for processing mailing list data

Mailing list repositories contain valuable information about the history of a project. Researchers are starting to mine this information to support developers and maintainers of long-lived software projects. However, this information exists as unstructured data that requires special processing before it can be studied. In this paper, we identify several challenges that arise when off-the-shelf techniques are used to process mailing list data. Our study highlights the importance of processing mailing list data properly to ensure accurate research results.
