Effective Listings of Function Stop words for Twitter

Many words in documents recur very frequently but carry little meaning of their own, because they serve mainly to join other words into sentences. It is commonly understood that these stop words do not contribute to the context or content of textual documents. Because they occur so frequently, their presence in text mining is an obstacle to understanding the content of the documents. To eliminate this bias, most text mining software and approaches use a stop word list to identify and remove such words. However, developing such a stop word list is difficult, and the resulting lists are inconsistent between textual sources. The problem is further aggravated by sources such as Twitter, whose messages are highly repetitive or similar in nature. In this paper, we examine the original work that uses term frequency, inverse document frequency and term adjacency to develop a stop word list for the Twitter data source. We propose a new technique that uses combinatorial values as an alternative measure for effectively listing stop words.
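To make the frequency-based measures concrete, the following is a minimal sketch of how term frequency and inverse document frequency could be used to rank stop word candidates in a small set of tweets. It is an illustration only, not the method evaluated in this paper: the function name `rank_stop_word_candidates`, the whitespace tokenization, the unsmoothed IDF formula, and the sample tweets are all assumptions made for the example, and term adjacency and the combinatorial measure are not covered.

```python
import math
from collections import Counter


def rank_stop_word_candidates(tweets, top_n=5):
    """Rank terms by low IDF, breaking ties by high total term frequency.

    Hypothetical helper for illustration; each tweet is treated as one document.
    """
    # Assumption: lowercase whitespace tokenization is enough for the sketch.
    tokenized = [tweet.lower().split() for tweet in tweets]
    n_docs = len(tokenized)

    term_freq = Counter()  # total occurrences of each term in the corpus
    doc_freq = Counter()   # number of tweets that contain each term
    for tokens in tokenized:
        term_freq.update(tokens)
        doc_freq.update(set(tokens))

    scored = []
    for term, df in doc_freq.items():
        # Unsmoothed IDF: 0.0 for a term that appears in every tweet.
        idf = math.log(n_docs / df)
        scored.append((term, term_freq[term], idf))

    # Lowest IDF first, then highest frequency, then alphabetical order.
    scored.sort(key=lambda row: (row[2], -row[1], row[0]))
    return scored[:top_n]


# Made-up sample tweets, used only to exercise the function.
tweets = [
    "the earthquake is shaking the city",
    "the election is on the news",
    "rain is falling in the city",
    "the match is on tonight",
]
candidates = rank_stop_word_candidates(tweets, top_n=2)
for term, tf, idf in candidates:
    print(f"{term}\ttf={tf}\tidf={idf:.3f}")
```

In this toy corpus the terms that appear in every tweet receive an IDF of zero and rise to the top of the ranking, which is the behaviour a frequency-based stop word list relies on. On real Twitter data, where many messages are near duplicates, content words can also reach very low IDF values, which is one reason an alternative measure may be needed.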
